Practical Glossing by Prioritised Tiling

Victor Poznanski, Pete Whitelock, Jan IJdens, Steffan Corley

Sharp Laboratories of Europe Ltd.

Oxford Science Park, Oxford, OX4 4GA

United Kingdom

{vp,pete,jan,steffan}@sharp.co.uk

Abstract

We present the design of a practical context-sensitive glosser, incorporating current techniques for lightweight linguistic analysis based on large-scale lexical resources. We outline a general model for ranking the possible translations of the words and expressions that make up a text. This information can be used by a simple resource-bounded algorithm, of complexity O(n log n) in sentence length, that determines a consistent gloss of best translations. We then describe how the results of the general ranking model may be approximated using a simple heuristic prioritisation scheme. Finally we present a preliminary evaluation of the glosser's performance.

1 Introduction

In a lexicalist MT framework such as Shake-and-Bake (Whitelock, 1994), translation equivalence is defined between collections of (suitably constrained) lexical material in the two languages. Such an approach has been shown to be effective in the description of many types of complex bilingual equivalence. However, the complexity of the associated parsing and generation phases leaves a system of this type some way from commercial exploitation. The parsing phase that is needed to establish adequate constraints on the words is of cubic complexity, while the most general generation algorithm, needed to order the words in the target text, is O(n^4) (Poznanski et al., 1995). In this paper, we show how a novel application domain, glossing, can be explored within such a framework, by omitting generation entirely and replacing syntactic parsing by a simple combination of morphological analysis and tagging. The poverty of constraints established in this way, and the consequent inaccuracy in translation, is mitigated by providing a menu of alternatives for each gloss. The gloss is automatically updated in the light of user choices. While the availability of alternatives is generally desirable in automatic translation, it is the limitation to glossing which makes it feasible to manage the consistency maintenance required.

Glossing as a technique for elucidating the grammar and lexis of a second language text is well-known from the linguistics literature. Each morpheme in the object language is provided with its meta-language equivalent aligned beneath it. Such a glosser may be used as a tool for second-language improvement (Nerbonne and Smit, 1996), and thus provide an educational alternative to the passive consumption of a (usually low quality) translation. We envisage the glosser's primary use as a tool for cross-language information gathering, and thus think it best not to display grammatical information. Our glosser improves on the use of printed or even on-line dictionaries in several ways:

• The system performs lemmatisation for the user.

• Lightweight analysis resolves part-of-speech ambiguities in context.

• Multi-word expressions, including discontinuous and variable ones, are detected.

• A degree of consistency between system and user choices is maintained.

Figure 1: An English to Japanese Gloss

The glosser attempts to find all plausible equivalents for the words and multi-word expressions that constitute a text, displaying the most appropriate consistent subset as its first choice and the remainder within menus. Consistency is maintained by treating source language lexical material as resources that are consumed by the matching of equivalences, so that the latter partially tile the text¹. Our model has much in common with that of Alshawi (1996), though our linguistic representations are relatively impoverished. Our aim is not true translation but the use of large existing bilingual lexicons for very wide-coverage glossing. We have discovered that the effect of tiling with a large ordered set of detailed equivalences is to provide a close approximation to richer schemes for syntactic analysis.

An example English-Japanese gloss as produced by our system is shown in Figure 1. Multi-word collocations are underlined and discontinuous ones are also given a number (and colour) to facilitate identification. Note how stemmed … from is a discontinuous collocation surrounding the continuous collocation in part. The pop-up menu shows the alternatives for fruit, by sense at the top-level with run-offs to synonyms, and at the bottom an option to access the machine-readable version of 'Genius', a published English-Japanese dictionary.

¹ Equivalences are not only consumers of source language resources but also producers of target language ones. In glossing, the production of target language resources need not be complete: every word needs a translation, but not every word needs a gloss. Tiling thus need only be partial.

The structure of this paper is as follows. In 2.1 we outline the basic operation of the system, introducing our representation of natural language collocations as key descriptors, and give a probabilistic interpretation for these in 2.2. Section 3 describes the algorithm for tiling a sentence using key descriptors, and goes on to describe a series of heuristics which approximate the full probabilistic model. Section 4 presents the results of a preliminary evaluation of the glosser's performance. Finally in section 5 we give our conclusions and make some suggestions for future improvements to the system.

2 A Basic Model of a Glosser

To gloss a text, we first segment it into sentences and use the POS tag probabilities assigned by a bigram tagger to order the results of morphological analysis. We obtain a complete tag probability distribution by using the Forwards-Backwards algorithm (see Charniak, 1993) and eliminate only those tags whose probability falls below a certain threshold. Each morphological analysis compatible with one of the remaining tags is passed on to the next phase, together with its associated tag probabilities.
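The pruning step can be illustrated with a small sketch. It computes per-word tag posteriors for a toy bigram HMM using the forward-backward recursion and then drops tags below a threshold. The function names, the two-tag model and every probability in it are invented for illustration; the paper does not describe the tagger at this level of detail, and the sketch assumes every word has at least one non-zero emission.

```python
def tag_posteriors(words, tags, start, trans, emit):
    """Forward-backward over a bigram HMM: one {tag: posterior} dict per word."""
    n = len(words)
    # Forward pass: fwd[i][t] = P(w_1..w_i, tag_i = t)
    fwd = [{t: start[t] * emit[t].get(words[0], 0.0) for t in tags}]
    for i in range(1, n):
        fwd.append({
            t: emit[t].get(words[i], 0.0)
            * sum(fwd[i - 1][p] * trans[p][t] for p in tags)
            for t in tags
        })
    # Backward pass: bwd[i][t] = P(w_{i+1}..w_n | tag_i = t)
    bwd = [dict.fromkeys(tags, 1.0) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        bwd[i] = {
            t: sum(trans[t][q] * emit[q].get(words[i + 1], 0.0) * bwd[i + 1][q]
                   for q in tags)
            for t in tags
        }
    total = sum(fwd[-1][t] for t in tags)  # P(w_1..w_n); assumed non-zero
    return [{t: fwd[i][t] * bwd[i][t] / total for t in tags} for i in range(n)]


def prune(posteriors, threshold):
    """Keep only the tags whose posterior reaches the threshold."""
    return [{t: p for t, p in dist.items() if p >= threshold}
            for dist in posteriors]


# Toy model: all numbers below are made up for the example.
TAGS = ("N", "V")
START = {"N": 0.6, "V": 0.4}
TRANS = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
EMIT = {"N": {"fish": 0.5, "swim": 0.1}, "V": {"fish": 0.2, "swim": 0.6}}

posteriors = tag_posteriors(["fish", "swim"], TAGS, START, TRANS, EMIT)
surviving = prune(posteriors, threshold=0.2)
print(surviving)
```

With these toy numbers only the N reading of the first word and the V reading of the second survive a 0.2 threshold; a lower threshold would pass both readings of each word on to the matching phase.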

The next phase identifies source words and collocations by matching them against key descriptors, which are variable length, possibly discontinuous, word or morpheme n-grams. A key descriptor is written:

W1_R1 <d1> W2_R2 <d2> … <dn-1> Wn_Rn

where Wi_Ri means a word Wi with morpho-syntactic restrictions Ri, and Wi_Ri <di> Wi+1_Ri+1 means Wi+1_Ri+1 must occur within di words to the right of Wi_Ri. For example, a key descriptor intended to match the collocation in a fragment like a procedure used by many researchers for describing the effects … might be:

procedure_N <5> for_PREP <1> +ing_V0
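A minimal sketch of how such a descriptor could be matched against the output of morphological analysis is given below. The `Morph` record, the parser and the matcher are hypothetical names, not the system's actual data structures. The sketch assumes that distances are counted in surface words, that a bound morpheme such as +ing carries the index of the word it belongs to, and that each element must lie strictly to the right of the previous one.

```python
from typing import NamedTuple


class Morph(NamedTuple):
    word: int  # index of the surface word this morpheme belongs to
    form: str  # lemma or bound morpheme, e.g. "procedure" or "+ing"
    tag: str   # morpho-syntactic restriction satisfied by this reading


def parse_key_descriptor(text):
    """Split 'w1_R1 <d1> w2_R2 ...' into [(form, tag), ...] and [d1, ...]."""
    elements, gaps = [], []
    for part in text.split():
        if part.startswith("<") and part.endswith(">"):
            gaps.append(int(part[1:-1]))
        else:
            form, tag = part.rsplit("_", 1)
            elements.append((form, tag))
    return elements, gaps


def match(descriptor, morphs):
    """Return every tuple of indices into morphs that satisfies the descriptor."""
    elements, gaps = parse_key_descriptor(descriptor)
    results = []

    def extend(chosen):
        k = len(chosen)
        if k == len(elements):
            results.append(tuple(chosen))
            return
        for i, m in enumerate(morphs):
            if (m.form, m.tag) != elements[k]:
                continue
            if k > 0:
                # Distance in words from the previously matched element.
                dist = m.word - morphs[chosen[-1]].word
                if not 1 <= dist <= gaps[k - 1]:
                    continue
            extend(chosen + [i])

    extend([])
    return results


# "a procedure used by many researchers for describing ..."
morphs = [
    Morph(0, "a", "DET"), Morph(1, "procedure", "N"), Morph(2, "use", "V"),
    Morph(3, "by", "PREP"), Morph(4, "many", "ADJ"), Morph(5, "researcher", "N"),
    Morph(6, "for", "PREP"), Morph(7, "describe", "V"), Morph(7, "+ing", "V0"),
]
hits = match("procedure_N <5> for_PREP <1> +ing_V0", morphs)
print(hits)
```

In this example, for lies exactly five words to the right of procedure, so the descriptor matches once; tightening the first distance to <4> would yield no match.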

2.1 Collocations and Key Descriptors

We posit the existence of a collocation whenever two or more words or morphemes occur in a fixed syntactic relationship more frequently than would be expected by chance, and are ideally translated together.

As a linguistic representation of collocations, key descriptors are clearly inadequate. A more correct representation would characterise the stretches spanned by the <di> as being of certain categories, or better, that the Wi form a connected piece of dependency representation. However, by:

• expanding the notion of collocation to include a variety of closed-class morphemes,

• refining morpho-syntactic restrictions within the limitations of our current architecture,

• using a very thorough dictionary of such collocations, and

• prioritising key descriptors and using their elements as consumable resources,

we find that the application of key descriptors gives a satisfactory approximation to plausible dependency structures.

Two major carriers of syntactic dependency information in language are category/word-order and closed class elements. Our notion of collocation embraces the full array of closed-class elements that may be associated with a word in a particular dependency structure. This includes governed prepositions and adverbial particles, light verbs, infinitival markers and bound elements such as participial, tense and case affixes. The morphological analysis phase recognises the component structure of complex words and splits them into resources that may be consumed independently.

Those aspects of dependency structure that are not signalled collocationally are often recognisable from particular category sequences and thus can be detected by an n-gram tagger. For instance, in English, transitivity is not marked by case or adposition, but by the immediate adjacency of predicate and noun phrase. By distinguishing transitive and intransitive verb tags, we provide further constraints to narrow the range of dependency structures.

2.2 A Probabilistic Characterisation of Collocation

Key descriptors require prioritisation for the tiling phase. In order to effect this, we associate a probabilistic ranking function, f_kd, with each key descriptor kd.

Consider a collocation such as an English transitive phrasal verb, e.g. make up. We may collect all the instances where the component words occur in a sentence in this order with appropriate constraints. By classifying each as a positive or negative instance of this collocation (in any sense), we can estimate a probability distribution $f_{\text{make\_VT}\,\langle d\rangle\,\text{up\_ADV}}(d)$ over the number of words, d, separating the elements of this collocation. Suppose then that the tagger has assigned tag probability distributions $P^s_{\text{make}}$ and $P^s_{\text{up}}$ to the two elements separated by d words in a text fragment, s. The probability that the key descriptor make_VT <d> up_ADV correctly matches s is given by:

$$P(\text{`make\_VT}\,\langle d\rangle\,\text{up\_ADV'}, s) \equiv P^s_{\text{make}}(\text{VT}) \cdot P^s_{\text{up}}(\text{ADV}) \cdot f_{\text{`make\_VT}\,\langle d\rangle\,\text{up\_ADV'}}(d)$$

More generally,

$$P(kd, s) \cong \left[\prod_{i=1}^{n} P^s_{w_i}(r_i)\right] \cdot f_{kd}(d_1, d_2, \ldots, d_{n-1}) \qquad \text{Eqn (1)}$$

where $kd \equiv w_1\_r_1\ \langle d_1\rangle\ w_2\_r_2\ \langle d_2\rangle\ \ldots\ \langle d_{n-1}\rangle\ w_n\_r_n$.
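As a worked instance of equation 1, the sketch below multiplies the tag probabilities of the matched elements by the value of the ranking function at the observed separations. The function name `match_probability`, the shape of the toy ranking function and all numeric values are assumptions made for the example; in particular the table for make up is invented, not estimated from a corpus.

```python
from math import prod


def match_probability(tag_probs, f_kd, separations):
    """Equation 1: product of the tag probabilities times f_kd(d_1, ..., d_{n-1})."""
    return prod(tag_probs) * f_kd(*separations)


def f_make_up(d):
    """Hypothetical ranking function for make_VT <d> up_ADV."""
    # Falls slowly over the first few words, then sharply (invented values).
    return {1: 0.9, 2: 0.8, 3: 0.7}.get(d, 0.05)


# Tagger gives P(make = VT) = 0.9 and P(up = ADV) = 0.6 in this fragment.
p_near = match_probability([0.9, 0.6], f_make_up, [2])
p_far = match_probability([0.9, 0.6], f_make_up, [6])
print(p_near, p_far)
```

With identical tag probabilities, the match at separation 2 scores 0.9 · 0.6 · 0.8 = 0.432, while the same pair six words apart scores far lower because the ranking function has dropped off.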

A typical graph of f for the phrasal verb case is depicted in Figure 2. In such cases, we observe that the probability falls slowly over the space of a few words and then sharply at a given d. In other cases, the slope is gentler, but for the vast majority of collocations it decreases monotonically.

Figure 2: A Typical Frequency Distribution for a Verb Particle Collocation (horizontal axis: separation, d; vertical axis: probability of correct matches, f)

The overall downward trend in f can be attributed to the interaction of two factors. On the one hand, the total number of true instances follows the distribution of length of phrases that may intervene (in the case of make up, noun phrases), i.e. it falls with increasing separation. On the other, the absolute number of false instances remains relatively constant as d varies, and thus increases as a proportion of the total. The fall in true instances is accentuated by the tendency for languages to order dependent phrases with the smallest ones nearest to the head², and is thus most marked in the phrasal verb case.

As the number of elements in the equivalence goes up, so does the dimensionality of the frequency distribution. While the multiplied tag probabilities must decrease, the f values increase more, since the corpus evidence tells us that a match comprising more elements is nearly always the correct one.

In section 3.3, we show how we heuristically approximate the various features of f.

3 Glossing as Resource-bounded, Prioritised, Partial Tiling

We prioritise key descriptors to reflect their appropriateness. We then use this ordering to tile the source sentence with a consistent set of key descriptors, and hence their translations. The following sections describe the algorithm.

3.1 General Algorithm

The bilingual equivalences are treated as a simple "one-shot" production system, which annotates a source analysis with all of the possible translations. The tiling algorithm selects the best of these translations by treating bilingual equivalences as consumers competing for a resource (the right to use a word as part of a translation). In order to make the system efficient, we avoid a global view of linguistic structure. Instead, we assume that every equivalence carries enough information with it to decide whether it has the right to lock (claim) a resource. Competing consumers are simply compared in order to decide which has priority. To support this algorithm, it is necessary to associate with every translation a justification: the source items from which the target item was derived.

² This observation has been extensively explored (in a phrase structure framework) by Hawkins (1994).

    b := list of words;                 // the words in the sentence
    ls := set of consumers;             // successfully applied bilingual equivalences
    lc := sort(ls, b, priority_fn);     // sort consumers according to priority_fn
    for s in lc do
        words := justifications(s);     // the words from which the equivalence was derived
        if resources_free(words)        // have the words been claimed by a bilingual equivalence?
        then
            lock_resources(words)       // mark the words as consumed
            mark_as_best(s)             // mark bilingual equivalence as best translation fragment
        end if
    done
    result := empty list;               // collect and return best translations
    for s in lc
        if marked_as_best(s)
            append(s, result);
    return result

Figure 3: Partial Tiling Algorithm

The algorithm for determining the set of best translations or translation fringe is portrayed in Figure 3. The consumers are sorted into priority order and progressively lock the available resources. At the end of this process, the bilingual equivalences that have successfully locked resources comprise the fringe.
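The algorithm of Figure 3 can be sketched in a few lines of Python. This is an illustrative rendering rather than the SID implementation: the `Equivalence` record, its single numeric `priority` field (standing in for priority_fn) and the English placeholder glosses are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Equivalence:
    gloss: str                # translation offered for the matched words
    justification: frozenset  # positions of the source words it consumes
    priority: float           # higher wins; stands in for priority_fn


def tile(equivalences):
    """Greedy partial tiling: each word may be locked by at most one equivalence."""
    locked = set()
    fringe = []
    # Sorting dominates the cost, giving O(n log n) overall.
    for eq in sorted(equivalences, key=lambda e: e.priority, reverse=True):
        if locked.isdisjoint(eq.justification):  # are all its words still free?
            locked |= eq.justification           # mark the words as consumed
            fringe.append(eq)                    # best translation fragment
    return fringe


# "make(0) up(1) for(2) lost(3) time(4)"
candidates = [
    Equivalence("reconcile", frozenset({0, 1}), 2.0),      # make up
    Equivalence("compensate", frozenset({0, 1, 2}), 3.0),  # make up for
    Equivalence("for", frozenset({2}), 1.0),
    Equivalence("time", frozenset({4}), 1.0),
]
fringe = tile(candidates)
print([eq.gloss for eq in fringe])
```

Here make up for locks words 0 to 2 first, so make up and for are rejected and remain available only as menu alternatives; word 3 receives no gloss, which shows that the tiling is partial.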

3.2 Complexity

We index each bilingual equivalence by choosing the least frequent source word as a key. We retrieve all bilingual equivalences indexed by all the words in a sentence. Retrieval on each key is more or less constant in time. The total number of equivalences retrieved is proportional to the sentence length, n, and their individual applications are constant in time. Thus, the complexity of the rule application phase is order n. The final phase (the algorithm of Figure 3) is fundamentally a sorting algorithm. Since each phase is independent, the overall complexity is bounded by that of sorting, order n log n.
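The indexing scheme might look roughly like the following sketch, in which each descriptor is reduced to a tuple of its source words and filed under its rarest word. The function names, the word-frequency table and the descriptors are hypothetical; unknown words are treated as having frequency zero, which is an assumption of the sketch.

```python
from collections import defaultdict


def build_index(descriptors, word_freq):
    """File each descriptor under its least frequent source word."""
    index = defaultdict(list)
    for words in descriptors:
        key = min(words, key=lambda w: word_freq.get(w, 0))
        index[key].append(words)
    return index


def retrieve(index, sentence):
    """One hash lookup per distinct word of the sentence, in sentence order."""
    found = []
    for word in dict.fromkeys(sentence):
        found.extend(index.get(word, []))
    return found


# Invented corpus counts and descriptors.
WORD_FREQ = {"make": 500, "up": 900, "for": 2000, "stem": 20,
             "from": 1500, "in": 3000, "part": 300}
DESCRIPTORS = [("make", "up"), ("make", "up", "for"),
               ("stem", "from"), ("in", "part")]

index = build_index(DESCRIPTORS, WORD_FREQ)
retrieved = retrieve(index, ["it", "stem", "in", "part", "from", "neglect"])
print(retrieved)
```

Keying on the rarest word means that frequent words such as in or from trigger no retrievals on their own, which keeps the number of candidates per sentence small before the full matching conditions are checked.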

This algorithm does not guarantee to fully tile the input sentence. If full tiling were desired, a tractable solution is to guarantee that every word has at least one bilingual equivalence with a single word key descriptor. However, as will be apparent from Figure 1, glossing the commonest and most ambiguous words would obscure the clarity of the gloss and reduce its precision.

The algorithm as presented operates on source language words in their entirety. Morphological analysis introduces a further complexity by splitting a word into component morphemes, each of which can be considered a resource. The algorithm can be adapted to handle this by ensuring that a key descriptor locks a reading as well as the component morphemes. Once a reading is locked, only morphemes within that reading can be consumed.
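One way to picture this adaptation is the sketch below, where a resource is a (word, reading, morpheme) triple and a word becomes committed to the first reading that any consumer locks. The function `try_lock`, the triple representation and the two readings of lost are assumptions made for illustration, not a description of the actual implementation.

```python
def try_lock(claims, locked_reading, used):
    """Try to lock (word, reading, morpheme) triples for one key descriptor.

    locked_reading maps a word index to the reading it is committed to;
    used holds the triples already consumed. Both are updated on success.
    """
    claims = list(claims)
    for word, reading, morpheme in claims:
        if locked_reading.get(word, reading) != reading:
            return False  # word already committed to a different reading
        if (word, reading, morpheme) in used:
            return False  # this morpheme has already been consumed
    for word, reading, morpheme in claims:
        locked_reading[word] = reading
        used.add((word, reading, morpheme))
    return True


# Word 3 is "lost": reading "V" = lose + +ed, reading "ADJ" = lost.
locked_reading, used = {}, set()
outcomes = [
    try_lock([(3, "V", "+ed")], locked_reading, used),    # locks reading V
    try_lock([(3, "V", "lose")], locked_reading, used),   # same reading: allowed
    try_lock([(3, "ADJ", "lost")], locked_reading, used), # other reading: refused
    try_lock([(3, "V", "+ed")], locked_reading, used),    # already consumed
]
print(outcomes)
```

Checking all claims before recording any of them keeps a failed attempt from leaving partial locks behind.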

3.3 Prioritising Equivalences

If the probabilistic ranking function, f, were elicited by means of corpus evidence, the prioritisation of equivalences would fall out naturally as the solutions to equation 1. In this section, we show how a sequence of simple heuristics can approximate the behaviour of the equation.

We first constrain equivalences to apply only over a limited distance (the search radius), which we currently assume is the same for all discontinuous key descriptors. This corresponds approximately to the steep fall in the cases illustrated in Figure 2.

After this, we sort the equivalences that have applied according to the following criteria:

1. baggability

2. compactness

3. reading

4. rightmostness

5. frequency priority

Baggability is the number of source words consumed by an equivalence. For instance, in the fragment … make up for lost time … , we prefer make up for (= compensate) over make up (= reconcile, apply cosmetics, etc). We indicated in section 2.2 that baggability is generally correct.

However, baggability incorrectly models all values of f in n-dimensional space as higher than any value in n-1 dimensional space. In a phrase like formula milk for crying babies, baggability will prefer formula for ... ing to formula milk.

Compactness prefers collocations that span a smaller number of words. Consider the fragment … get something to eat … Assume something to and get to are collocations. The span of something to is 2 words and the span of get to is 3. Given that their baggability is identical, we prefer the most compact, i.e. the one with the least span. In this case, we correctly prefer something to, though we will go wrong in the case of get someone to eat. Compactness models the overall downward trend of f.

Reading priority models the tagger probabilities of equation 1. Of course, placing this here in the ordering means that tagger probabilities never override the contribution of f. There are many cases where this is not accurate, but its effect is mitigated by the use of a threshold for tag probabilities: very unlikely readings are pruned and therefore unavailable to the key descriptor matching process.

Reading priority orders equivalences which differ only in the categories they assign to the same words. For instance, in the fragment the way to London, the key descriptor way_N <1> to_PREP (= road to) will be preferred over way_N <1> to_TO (= method of) since the probability of the latter POS for to will be lower.

Rightmostness describes how far to the right an expression occurs in the sentence. All other criteria being equal, we prefer the rightmost expression on the grounds that English tends to be right-branching.

Frequency priority picks out a single equivalence from those with the same key descriptor, which is intended to represent its most frequent sense, or at least its most general translation.
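The five criteria can be combined into a single lexicographic sort key, as in the sketch below. The `Candidate` record and its fields are hypothetical. Two details are assumptions of the sketch rather than statements from the text: rightmostness is measured by the position of the last consumed word, and frequency priority is represented as a sense rank where 0 is the most frequent sense.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    gloss: str
    positions: tuple     # word positions consumed, in ascending order
    reading_prob: float  # combined tag probability of the readings used
    sense_rank: int      # 0 = most frequent sense for this key descriptor


def priority_key(c):
    """Smaller keys sort first, i.e. have higher priority."""
    baggability = len(c.positions)
    span = c.positions[-1] - c.positions[0] + 1
    rightmost = c.positions[-1]
    return (-baggability,     # 1. more words consumed first
            span,             # 2. then the most compact
            -c.reading_prob,  # 3. then the likeliest reading
            -rightmost,       # 4. then the rightmost
            c.sense_rank)     # 5. then the most frequent sense


# "get(0) something(1) to(2) eat(3)"
candidates = [
    Candidate("get to", (0, 2), 0.9, 0),
    Candidate("something to", (1, 2), 0.9, 0),
]
ranked = sorted(candidates, key=priority_key)
print([c.gloss for c in ranked])
```

Both candidates consume two words, so compactness decides: something to spans 2 words against 3 for get to and is ranked first, as in the example discussed above.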

4 Evaluation

The above algorithm is implemented in the SID system for glossing English into Japanese³. A large dictionary from an existing MT system was used as the basis for our dictionary, which comprises about 200k distinct key descriptors keying about 400k translations. SID reaches a peak glossing speed of about 12,000 words per minute on a 200 MHz Pentium Pro.

To evaluate SID we compared its output with a 1 million word dependency-parsed corpus (based on the Penn TreeBank) and rated as correct any collocation which corresponded to a connected piece of dependency structure with matching tags. We added other correctness criteria to cope with those cases where a collocate is not dependency-connected in our corpus, such as a subject-main verb collocate separated by an auxiliary (a rally was held), or a discontinuous adjective phrase (an interesting man to know). Correctness is somewhat over-estimated in that a dependent preposition, for example, may not have the intended collocational meaning (it marks an adjunct rather than an argument), but this appears to be more than offset by tag mismatch cases which might be significant but are not in many particular cases, e.g. Grand Jury where Grand may be tagged ADJ by SID but NP in Penn, or passed the bill on to the House, where on may be tagged ADV by SID but IN (= preposition) in Penn.

³ Available in Japan as part of Sharp's Power E/J translation package on CD-ROM for Windows® 95. A trial version is available for download at http://www.sharp.co.jp/sc/excite/soft_map/ej-a.htm

To obtain a baseline recall figure we ran SID over the corpus with a much lower tag probability threshold and much higher search radius⁴, and counted the total number of correct collocations detected anywhere amongst the alternatives.

SID detected a total of c. 150k collocations with its parameters set to their values in the released version⁵, of which we judged 110k correct for an overall precision of 72%, which rises to 82% for fringe elements. Overall recall was 98% (75% for the fringe). These figures indicate that the user would have to consult the alternatives for nearly a fifth of collocations (more if we consider sense ambiguities), but would fail to find the right translation in only 2% of cases.

Preliminary inspection of the evaluation results on a collocation by collocation basis reveals large numbers of incorrect key descriptors which could be eliminated, adjusted or further constrained to improve precision with little loss of recall. This leads us to believe that a fringe precision figure of 90% or so might represent the achievable limit of accuracy using our current technology.

5 Conclusion

We have described an efficient and lightweight glossing system that has been used in Sharp products. It is especially useful for quickly "gisting" web and email documents. With a little effort, the user can display the correct translation for the vast majority of the items in a document.

In future work, we hope to approximate more closely the full probabilistic prioritisation model and otherwise improve the key descriptor language, leading to more accurate analysis. We will also explore techniques for extracting collocations from monolingual and bilingual corpora, thereby improving the coverage of the system.

⁴ threshold 1%, radius 12

⁵ threshold 4%, radius 5

Acknowledgements

We would like to thank our colleagues within Sharp, particularly Simon Berry, Akira Imai, Ian Johnson, Ichiko Sata and Yoji Fukumochi.

References

Alshawi, H. (1996) Head automata and bilingual tiling: translation with minimal representations. In Proceedings of the 34th ACL, Santa Cruz, California.

Charniak, E. (1993) Statistical Language Learning. MIT Press.

Hawkins, J. (1994) A Performance Theory of Order and Constituency. Cambridge Studies in Linguistics 73, Cambridge University Press.

Nerbonne, J. and P. Smit (1996) Glosser-RuG: in Support of Reading. In Proceedings of the 16th COLING, Copenhagen.

Poznanski, V., J.L. Beaven and P. Whitelock (1995) An Efficient Generation Algorithm for Lexicalist MT. In Proceedings of the 33rd ACL, MIT.

Whitelock, P.J. (1994) Shake-and-Bake Translation. In C.J. Rupp, M.A. Rosner and R.L. Johnson (eds.) Constraints, Language and Computation. Academic Press.
