
MEANING Cycle 2: Acquisition

Document Number: D5.2
Project ref.: IST-2001-34460
Project Acronym: MEANING
Project full title: Developing Multilingual Web-scale Language Technologies
Project URL: http://www.lsi.upc.es/~nlp/meaning/meaning.html
Availability: Public
Authors: John Carroll (Sussex), German Rigau (UPV/EHU), Bernardo Magnini (ITC-irst), Eneko Agirre (UPV/EHU), Horacio Rodriguez (UPC), Jordi Atserias (UPC)

INFORMATION SOCIETY TECHNOLOGIES


Project ref.: IST-2001-34460
Project Acronym: MEANING
Project full title: Developing Multilingual Web-scale Language Technologies
Security (Distribution level): Public
Contractual date of delivery: February 29 2004
Actual date of delivery: March 31 2004
Document Number: D5.2
Type: Report
Status & version: Final
Number of pages: 34
WP contributing to the deliverable: Work Package 5
WP/Task responsible: John Carroll (Sussex)
Authors: John Carroll (Sussex), German Rigau (UPV/EHU), Bernardo Magnini (ITC-irst), Eneko Agirre (UPV/EHU), Horacio Rodriguez (UPC), Jordi Atserias (UPC)
Other contributors:
Reviewer:
EC Project Officer: Evangelia Markidou
Keywords: NLP, Lexical Knowledge Representation, Acquisition, WSD
Abstract: We summarise the work carried out in Meaning Cycle 2 on the acquisition of lexical information to feed into improved word sense disambiguation. The research has been organised as several experiments, which are described in detail in Meaning Working Papers.


Contents

Introduction

1 Experiment 5.A a) – Multilingual Knowledge Acquisition
  1.1 Introduction
  1.2 Multilingual comparison
  1.3 Evaluation and discussion
  1.4 Conclusions

2 Experiment 5.A b) – Multilingual Knowledge Acquisition: Exploring the Portability of Syntactic Information from English to Basque
  2.1 Introduction and background
  2.2 Source data
  2.3 Results
  2.4 Conclusions and further work

3 Experiment 5.B – Collocation

4 Experiment 5.C – Acquisition of Domain Information for Named Entities

5 Experiment 5.D – Topic Signatures
  5.1 Introduction and background
  5.2 Source data
  5.3 Experiments
  5.4 Evaluation
  5.5 Results of Experiment 5.D.a
  5.6 Results of Experiment 5.D.b
  5.7 Results of Experiment 5.D.c
  5.8 Conclusions

6 Experiment 5.E – Sense Examples
  6.1 Introduction
  6.2 ExRetriever: a new approach
  6.3 Experiments
  6.4 Conclusions and future work

7 Experiment 5.F – Lexical Knowledge from MRDs

8 Experiment 5.G – Improved Selectional Preferences

9 Experiment 5.H – Clustering WordNet Word Senses
  9.1 Introduction
  9.2 Data sources
  9.3 Experiment
  9.4 Evaluation
  9.5 Results
  9.6 Conclusions

10 Experiment 5.I – Multiwords: Phrasal Verbs

11 Experiment 5.J – New Senses

12 Experiment 5.K – Ranking Senses Automatically

References


Introduction

Acquisition of lexical information in Cycle 2 of Meaning focused on acquiring information from large text collections and the web, to feed into improved word sense disambiguation. In this deliverable we summarise the twelve co-ordinated experiments (named 5.A–5.K, with 5.A comprising two sub-experiments) that form this cycle. The experiments are:

Experiment 5.A Investigating multilingual acquisition (two experiments).

Experiment 5.B Detecting new noun-noun and adjective-noun collocations.

Experiment 5.C Acquiring domain information for named entities.

Experiment 5.D Acquiring topic signatures from large corpora and the web.

Experiment 5.E Acquiring examples of words used in particular senses from large textcollections.

Experiment 5.F Acquisition of lexical knowledge from MRDs.

Experiment 5.G Acquiring improved selectional preferences.

Experiment 5.H Acquisition of sense clusters.

Experiment 5.I Acquiring multiwords: phrasal verbs.

Experiment 5.J Acquiring new senses.

Experiment 5.K Acquiring sense rankings.

Experiment 5.A consisted of two studies into how the knowledge obtained in one language can help the acquisition process performed in other languages. These experiments also aimed to shed some light on the use of knowledge acquired from general and domain-specific corpora.

Experiment 5.B was intended to explore new ways to detect new noun-noun andadjective-noun collocations from large amounts of English text. Unfortunately there wasinsufficient effort available to continue this experiment, so it was suspended.

The goal of experiment 5.C was to automatically acquire domain information for namedentities. However, due to lack of available effort it has been suspended.

Experiment 5.D was devoted to the acquisition of topic signatures from corpora and the web. Topic signatures associate a topical vector with each word sense. The dimensions of this topical vector are the words in the vocabulary, and the weights capture how closely each word is related to the target word sense. This experiment had three main goals: a) to check whether the current techniques for constructing topic signatures are satisfactory, b) to assess the usefulness of topic signatures on Meaning-related tasks, particularly domain information acquisition for word senses and clustering of word senses, and c) to carry out large-scale acquisition of topic signatures.

Experiment 5.E focused on acquiring, from large text collections, examples of words used in particular senses. The goal of this experiment was to test the feasibility of automatically generating arbitrarily large corpora for supervised WSD training. For this purpose a new software tool, ExRetriever, has been developed. ExRetriever automatically characterises each synset of a word as a query (using mainly synonyms, hyponyms and words from the definitions in a wordnet), and then uses these queries to obtain sense examples (sentences) automatically from large text collections.

In experiment 5.F, the consortium had planned to acquire semantic relations fromBasque, Spanish and English monolingual dictionaries. However, this experiment promisedlittle that was novel, so it has been suspended.

Experiment 5.G explored methods to acquire selectional preferences which, instead of covering all the noun senses in WordNet, give a probability distribution over a smaller set of "prototypical classes"; only data that can be disambiguated is covered, and the disambiguation is performed using the ratio of types in a class rather than tokens.

There is considerable literature on what makes word senses distinct, but there is nogeneral consensus on which criteria should be followed. From a practical point of view,the need to make two senses distinct will depend on the target application. In experiment5.H a set of automatic methods to hierarchically cluster the word senses in WordNet wasexplored.

Experiment 5.I investigated a method for unsupervised acquisition of English phrasalverbs (as distinct from verbs that take prepositional complements).

Experiment 5.J was to have been concerned with acquiring new senses; it has beenpostponed to the final round of acquisition.

The first sense heuristic, which is often used as a baseline for supervised WSD systems, frequently outperforms WSD systems that take context into account. This is largely because of the skewed frequency distribution of the senses of many words. Experiment 5.K developed a method for automatically detecting this skew, ranking WordNet senses according to their prevalence in corpora.
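For concreteness, a prevalence ranking of this kind can be sketched as follows. This is only a minimal illustration in the spirit of Experiment 5.K, not its exact method: it assumes a precomputed distributional thesaurus exposed through a hypothetical neighbours(word) helper, and uses the NLTK WordNet interface rather than the project's resources.

```python
# Minimal sketch: rank the WordNet noun senses of a word by corpus prevalence.
# `neighbours(word) -> [(neighbour_word, dss), ...]` is a hypothetical helper
# returning distributionally similar words with their similarity scores.
from nltk.corpus import wordnet as wn

def prevalence_ranking(word, neighbours, k=50):
    """Score each noun sense of `word` by how strongly the word's distributional
    neighbours support it, then rank the senses by that score."""
    scores = {}
    for sense in wn.synsets(word, pos=wn.NOUN):
        total = 0.0
        for neighbour, dss in neighbours(word)[:k]:
            # best WordNet similarity between this sense and any sense of the
            # neighbour, weighted by the distributional similarity score
            sims = [sense.wup_similarity(ns) or 0.0
                    for ns in wn.synsets(neighbour, pos=wn.NOUN)]
            if sims:
                total += dss * max(sims)
        scores[sense] = total
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The top-ranked sense can then be used as an automatically derived first-sense heuristic.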


1 Experiment 5.A a) – Multilingual Knowledge Acquisition

1.1 Introduction

This experiment presents a semantic-driven methodology for the automatic acquisition of verbal models. Our approach relies strongly on the semantic generalizations allowed by already existing resources (e.g. Domain labels, Named Entity categories, the SUMO ontology, etc.) and the current capabilities of the linguistic processors. Several experiments have been carried out using comparable corpora in four languages (Italian, Spanish, Basque and English) and two domains (FINANCE and SPORT), showing that the semantic patterns acquired can be general enough to be ported from one language to another.

Being a multidimensional problem, predicate knowledge is one of the most complex types of information to acquire. Predicates (verbs and their corresponding nominalizations) are essential for the development of robust and accurate parsing technology capable of recovering predicate-argument relations and logical forms. Without such knowledge, resolving most structural ambiguities in sentences is difficult, and understanding language is impossible.

A full account of predicate information requires specifying the number and type of arguments, the predicate sense under consideration, the semantic representation of the particular predicate-argument structure, the mapping between the syntactic and semantic levels of representation, semantic selectional restrictions/preferences on participants, control of omitted participants, and possible diathesis alternations. Unfortunately, all these kinds of knowledge are interdependent.

The work presented here explores some basic issues in the acquisition of semantic models: first, how current technology and the available knowledge can help large-scale acquisition tasks, mainly subcategorization frames (SCFs) and selectional restrictions or preferences (SPs) for Spanish, Italian, Basque and English; second, the impact on the acquisition process of using several languages at the same time; and third, the impact of using a domain corpus instead of a general corpus.

1.2 Multilingual comparison

In summary, the experiment studies new ways of restricting the search space when performing acquisition tasks, in order to obtain more accurate knowledge for some languages and to balance the coverage of such knowledge across languages. Thus, this experiment can also be seen as a common framework for studying productive ways of appropriately exploiting:

• available semantic knowledge (wordnets, Semantic Files, MultiWordNet Domains, the EuroWordNet Top Ontology, SUMO, etc., already present in the Mcr [Atserias et al.2004])

• cross-language discrepancies/agreements through the EuroWordNet Interlingual Index


• available comparable domain corpora in several languages

• large-scale selectional preferences already acquired from these multilingual corpora [Atserias et al.2003, Agirre et al.2003]

We carried out the experiment for particular verbal synsets which have common senses in the languages considered. For instance, the following verbal synsets belong to the same Ili record: the English verbal synset 01564908–v <gain, clear, make, earn, realize>, Italian <realizzare, guadagnare>, Spanish <ganar> and Basque <irabazi>.

First, we collected sentences containing those verbs in comparable corpora for both the FINANCE and SPORT domains. For each sentence, depending on the current capabilities of the Linguistic Processors used, we obtained the heads of the verb slots possibly acting as subjects and objects. Only the English linguistic processor RASP [Briscoe and Carroll2002] performs highly accurate dependency analysis.

For the other languages, in the pre-processing phase the sentences are PoS tagged and parsed into non-recursive phrasal units. The quality of parsing, especially with respect to NP chunks, is a crucial factor in the success of the analysis. For Basque, we used a chunker based on a unification grammar. For Italian and Spanish, in order to extract subject/object groups, three simple heuristics are applied: first, consider NP groups directly to the left and to the right of the VP; second, identify passives and postponed subjects; and finally, handle the VP NP NP case. As Italian and Spanish are subject-drop languages, we also use simple heuristics, based on barrier phrases, to detect subject/object-drop cases.

Finally, once the subject/object pairs are extracted, we associated a Named Entity category (or Semantic File from Wn) and a Domain label with each head of the nominal groups. We also implemented a very simple generalization procedure that associates with each verb one or more semantic patterns of the type Named Entity+WN Domain on the basis of their frequency.
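A minimal sketch of this generalisation step is given below; the tuple layout, function name and frequency threshold are illustrative assumptions rather than the project's actual code.

```python
# Minimal sketch: generalise each (verb, slot) to its most frequent
# Named Entity + WordNet Domain patterns, keeping patterns above a threshold.
from collections import Counter, defaultdict

def generalise(tuples, min_freq=5):
    """tuples: iterable of (verb, slot, ne_category, wn_domain), e.g.
    ('gain', 'object', 'AMOUNT', 'finance'), extracted from the parsed corpus."""
    patterns = defaultdict(Counter)
    for verb, slot, ne, domain in tuples:
        patterns[(verb, slot)][(ne, domain)] += 1
    # for each verb-slot, keep the patterns seen at least min_freq times
    return {key: [(pat, n) for pat, n in counts.most_common() if n >= min_freq]
            for key, counts in patterns.items()}
```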

In order to work with compatible representations across languages, we obtained all the synsets for each verb-slot filler. We also mapped the Named Entity types (PERSON, ORGANIZATION, AMOUNT, PERCENTAGE, DATE, etc.) to a common semantic representation. In the resulting tables, NONE stands for words that do not appear in the local wordnets, and NO SUBJECT/NO OBJECT represents sentences where the subject/object was not detected.

1.3 Evaluation and discussion

Section 7 of Deliverables D2.1 and D2.2 describes several ways to perform quantitative evaluation of the acquisition of subcategorization frames and selectional preferences; in both cases, type, token and task evaluations are proposed.

However, as this is a preliminary and exploratory experiment (with many hard and biased simplifications), we performed only a qualitative evaluation. That is, we compared the semantic patterns coming from the translation-equivalent verbs selected from each language-domain pair with one another (and with other equivalent resources available). Thus, this evaluation has been done in terms of overlapping patterns, etc.

It seems clear that none of the large-scale acquired knowledge resources is accurate enough on its own. Instead, it seems more appropriate to devise collaborative and productive ways to filter out overly general or erroneous corpus examples, patterns or Selectional Preferences.

In order to continue with collaborative multilingual knowledge acquisition analysis, Work Package WP3 needs to ensure consistent outputs from the different Linguistic Processors (LPs) already available. This means, for instance, providing comparable full NERC capabilities, anaphora resolution, etc. to all LPs, if future plans of Meaning include large-scale exploitation of domain news articles. In fact, Meaning needs a more detailed analysis of Named Entities (for instance, this is a DOCTOR rather than this is a PERSON), an issue closely related to experiment WP5.C "Domain knowledge for NE".

Regarding the Top Concept Ontology, a more detailed analysis is also suggested. WordNet Semantic Files (or Lexicographic Files) can be seen as a simplification of the EuroWordNet Top Ontology. The results obtained in this experiment suggest that the EuroWordNet Top Ontology could be a good reference for generalising conceptual patterns such as agent or patient roles. We also suggest mapping Named Entities to the EuroWordNet Top Ontology.

Finally, selectional preferences have been used without expansion, meaning that no inheritance/generalization has been performed. As the selectional preferences have been acquired by means of some kind of generalization, we also suggest (when applicable) providing a full expansion through the nominal part of the hierarchy. This applies to the selectional preferences acquired from SemCor and to the class-based selectional preferences acquired from the BNC.

1.4 Conclusions

Automatic acquisition of semantic patterns for predicate structures (verbs and their corresponding nominalizations) is one of the most complex tasks in lexical acquisition. Verbs show multidimensional and interdependent features (selectional preferences, diathesis alternations, subcategorization frames) and their behavior may vary not only across languages, but also across corpus domains and genres. These facts are problematic for any syntax-driven approach [Atserias et al.2001].

We proposed a cross-language methodology for acquiring semantic patterns for predicates. The pilot study we have conducted shows that it is possible to obtain promising results using this framework, considering the high degree of polysemy we are dealing with. We used very simple criteria together with large collections of comparable corpora and already existing semantic resources to acquire large amounts of semantic patterns that can be very useful for a number of applications based on shallow semantics. To our knowledge, this is the largest multilingual experiment carried out to acquire semantic patterns for predicates: we are working in parallel with four languages, with real text and with the available Linguistic Processors and semantic resources. Obviously, the whole process can be substantially improved at several steps, in particular the semantic generalization process.

Finally, we also plan to evaluate the application of the acquired cross-lingual models in particular NLP tasks, such as PP-attachment [Agirre et al.2004b], detection of subjects/objects, or WSD.


2 Experiment 5.A b) – Multilingual Knowledge Acquisition: Exploring the Portability of Syntactic Information from English to Basque

2.1 Introduction and background

The goal of this experiment is to explore the possibility of porting the linguistic knowledge acquired in one language to another. This portability issue could be especially relevant for minority languages with few resources, like Basque. If we were able to transfer some of the linguistic technologies and resources available for English to other languages, we could overcome the limitations caused by the lack of resources for these languages (for example, small corpora, lack of hand-annotated corpora, etc.).

We concentrate on the PP attachment problem for Basque. This problem is especially hard for languages like Basque, which has free word order. Our current parser makes attachment decisions based on heuristics. We devised an experiment in which attachment information coming from English parsed data is transferred, and the attachment decisions are based on this transferred information.

In order to test our method, we selected sentences with two verbs, and used a partial parser (Gojenola 2000, Aldezabal et al 2000) to obtain just the chunks in the sentences. The heads of the noun groups are extracted, and a set of possible dependency (verb-noun) pairs is built. The goal is to select, for each noun, which verb it should be attached to.

The method works as follows. We first translate the verbs and all the surrounding heads (nouns) into English, and for each Basque (verb-noun) pair we search all possible translation combinations in the dependency database built from an automatically parsed English corpus. Note that we only search for the English verb and noun occurring in a direct syntactic dependency. We collect the frequencies of all translated pairs for each (verb-noun) pair in Basque. In order to select the correct attachment for each noun, the frequencies collected for each of the two (verb-noun) pairs are compared. A larger number of translation occurrences found (maintaining a syntactic relation) is taken as indicative of a stronger relation between the head and one of the verbs, and that verb is selected. We also tried using the mutual information (MI) of the pairs.
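The decision rule can be sketched as follows; `translate` (a bilingual lexicon lookup) and `dep_count` (the frequency of a direct English verb-noun dependency) are hypothetical stand-ins for the project's actual resources, and only the raw-frequency variant is shown.

```python
# Minimal sketch of the attachment decision rule: sum the dependency evidence
# over all English translation combinations of each candidate (verb, noun) pair
# and attach the noun to the verb with more evidence.
def pair_evidence(basque_verb, basque_noun, translate, dep_count):
    """Total frequency of all English translation combinations of the pair
    occurring in a direct verb-noun dependency."""
    return sum(dep_count(ev, en)
               for ev in translate(basque_verb)
               for en in translate(basque_noun))

def attach(basque_noun, verb_a, verb_b, translate, dep_count):
    """Attach the noun to whichever candidate verb has more translated
    dependency evidence (raw frequency; the experiment also tried MI)."""
    score_a = pair_evidence(verb_a, basque_noun, translate, dep_count)
    score_b = pair_evidence(verb_b, basque_noun, translate, dep_count)
    return verb_a if score_a >= score_b else verb_b
```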

2.2 Source data

The English corpora (target corpora) comprise the BNC and the Reuters newswire finance and sports corpora. The English parser used is the RASP dependency parser developed by Carroll and Briscoe (2001). It is important to note that a dependency parser links the heads of phrases to the verbs, in contrast to other kinds of parsers, which link whole phrases to the verbs. This is a relevant feature because it facilitates the search.

In order to make efficient searches over the English parsed corpora we created a database for each corpus (BNC, Reuters sports, Reuters finances), where each tuple represents a syntactic relation between a verb and a dependent (the head of a noun or prepositional phrase). We collected 47,145,584 syntactic relations from the BNC, 252,579 relations from EFE sports, 614,289 relations from EFE finances, 1,439,445 from Reuters sports and 9,858,633 from Reuters finances.

The Basque corpus (source corpus) comes from a Basque newspaper and covers news from several months of the year 2000 (33,669 sentences) in different domains (culture, sports, finances, politics, etc.). From this corpus we chose sentences containing two verbs, where one of the verbs belonged to a list of 7 verbs (to win, to lose, to increase, to decrease, to tie, to train, to play). The number of sentences obtained this way was 343, in which we found 1,135 syntactic relations that could belong to either of the two verbs in the sentence.

2.3 Results

The results show that the system performs with precision well above the random baseline (0.5 in this case) for all combinations of source and target corpora. In all cases MI attains better results than the raw frequencies. The best precision value is 0.72, obtained when searching over Reuters sports to decide on (verb-noun) relations belonging to the Basque sports section, showing that narrowing the domain of the texts improves the results. Searching in Reuters sports also provides the best results even for decisions over relations belonging to the finance section or any other section of the Basque newspaper. The reason could be that the verbs selected are highly tied to the sports domain, so even for other sections we still get better results searching the sports corpus.

2.4 Conclusions and further work

This work aimed at exploring the portability of linguistic knowledge from one language to another. The results reported suggest that the transfer is possible, as a very simple technique which searches English dependencies is able to make valuable PP attachment decisions in Basque. This could be especially helpful for dealing with the structural ambiguity that scrambling adds to an already difficult task like PP attachment in Basque.

The system we have developed uses comparable corpora, as opposed to parallel corpora, which makes it very suitable for languages where parallel corpora are not easy to find, and also allows us to obtain large amounts of corpora linked to any target domain.

For future work we plan to improve the precision of the system with better treatment of multiwords (including Named Entities) and by using other dependencies in the source sentence to narrow the search space in English. For the same reason, we would also like to include frequency information in the preposition-postposition equivalence tables. Besides, we plan to combine this multilingual system with the heuristics already coded in the Basque parser.


We also plan to extend this work to the acquisition of selectional preferences in the target language and their use in the source language. Finally, we would like to make the attachment decisions for all the phrases in a sentence at the same time, in order to take the best overall decision.


3 Experiment 5.B – Collocation

The effort that had been assigned to 5.B has been reallocated to Experiment 5.D.


4 Experiment 5.C – Acquisition of Domain Information for Named Entities

This experiment has been suspended.


5 Experiment 5.D – Topic Signatures

5.1 Introduction and background

The Meaning topic signature task has extracted topic signatures for all word senses in the Multilingual Central Repository (Mcr, linked to WordNet version 1.6). This is a sub-task of WP5 (Acquisition), and topic signatures will be used in further acquisition tasks. Topic signatures also have potential uses in Word Sense Disambiguation (WP6) and Cross-Lingual Information Retrieval and Question Answering (WP8).

Topic signatures associate a topical vector with each word sense. The dimensions of this topical vector are the words in the vocabulary, and the weights try to capture the relatedness of each word to the target word sense. In other words, each word sense is associated with a set of related words and associated weights.

For instance, the first sense of church, glossed as "a group of Christians; any group professing Christian doctrine or belief", might have the following topic signature associated with it:

church(1177.83) catholic(700.28) orthodox(462.17) roman(353.04) religion(252.61) byzantine(229.15) protestant(214.35) rome(212.15) western(169.71) established(161.26) coptic(148.83) ...

We can build such lists from sense-tagged corpora simply by observing which words co-occur distinctively with each sense. The problem is that sense-tagged corpora are scarce. Alternatively, we can try to associate a number of documents from existing corpora with each sense and then analyze the occurrences of words in those documents.
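The following sketch illustrates the general idea of deriving a topic signature from a set of contexts associated with one sense. The weight used here is a simple smoothed log-ratio against background corpus counts, chosen for illustration; the experiments below compare several weight functions (e.g. MI, t-score).

```python
# Minimal sketch: weight the words co-occurring with a sense against their
# background corpus frequency, keeping the most distinctive ones.
import math
from collections import Counter

def topic_signature(sense_contexts, bg_counts, bg_total, top_n=20):
    """sense_contexts: list of tokenised contexts associated with one sense.
    bg_counts / bg_total: corpus-wide word counts and their sum."""
    counts = Counter(w for ctx in sense_contexts for w in ctx)
    total = sum(counts.values())
    weights = {}
    for w, c in counts.items():
        p_sense = c / total
        p_bg = (bg_counts.get(w, 0) + 1) / (bg_total + len(bg_counts))  # add-one smoothing
        weights[w] = c * math.log(p_sense / p_bg)
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:top_n]
```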

In this experiment we have followed two strategies to build topic signatures:

1. Using monolingual corpora and monosemous relatives. We have followed the monosemous relatives method, inspired by [Leacock et al.1998]. This method uses monosemous synonyms or hyponyms to construct the queries. For instance, the first sense of channel in WordNet has a monosemous synonym "transmission channel". All the occurrences of "transmission channel" in any corpus can be taken to refer to the first sense of channel. In our case we have used the following kinds of relations to get the monosemous relatives: hypernyms, direct and indirect hyponyms, and siblings (a sketch of gathering such relatives follows the list below). The advantages of this method are that it is simple, it does not need error-prone analysis of the glosses, and it can be used with languages whose wordnets do not include glosses.

2. Using a second language. We use a novel approach which takes advantage of the different ways in which word senses are lexicalised in typologically distant languages (English and Chinese in our current experiments). The method involves automatically translating a polysemous English word w into Chinese using a bilingual dictionary, using the results to query large Chinese corpora or a search engine, and finally translating the resulting Chinese corpora word by word back into English, thus producing a topic signature for each sense of w.
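As announced in item 1 above, here is a sketch of gathering monosemous relatives using the NLTK WordNet interface. Note that NLTK exposes a modern WordNet release rather than the 1.6 version used in the Mcr, so the exact relatives may differ.

```python
# Minimal sketch: collect monosemous synonyms, hypernyms, (in)direct hyponyms
# and siblings of a synset; each one can be used verbatim as a corpus query.
from nltk.corpus import wordnet as wn

def is_monosemous(lemma_name, pos):
    return len(wn.synsets(lemma_name, pos=pos)) == 1

def monosemous_relatives(synset):
    related = {synset} | set(synset.hypernyms())
    related |= set(synset.closure(lambda s: s.hyponyms()))   # direct and indirect hyponyms
    for hyper in synset.hypernyms():                          # siblings
        related |= set(hyper.hyponyms()) - {synset}
    return sorted({lemma.name().replace('_', ' ')
                   for s in related for lemma in s.lemmas()
                   if is_monosemous(lemma.name(), synset.pos())})

# e.g. monosemous_relatives(wn.synset('channel.n.01')) should include
# 'transmission channel'
```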

Topic signatures for words have been successfully used in summarization tasks [Lin and Hovy2000], and [Agirre et al.2000, Agirre et al.2001] have shown that it is possible to obtain good quality topic signatures for word senses.

In this second round, Meaning applied past research into the first approach above (which was previously tested on a limited number of words) to a vast number of word senses, i.e. all word senses of English nouns in the Mcr. This allows topic signatures to be applied extensively in the following areas of Meaning:

• Word Sense Disambiguation (cf. Experiment WP6.H.b in deliverable D6.2).

• Clustering word senses and synsets (cf. Experiment WP5.H in this deliverable).

• Classifying new concepts in the WordNet hierarchy (see [Alfonseca and Manandhar2002]),which will be explored in the next round.

The second, bilingual approach is much more recent, and in this round has only been testedon a limited number of words (although still using very large scale text resources for thequerying stage).

5.2 Source data

Depending on the approach taken to build the topic signatures, we have used the following data:

1. The basic source for building topic signatures has been the web, used as a huge corpus. After constructing the topic signatures we used the British National Corpus (BNC) for filtering. The source for word senses has been the Multilingual Central Repository (Mcr, linked to WordNet version 1.6). The information for each word sense in the Mcr is used to build queries which are fed into Google.

2. For the bilingual approach, our dictionaries were the Yahoo! Student English-Chinese On-line Dictionary and the LDC Chinese-English Translation Lexicon Version 3.0. The Chinese corpus data was retrieved from the Mandarin portion of the Chinese Gigaword Corpus and from the People's Daily On-line via its own search engine.

5.3 Experiments

On the one hand, we used the monosemous queries to build topic signatures from the web for all English noun senses in the Mcr. In order to build the queries we used the monosemous relatives method. For each monosemous noun in the Mcr (WN1.6) we built a query and retrieved examples via Google. For each query we were able to retrieve a maximum of 1,000 snippets. The snippets were processed to extract meaningful sentences.

On the other hand, we have devised further experiments to evaluate the quality ofalternative ways to construct topic signatures, and the utility of the topic signatures inother NLP tasks (see below).

In short, we performed the following experiments:

Experiment 5.D.a, Working Paper 5.5.a Build publicly available topic signatures forall WordNet 1.6 noun senses (that is, the English part of the Mcr). These topicsignatures are based on the monolingual approach.

Experiment 5.D.b, Working Paper 5.5.b Compare similarity measures based on TopicSignatures to other, hierarchy-based similarity measures for the monolingual ap-proach.

Experiment 5.D.c, Working Paper 5.5.c Evaluate the usefulness of the Topic Signa-tures on Word Sense Disambiguation (for the bilingual approach).

Experiment 5.H, Working Paper 5.9 Use topic signatures to cluster word senses (forthe monolingual approach).

5.4 Evaluation

Evaluation of automatically acquired semantic and world knowledge information is not aneasy task. There is no gold standard for topic signatures and hand evaluation is arbi-trary. Therefore the usefulness of topic signatures for a task is evaluated, rather than theirintrinsic quality.

The evaluation tasks are:

Experiment 5.D.b, Working Paper 5.5.b Compare similarity measures based on Topic Signatures to other, hierarchy-based similarity measures. The correlation of a variety of similarity measures based on topic signatures is compared to the usual hierarchy-based approaches.

Experiment 5.D.c, Working Paper 5.5.c Evaluate the usefulness of the Topic Signa-tures on Word Sense Disambiguation (for the bilingual approach).

Experiment 5.H, Working Paper 5.9 Compare the quality of the clusters producedby topic signatures with respect to other clustering strategies.


number of polysemous nouns      15,875
number of senses                37,678
average senses per noun         2.38
number of examples              135,040,841
average examples per sense      3,584.1
average examples per noun       8,527.9

Table 1: Data for the examples gathered for all the senses of polysemous nouns in the MCR.

total size                       228,331,038
total size (non-zero weights)    158,099,870
average size per signature       6,623.5
average size (non-zero weights)  4,587.2

Table 2: Size in words of the topic signatures of the polysemous nouns in the MCR.

5.5 Results of Experiment 5.D.a

Topic signatures following the monolingual approach based on monosemous relatives have been gathered for all English nominal word senses in the MCR. Statistics on the gathered examples and the resulting signatures are shown in Tables 1 and 2. All examples and constructed topic signatures can be browsed and downloaded from the following URLs:

• http://ixa3.si.ehu.es/cgi-bin/signatureak/signaturecgi.cgi

• http://ixa2.si.ehu.es/pub/webcorpus

5.6 Results of Experiment 5.D.b

The results show that it is possible to approximate very accurately the similarity metric based on link distance (also called semantic distance): it is possible to attain a similarity of 0.88 with monosemous relatives and the MI or t-score weight functions. The similarity to Resnik's function was somewhat lower, with a cosine similarity of 0.65 (again with the MI weight function, but with the all-relatives signature).
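For reference, the cosine comparison used above is standard; a minimal sketch over topic signatures stored as word-to-weight dictionaries:

```python
# Minimal sketch: cosine similarity between two word senses represented by
# their topic-signature weight vectors.
import math

def cosine(sig_a, sig_b):
    """sig_a, sig_b: dicts mapping words to weights."""
    common = set(sig_a) & set(sig_b)
    dot = sum(sig_a[w] * sig_b[w] for w in common)
    norm_a = math.sqrt(sum(v * v for v in sig_a.values()))
    norm_b = math.sqrt(sum(v * v for v in sig_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```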

Regarding the parameters of topic signature construction, the monosemous relatives method obtains the best correlation when compared to the link-distance gold standard. As this method uses a larger number of examples than the all-relatives method, we cannot be conclusive on this point.

5.7 Results of Experiment 5.D.c

We built a WSD system which uses as input data the topic signatures built by the bilingual approach. For testing we used the TWA Sense Tagged Data Set, which consists of six nouns with two-way sense ambiguity. See D6.2 for details of the experiments.

5.8 Conclusions

From the experiments carried out on topic signatures we conclude the following:

• Topic signatures constructed using the monolingual method based on monosemous relatives have been successfully used to compute the similarity between nominal word senses [Agirre and Lopez2003] and also to cluster nominal word senses [Agirre et al.2004a].

• We have made publicly available topic signatures for all English nominal word sensesin the MCR.

• Topic signatures constructed using the bilingual approach, when used for WSD, perform a little above the 'supervised' baseline, and beat the random-choice baseline by a large margin.

In the next round we intend to improve the methods for obtaining topic signatures, introducing more powerful techniques to construct the queries (for example, building on the work of [Widdows2003]). We also plan to use topic signatures and other distributional semantic methods to classify new senses in the MCR.


6 Experiment 5.E – Sense Examples

6.1 Introduction

A current line of research in word sense disambiguation (WSD) focuses on the use of supervised machine learning techniques. One of the drawbacks of such techniques is that sense-annotated data is required in advance. This section presents the current status of ExRetriever, a new software tool for automatically acquiring large sets of sense-tagged examples from large collections of text and the Web. ExRetriever exploits the knowledge contained in large-scale knowledge bases (e.g. WordNet) to build complex queries, each of them characterising a particular sense of a word. These examples can be used as training instances for supervised WSD algorithms.

Supervised machine learning algorithms use semantically annotated corpora to induce classification models for deciding which is the appropriate word sense for each particular context. The compilation of corpora for training and testing such systems requires a large human effort, since all the words in these annotated corpora have to be manually tagged by lexicographers with semantic classes taken from a particular lexical semantic resource, most commonly WordNet. Supervised methods suffer from the lack of widely available semantically tagged corpora, from which to construct really broad-coverage systems [Ng1997]. This extremely high overhead for supervision (all words, all languages) explains why supervised methods have been seriously questioned.

As a possible solution, some recent work is focusing on reducing the acquisition costand the need for supervision in corpus-based methods for WSD. [Leacock et al.1998],[Mihalcea and Moldovan1999] and [Agirre and Martinez2000] automatically generate ar-bitrarily large corpora for unsupervised WSD training, using the knowledge contained inWordNet to formulate search engine queries over large text collections or the Web.

In order to test the feasibility of this approach, the Meaning project (http://www.lsi.upc.es/~nlp/meaning) has developed and released a new tool: the first version of ExRetriever, a flexible system to perform sense queries on large corpora. ExRetriever automatically characterises each synset of a word as a query (using mainly synonyms, hyponyms and words from the definitions); it then uses these queries to obtain sense examples (sentences) automatically from a large text collection. The current implementation of ExRetriever accesses the content of the Mcr [Atserias et al.2004] directly. The system also uses SWISH-E to index large collections of text such as SemCor or the BNC. ExRetriever has been designed to be easily ported to other lexical knowledge bases and corpora, including the possibility of querying search engines such as Google.
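The following sketch illustrates the kind of sense query such a tool can build from a synset's synonyms, hyponyms and gloss. It is an illustration of the idea only, using the NLTK WordNet interface; it is not ExRetriever's actual query language or any of its strategies, and the stopword list is an arbitrary assumption.

```python
# Minimal sketch: characterise one synset of a target word as a boolean query
# built from its synonyms, direct hyponyms and gloss words.
from nltk.corpus import wordnet as wn

STOP = {'a', 'an', 'the', 'of', 'or', 'and', 'to', 'in', 'that', 'is', 'any'}

def sense_query(synset, target_lemma):
    terms = {l.name().replace('_', ' ') for l in synset.lemmas()} - {target_lemma}
    terms |= {l.name().replace('_', ' ')
              for hypo in synset.hyponyms() for l in hypo.lemmas()}
    terms |= {w for w in synset.definition().lower().split()
              if w.isalpha() and w not in STOP}
    # the target word must co-occur with at least one sense-specific term
    return '"%s" AND (%s)' % (target_lemma,
                              ' OR '.join('"%s"' % t for t in sorted(terms)))

# e.g. sense_query(wn.synset('church.n.01'), 'church')
```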

6.2 ExRetriever: a new approach

Although this approach seems very promising, it remains unclear which is the best strategy for building sense queries from a large-scale knowledge base like WordNet. ExRetriever will explore the trade-off between coverage (collecting large quantities of sense examples) and accuracy (making queries more precise and restrictive, and obviously less productive).

The first experiments have been performed using large-scale corpora stored locally. This approach allowed us to perform a large set of controlled tests, experiments and evaluations very quickly, comparing different query building strategies. Later, once we have a clearer view of the method and the knowledge to be used, we will perform much larger acquisition of examples from the web.

The current experiments performed by ExRetriever include the study of:

• the knowledge to be used (e.g. regarding PoS, monosemous relatives only, synonyms,direct hypernyms, direct hyponyms, involved relations, etc.)

• the query construction (e.g. including or not AND-NOTs with characterizations ofthe other sense queries)

• the complete query process (e.g. union set of queries, incremental construction, etc.)

• the post processing (e.g. using PoS, syntactic or domain filtering)

• the other languages involved in the project (using the Mcr)

• the use of different corpora

This tool characterises each sense of a word as a specific query. This is done automatically using a particular query construction strategy, which is defined a priori by an expert. Each strategy can take into account the information about words available in a lexical knowledge base in order to automatically generate the set of queries.

The current version of ExRetriever is able to use different lexical databases throughthe Mcr of Meaning [Atserias et al.2004] and different corpora (SemCor, BNC, the Web,etc.) through a common API.

In order to easily implement different query construction strategies, ExRetriever hasbeen powered with a declarative language. This language allows the manual definition ofcomplex query construction strategies.

6.3 Experiments

Within the framework of the Meaning project we designed a preliminary test set to validate ExRetriever. Both direct and indirect evaluations of ExRetriever's performance have been designed. However, in this report we present only the results of the direct evaluation on SemCor.

Using ExRetriever on SemCor we can perform detailed micro-analyses of the available data. That is, we can easily make many adjustments to the way queries are built in order to filter out unwanted examples, balancing the trade-off between coverage (we want to obtain all the examples of a particular sense occurring in a corpus) and precision (we want only those corresponding to the particular sense).

Each of these experiments consists of applying a particular query construction strategy to a subset of the 73 English words from the Senseval-2 lexical sample task. The resulting specific queries (one for each word sense), automatically generated by applying each strategy, have been tested against SemCor. Due to the small size of SemCor (around 250 thousand words), specific queries are likely to produce poor recall. However, SemCor is the only sense-tagged resource providing large quantities of examples for all words.

Six different query construction strategies have been tested, some of them inspired by those used by other authors.

Q      Ok     Ko      NoTag   #Sense   P    R    F1
Lea1   1551   1569    2037    12744    50   11   18
Mol1    257    209     436     5129    55    5    9
Mol3   2195   26734   2962     6122     8    7    7
Mea1   2978   27882   4318    10390    10    8    9
Mea2   6227   56038   9884    14595    10    9    9

Table 3: Overall results for the different query construction strategies.

Table 3 shows the overall figures for each query strategy when applied to the full set of 73 test words. When applying the same method systematically to all the words, the Mol1 and Lea1 strategies obtain the best precision (55% and 50% respectively). However, Lea1 obtains much better recall than Mol1 (11% vs. 5%). The poor precision results for both the Lea1 and Mol1 query strategies were unexpected. The reason for the low precision scores is that some of the monosemous relatives are not really monosemous in WordNet: they may have other senses for other parts of speech. However, this problem can easily be fixed by verifying the PoS of the retrieved examples, performing PoS tagging and removing those examples with a misleading PoS. We expect to solve this problem in the last Meaning round.

6.4 Conclusions and future work

In this experiment, ExRetriever, a query-based system to extract sense examples from corpora, has been described, and some preliminary experiments have been presented. These have been used to evaluate the performance of different types of query construction strategies. Using the powerful and flexible declarative language of ExRetriever, new strategies can easily be defined, executed and evaluated.

We plan to experiment with other strategies. For instance, performing full parsing of the glosses could help to discard irrelevant words from them. In addition, the knowledge already contained in the Multilingual Central Repository (e.g. selectional preferences acquired from the BNC, eXtended WordNet, domain information, the Topic Signatures acquired from the Web, etc.) could be useful for better modelling word senses as queries. Moreover, we plan to use alternative schemes for building queries, such as the incremental process performed by [Leacock et al.1998].

Another very promising line of research follows [Widdows2003]. This work presents a theoretically motivated method for removing unwanted meanings directly from the original queries represented in a vector model. Irrelevance in vector spaces is modelled using orthogonality. Using this approach, query vector negation removes not only unwanted strings but unwanted meanings. The method is applied to standard IR systems, processing queries such as "play NOT game". This work presents an algebra that operates on word vectors rather than words. Following this approach, it seems that most of the errors produced by substituting the target word with its relatives can be avoided. Furthermore, using this approach, we can also use other sense-tagged corpora for direct comparisons of ExRetriever. Although DSO only provides sense-tagged data for 141 words (nouns and verbs), it provides examples in large quantities (in the thousands). In this case, queries cannot include substitutive relatives, only query restrictions over the polysemous target word.
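The vector negation in [Widdows2003] amounts to projecting the query vector onto the subspace orthogonal to the unwanted meaning; a minimal sketch, assuming a hypothetical table `vec` of word vectors:

```python
# Minimal sketch of query vector negation: "a NOT b" is the component of a
# that is orthogonal to b.
import numpy as np

def negate(a, b):
    b_norm_sq = np.dot(b, b)
    if b_norm_sq == 0.0:
        return a
    return a - (np.dot(a, b) / b_norm_sq) * b

# e.g. play_not_game = negate(vec['play'], vec['game']) for a hypothetical
# word-vector table `vec`
```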

We also plan to perform indirect evaluations using supervised WSD systems on the acquired sense examples. Once a sense-tagged corpus has been acquired using ExRetriever, we will use several machine learning algorithms to perform cross-comparisons with respect to other sense-tagged resources (SemCor, DSO and the resources provided by Senseval).


7 Experiment 5.F – Lexical Knowledge from MRDs

This experiment has been suspended.


8 Experiment 5.G – Improved Selectional Preferences

The tree cut models (tcms) representing selectional preferences which were acquired previously typically suffered from an overly high level of generalisation; that is, they consisted of classes that are very high in the WordNet hierarchy. Prototypical classes, such as food as the direct object of eat, were sometimes hidden by a selectional preference for a class further up the hyponym hierarchy, such as entity. This was partly because of the polysemy of the training data and partly because the tcm method, using the minimum description length principle, covers all the data rather than looking for prototypical classes. We have investigated three methods to acquire more specific and intuitive models; these are described in full in working paper WP5.8. The method that worked best was the use of a new type of selectional preference model, referred to as a protomodel. Protomodels reduce the impact of noise, atypical arguments and polysemy by only covering data which can be disambiguated with reference to the other arguments. The effect of polysemy is also lessened by using types, rather than tokens, to define the classes that will represent the probability distribution for the verb and grammatical relation. Thus frequent but polysemous items are less likely to give rise to erroneous classes and overly general preferences.
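The type-based counting underlying such models can be sketched as follows. This is a simplified illustration using NLTK hypernym closures; the actual protomodel construction, including the disambiguation against the other arguments, is described in working paper 5.8.

```python
# Minimal sketch: argument head *types* (not token frequencies) vote for the
# WordNet classes that carry the preference distribution for a verb + relation.
from collections import Counter
from nltk.corpus import wordnet as wn

def type_based_preferences(argument_heads, candidate_classes):
    """argument_heads: distinct head lemmas observed for one verb + relation.
    candidate_classes: WordNet synsets considered as possible preference classes."""
    votes = Counter()
    for head in set(argument_heads):                     # count types, not tokens
        senses = wn.synsets(head, pos=wn.NOUN)
        ancestors = set(senses)
        for s in senses:
            ancestors |= set(s.closure(lambda x: x.hypernyms()))
        for cls in candidate_classes:
            if cls in ancestors:
                votes[cls] += 1
    total = sum(votes.values())
    return {cls: n / total for cls, n in votes.items()} if total else {}
```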

In Work Package 6, experiment G (working paper 6.10), we show that these models outperform the tcms on a WSD task. For the experiment reported here, we evaluated the models on a pseudo-disambiguation task. In this task, the selectional preferences determine which of two arguments is the one genuinely attested in the corpus data. The details of this experiment are reported in working paper 5.8. Table 4 shows the average precision and recall over the direct-object tuples extracted from the BNC, under 10-fold cross-validation. The protomodels outperform the tcms in both precision and recall.
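The pseudo-disambiguation set-up can be sketched as follows; `pref(verb, noun)` is a hypothetical scoring function standing in for an acquired preference model, and the handling of unscored pairs is an illustrative assumption rather than the evaluation protocol of working paper 5.8.

```python
# Minimal sketch: for each attested (verb, object) pair, the model must score
# the genuine object higher than a randomly drawn confounder noun.
import random

def pseudo_disambiguation(test_pairs, all_nouns, pref, seed=0):
    rng = random.Random(seed)
    correct = attempted = 0
    for verb, noun in test_pairs:
        confounder = rng.choice(all_nouns)
        s_real, s_fake = pref(verb, noun), pref(verb, confounder)
        if s_real is None and s_fake is None:
            continue                         # no decision: hurts recall only
        attempted += 1
        if (s_real or 0.0) > (s_fake or 0.0):
            correct += 1
    precision = correct / attempted if attempted else 0.0
    recall = correct / len(test_pairs) if test_pairs else 0.0
    return precision, recall
```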

For this experiment, we also contrasted the performance of the tcms and protomodels with disambiguated input data. WSD of the input data was performed by others within the Meaning consortium (EHU and IRST); the details are provided in working paper 5.8. There is a small increase in recall, whilst precision is largely unaffected. The improvement in recall occurs because WSD of the input data ensures that more probability mass is assigned to classes that would otherwise not be covered by the models.

Model        Avg Precision (var)   Avg Recall (var)
tcm          71 (0.48)             48 (0.21)
protomodel   81 (0.49)             66 (0.59)

Table 4: Pseudo-disambiguation results.


9 Experiment 5.H – Clustering WordNet Word Senses

9.1 Introduction

There is considerable literature on what makes word senses distinct, but there is no general consensus on which criteria should be followed. From a practical point of view, the need to make two senses distinct will depend on the target application. This is evident, for instance, in Machine Translation, where some word senses get the same translation (both the television and communication senses of channel are translated as kanal in Basque) while others do not (the groove sense of channel is translated as zirrikitu in Basque), depending on the source and target languages.

In this experiment we explore a set of automatic methods to hierarchically cluster theword senses in WordNet.

9.2 Data sources

The clustering methods that we will examine are based on the following informationsources:

• Similarity matrix for word senses based on the confusion matrix of all systems that participated in Senseval-2.

• Similarity matrix for word senses produced by [Chugur and Gonzalo2002] using translation equivalences in a number of languages.

• Similarity matrix based on the Topic Signatures for each word sense. The topic signatures were constructed from the occurrence contexts of the word senses, which can be extracted from hand-tagged data or automatically constructed from the Web (see Experiment 5.D.a in this round, and also [Agirre et al.2000, Agirre et al.2001]). A sketch of how such a matrix can be derived from topic signature vectors follows this list.
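As a rough illustration of the third source, the sketch below treats a topic signature simply as a mapping from context words to weights and fills each cell of the matrix with the cosine of two signatures. The toy signatures, sense identifiers and function names are ours, for illustration only; the actual signature construction is described in Experiment 5.D.a.

```python
# Sketch of turning topic signatures into a sense-by-sense similarity matrix:
# each signature is a dict of context words and weights, and each matrix cell
# is the cosine of two such signatures.
import math

def cosine(sig_a, sig_b):
    shared = set(sig_a) & set(sig_b)
    dot = sum(sig_a[w] * sig_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in sig_a.values()))
    norm_b = math.sqrt(sum(v * v for v in sig_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similarity_matrix(signatures):
    """signatures: dict mapping sense id -> {context word: weight}."""
    senses = sorted(signatures)
    return senses, [[cosine(signatures[a], signatures[b]) for b in senses] for a in senses]

if __name__ == "__main__":
    sigs = {"channel#1": {"tv": 3.0, "broadcast": 2.0, "news": 1.0},
            "channel#2": {"water": 2.5, "groove": 1.5},
            "channel#3": {"tv": 1.0, "programme": 2.0}}
    senses, matrix = similarity_matrix(sigs)
    print(senses)
    for row in matrix:
        print([round(x, 2) for x in row])
```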

9.3 Experiment

In order to construct the hierarchical clusters we use Cluto [Karypis2001], a general clustering environment which can take as input either a similarity matrix or the context of occurrence of each word sense in the form of a vector. The similarity matrixes from the previous section are fed into Cluto, which outputs the clusters.
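Cluto is a standalone toolkit with its own input formats, so we do not reproduce its interface here. As a rough stand-in, the same step can be sketched with scipy's agglomerative clustering: the similarity matrix is converted to distances and the resulting dendrogram is cut at a chosen number of clusters. This approximates the workflow only, not Cluto's own criterion functions; the toy matrix and function name are ours.

```python
# Stand-in for the Cluto step: agglomerative clustering over a similarity
# matrix, cutting the tree at n_clusters clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_senses(senses, sim_matrix, n_clusters):
    sim = np.asarray(sim_matrix, dtype=float)
    dist = 1.0 - sim                       # convert similarity to distance
    np.fill_diagonal(dist, 0.0)            # zero self-distance, required by squareform
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    clusters = {}
    for sense, label in zip(senses, labels):
        clusters.setdefault(label, []).append(sense)
    return list(clusters.values())

if __name__ == "__main__":
    senses = ["channel#1", "channel#2", "channel#3"]
    sim = [[1.0, 0.1, 0.7],
           [0.1, 1.0, 0.2],
           [0.7, 0.2, 1.0]]
    print(cluster_senses(senses, sim, n_clusters=2))
```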

9.4 Evaluation

The gold standard is based on the manual grouping of word senses provided in Senseval-2. This gold standard is used in order to compute purity and entropy values for the clustering results. The quality of a clustering solution is measured using two different metrics that look at the gold-standard labels of the word senses assigned to each cluster [Zhao and Karypis2001].
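The sketch below computes cluster purity and entropy as we understand the definitions of [Zhao and Karypis2001]: each cluster is scored against the gold-standard sense groups and the scores are weighted by cluster size, with entropy normalised by the log of the number of gold groups. Higher purity and lower entropy indicate a better match. The function name and the toy data are ours.

```python
# Size-weighted purity and entropy of a clustering against gold-standard labels.
import math
from collections import Counter

def purity_entropy(clusters, gold):
    """clusters: list of lists of sense ids; gold: dict sense id -> gold group label."""
    n = sum(len(c) for c in clusters)
    n_classes = len(set(gold.values()))
    purity = entropy = 0.0
    for cluster in clusters:
        counts = Counter(gold[s] for s in cluster)
        size = len(cluster)
        purity += (size / n) * (max(counts.values()) / size)
        if n_classes > 1:
            h = -sum((c / size) * math.log(c / size) for c in counts.values())
            entropy += (size / n) * h / math.log(n_classes)
    return purity, entropy

if __name__ == "__main__":
    gold = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
    print(purity_entropy([["s1", "s2"], ["s3", "s4"]], gold))   # perfect match: (1.0, 0.0)
    print(purity_entropy([["s1", "s3"], ["s2", "s4"]], gold))   # worst case:    (0.5, 1.0)
```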


Method                               Entropy   Purity
Random                               –         0.748
Confusion Matrixes                   0.364     0.768
Multilingual Similarity              0.337     0.799
Topic signatures: Senseval (Worst)   0.378     0.744
Topic signatures: Senseval (Best)    0.338     0.775
Topic signatures: Web (Worst)        0.358     0.764
Topic signatures: Web (Best)         0.209     0.840

Table 5: Results for the different clustering methods


Some of the nouns in Senseval-2 have trivial clustering solutions, e.g. when all the word senses form a single cluster, or all clusters are formed by a single word sense. 20 nouns have non-trivial clusters and can therefore be used for evaluation.

9.5 Results

We built a random baseline, computed by averaging over 100 random clustering solutions for each word. Each clustering solution was built by assigning a cluster to each word sense at random. Two other baselines are based on the mistakes made by the automatic wsd systems participating in the Senseval-2 competition (confusion matrixes) and on the number of common translations across different languages (multilingual similarity).

Table 5 shows the results in the form of average entropy and purity for each of the clustering methods used. The first three rows show the results for the three baseline methods. For clustering based on topic signatures, the best and worst clustering results are shown for all the combinations of parameters that were tried. Separate results are shown depending on whether the examples were taken from Senseval-2 (hand-tagged data) or from the Web.

According to the results, automatically obtained examples proved the best: the optimal combination of clustering parameters gives the best results. Thus, of the alternatives studied here, examples automatically retrieved from the Web are the best source for replicating the gold standard.

9.6 Conclusions

We have explored different methods to cluster WordNet word senses. The methods rely on different information sources: confusion matrixes from Senseval-2 systems, translation similarities, hand-tagged examples of the target word senses, and examples obtained automatically from the Web for the target word senses. The clustering results have been evaluated using the coarse-grained word senses provided for the lexical sample in Senseval-2. We have used Cluto, a general clustering environment, in order to test different clustering algorithms.

The best results are obtained with the automatically retrieved examples, with purity values of up to 84% on average over the 20 nouns.

We have currently acquired 1,000 snippets from Google for each monosemous noun in WordNet (cf. Experiment 5.D.a). We plan to provide word sense clusters of comparable quality for all nouns in WordNet soon.


10 Experiment 5.I – Multiwords: Phrasal Verbs

Phrasal verb candidates, e.g. blow up, are output by the rasp parser, and this is extremely useful for the acquisition of both subcategorization frames and selectional preferences. When acquiring such lexical information for a verb, it is important to know when a special interpretation is required for the verb and particle combination, so that these combinations are handled separately from the simplex case. Whilst it is possible to put every single occurrence of a verb and particle combination into a lexicon, this is not desirable. One wants to achieve generalisation and avoid redundancy, only storing details which cannot be derived from what is already there. We have investigated the use of an automatic thesaurus, acquired using the method proposed by [Lin1998], to find an empirical basis for the compositionality of the candidate types. We investigated various measures using the neighbours of the phrasal verb from this thesaurus. Some of our measures compared these to the neighbours of the simplex verb. We assume that candidates which are less compositional are more likely to be genuine phrasal verbs that warrant an entry in a lexicon.

In this experiment, we demonstrated that various measures using the nearest neighbours of a phrasal verb and its simplex counterpart show a highly significant correlation with human compositionality judgements. We also show that these more compositional candidates, according to the human judgements, are more likely to feature in a man-made resource, such as WordNet. The experiments are reported in detail in working paper 5.10, which is based on [McCarthy et al.2003].

Table 6 gives a summary of our main results. We obtained significant results at the 5% level using the Spearman Rank-Order Correlation Coefficient (rs) when computing the overlap between the neighbours of the simplex and the phrasal verb (overlap).2 The results when reducing the neighbours of the phrasal verb to their simplex form (overlapS) showed an even stronger correlation and higher significance. We also had excellent results by simply computing the number of neighbours of the phrasal verb having the same particle as the phrasal verb (sameparticle), and the number of neighbours with the same particle as the phrasal verb minus the equivalent number for the simplex neighbours (i.e. those having the same particle as the target phrasal verb) (sameparticle-simplex).3 We demonstrated that whilst other statistics commonly used for collocation extraction (the χ2 statistic, the log-likelihood ratio statistic and point-wise mutual information) were also (negatively) correlated with the human judgements, the relationship was less significant than for the thesaurus measures.
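A minimal sketch of the four thesaurus-based measures is given below. The neighbour lists would come from the automatically acquired thesaurus of [Lin1998]; here they are toy lists, and the helper functions and their names are ours, for illustration only.

```python
# Sketch of the four thesaurus-based compositionality measures: overlap,
# overlapS, sameparticle and sameparticle-simplex, computed over the nearest
# neighbours of a phrasal verb and its simplex counterpart.
def simplex_form(verb):
    """'blow up' -> 'blow'; a simplex verb maps to itself."""
    return verb.split()[0]

def particle(verb):
    parts = verb.split()
    return parts[1] if len(parts) > 1 else None

def overlap(phrasal_nbrs, simplex_nbrs):
    return len(set(phrasal_nbrs) & set(simplex_nbrs))

def overlap_s(phrasal_nbrs, simplex_nbrs):
    # Reduce both neighbour sets to their simplex forms before intersecting.
    return len({simplex_form(n) for n in phrasal_nbrs} & {simplex_form(n) for n in simplex_nbrs})

def same_particle(target, nbrs):
    p = particle(target)
    return sum(1 for n in nbrs if particle(n) == p)

def same_particle_minus_simplex(target, phrasal_nbrs, simplex_nbrs):
    return same_particle(target, phrasal_nbrs) - same_particle(target, simplex_nbrs)

if __name__ == "__main__":
    target = "blow up"
    phrasal_nbrs = ["explode", "detonate", "burn down", "break up", "destroy"]
    simplex_nbrs = ["blast", "puff", "explode", "break up"]
    print(overlap(phrasal_nbrs, simplex_nbrs))                         # shared neighbours
    print(overlap_s(phrasal_nbrs, simplex_nbrs))                       # shared simplex forms
    print(same_particle(target, phrasal_nbrs))                         # neighbours sharing "up"
    print(same_particle_minus_simplex(target, phrasal_nbrs, simplex_nbrs))
```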

From our results we found that both mutual information (the best performing statistic) and sameparticle-simplex (the best performing thesaurus measure) were correlated with whether the candidate was found in either WordNet or anlt. Again, the thesaurus measure sameparticle-simplex showed a higher correlation (see Table 7).

2 This result was obtained using 50 neighbours, which was the optimal number of neighbours for this measure.
3 Both of these measures were computed with all 500 neighbours.


Measure                Correlation statistic   Z score   Probability under H0
overlap                rs = 0.136              1.43      0.08
overlapS               rs = 0.303              3.18      <0.0007
sameparticle           rs = 0.414              4.34      <0.00003
sameparticle-simplex   rs = 0.49               5.17      <0.00003

Table 6: Correlation with human compositionality judgements.

Measure                in WordNet   in ANLT
MI                     -2.61        -4.53
sameparticle-simplex   3.71         4.59

Table 7: Mann-Whitney Z scores showing correlation of two measures of compositionality with man-made resources.


11 Experiment 5.J – New Senses

This experiment has been postponed to the next acquisition cycle.


12 Experiment 5.K – Ranking Senses Automatically

The first sense heuristic, which is often used as a baseline for supervised wsd systems, frequently outperforms wsd systems even when they take surrounding context into account. Whilst a first sense heuristic based on a sense-tagged corpus such as SemCor is clearly useful, there is a strong case for obtaining a first, or predominant, sense from untagged corpus data, so that the wsd system can be tuned to the genre or domain at hand.

In working paper 5.10 we give the details of a method that ranks WordNet senses according to prevalence using only raw text. The method uses the neighbours, and associated distributional similarity scores, provided by an automatically acquired thesaurus. The thesaurus is acquired from grammatical relation data obtained by parsing a corpus. The quantity and distributional similarity scores of the neighbours pertaining to different senses will reflect the dominance of the sense to which they pertain. This is because there will be more relational data for the more prevalent senses compared to the less frequent senses. We relate the neighbours to the WordNet senses with a WordNet similarity measure.
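The sketch below illustrates the ranking idea as we read the description above; it is not the exact weighting used in working paper 5.10. Each neighbour votes for the senses of the target word in proportion to its WordNet similarity to them, weighted by its distributional similarity score. NLTK's WordNet interface and jcn similarity stand in for the measures used in the experiments: the information content file name is an NLTK packaging detail, and the neighbour list and its scores are toy values rather than output of the acquired thesaurus.

```python
# Sketch of prevalence ranking: distributional neighbours vote for the senses
# of the target word, each vote weighted by the neighbour's distributional
# similarity and shared across senses according to WordNet (jcn) similarity.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

ic = wordnet_ic.ic("ic-brown.dat")   # information content file shipped with NLTK

def wn_similarity(sense, neighbour):
    """Max jcn similarity between a target sense and any noun sense of the neighbour."""
    scores = []
    for ns in wn.synsets(neighbour, pos=wn.NOUN):
        try:
            scores.append(sense.jcn_similarity(ns, ic))
        except Exception:            # e.g. no shared information content
            continue
    return max(scores, default=0.0)

def rank_senses(word, neighbours):
    """neighbours: list of (neighbour lemma, distributional similarity score)."""
    senses = wn.synsets(word, pos=wn.NOUN)
    scores = dict.fromkeys(senses, 0.0)
    for nbr, dss in neighbours:
        sims = {s: wn_similarity(s, nbr) for s in senses}
        total = sum(sims.values())
        if total == 0.0:
            continue
        for s in senses:
            scores[s] += dss * sims[s] / total   # neighbour's vote, shared across senses
    return sorted(scores.items(), key=lambda x: -x[1])

if __name__ == "__main__":
    nbrs = [("trumpet", 0.25), ("flute", 0.22), ("surgery", 0.08)]
    for sense, score in rank_senses("pipe", nbrs):
        print(round(score, 3), sense.name(), sense.definition()[:50])
```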

In this experiment we obtained a ranking of WordNet noun senses using a thesaurus acquired from parsed data from the bnc. We evaluated the accuracy of our ranking method by determining how often the first sense in SemCor is the one provided by the automatic ranking. We experimented with six of the WordNet similarity measures (see working paper 5.11). The Jiang-Conrath measure was one of the best performing measures, and is efficient given the precompilation of the required frequency files. The random baseline for choosing the predominant sense over all these words, $\frac{1}{|Words|} \sum_{w \in Words} \frac{1}{|senses_w|}$, is 32%. All WordNet similarity measures beat this baseline, with jcn giving 54% accuracy. Furthermore, many of the errors can be explained by differences in the sense distributions of the bnc and SemCor. In work package 6, experiment K, we summarise the results obtained when using the first sense acquired from the automatic rankings for disambiguating the nouns in SemCor and the Senseval-2 English all-words corpus. As well as being useful for determining the top-ranking senses of a word, our method for ranking senses is also useful for identifying infrequent and potentially redundant word senses. We experimented with a threshold on the ranking score which cuts out a constant percentage of the sense types falling below it. We show that the majority of the sense types filtered out did not occur in SemCor (57% for jcn, which is above the baseline of 38%).

In working paper 5.10 we also show that the method for ranking WordNet senses can be applied to two domain-specific corpora, finance and sport, and that the predominant senses found by this method are appropriate given the corpora. We are planning an extensive quantitative evaluation in the near future.


References

[Agirre and Lopez2003] E. Agirre and O. Lopez. Clustering WordNet word senses. In Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP'03), 2003.

[Agirre and Martinez2000] E. Agirre and D. Martinez. Exploring automatic word sense disambiguation with decision lists and the Web. In Proceedings of the COLING Workshop on Semantic Annotation and Intelligent Annotation, Luxembourg, 2000.

[Agirre et al.2000] E. Agirre, O. Ansa, D. Martinez, and E. Hovy. Enriching very large ontologies with topic signatures. In Proceedings of the ECAI'00 Workshop on Ontology Learning, Berlin, Germany, 2000.

[Agirre et al.2001] E. Agirre, O. Ansa, D. Martinez, and E. Hovy. Enriching WordNet concepts with topic signatures. In Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, Pittsburgh, 2001.

[Agirre et al.2003] E. Agirre, I. Aldezabal, and E. Pociello. A pilot study of English selectional preferences and their cross-lingual compatibility with Basque. In Proceedings of the International Conference on Text, Speech and Dialogue (TSD'2003), České Budějovice, Czech Republic, 2003.

[Agirre et al.2004a] E. Agirre, E. Alfonseca, and O. Lopez. Approximating hierarchy-based similarity for WordNet nominal synsets using topic signatures. In Proceedings of the 2nd Global WordNet Conference, 2004.

[Agirre et al.2004b] E. Agirre, A. Atutxa, K. Gojenola, and K. Sarasola. Exploring portability of syntactic information from English to Basque. In Proceedings of LREC, Lisbon, Portugal, 2004.

[Alfonseca and Manandhar2002] E. Alfonseca and S. Manandhar. An unsupervised method for general named entity recognition and automated concept discovery. In Proceedings of the 1st International Conference on General WordNet, Mysore, India, 2002.

[Atserias et al.2001] J. Atserias, L. Padró, and G. Rigau. Integrating multiple knowledge sources for robust semantic parsing. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP'01), Tzigov Chark, Bulgaria, 2001.

[Atserias et al.2003] Jordi Atserias, Mauro Castillo, Francis Real, Horacio Rodríguez, and German Rigau. Exploring large-scale acquisition of multilingual semantic models for predicates. In Proceedings of SEPLN'03, pages 39–46, Alcalá de Henares, Spain, September 2003. ISSN 1136-5948.


[Atserias et al.2004] Jordi Atserias, Luís Villarejo, German Rigau, Eneko Agirre, John Carroll, Bernardo Magnini, and Piek Vossen. The MEANING multilingual central repository. In Proceedings of the Second International Global WordNet Conference (GWC'04), Brno, Czech Republic, January 2004. ISBN 80-210-3302-9.

[Briscoe and Carroll2002] E. Briscoe and J. Carroll. Robust accurate statistical annotation of general text. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas, Canary Islands, 2002.

[Chugur and Gonzalo2002] I. Chugur and J. Gonzalo. A study of polysemy and sense proximity in the Senseval-2 test suite. In Proceedings of the ACL'02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, PA, USA, 2002.

[Karypis2001] G. Karypis. CLUTO: a clustering toolkit. Technical report, Department of Computer Science, University of Minnesota, Minneapolis, 2001.

[Leacock et al.1998] C. Leacock, M. Chodorow, and G. Miller. Using Corpus Statistics and WordNet Relations for Sense Identification. Computational Linguistics, 24(1):147–166, 1998.

[Lin and Hovy2000] C. Lin and E. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th International Conference on Computational Linguistics (COLING'00), Strasbourg, France, 2000.

[Lin1998] D. Lin. Automatic retrieval and clustering of similar words. In Proceedings of COLING-ACL'98, Montreal, Canada, 1998.

[McCarthy et al.2003] D. McCarthy, B. Keller, and J. Carroll. Detecting a continuum of compositionality in phrasal verbs. In Proceedings of the ACL-SIGLEX Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, Sapporo, Japan, 2003.

[Mihalcea and Moldovan1999] R. Mihalcea and D. Moldovan. An Automatic Method for Generating Sense Tagged Corpora. In Proceedings of the 16th National Conference on Artificial Intelligence. AAAI Press, 1999.

[Ng1997] H. Ng. Getting Serious about Word Sense Disambiguation. In Proceedings of the SIGLEX Workshop, 1997.

[Widdows2003] D. Widdows. Orthogonal negation in vector spaces for modelling word-meanings and document retrieval. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL'2003), Sapporo, Japan, 2003.

[Zhao and Karypis2001] Y. Zhao and G. Karypis. Criterion functions for document clustering: Experiments and analysis. Technical report, Department of Computer Science, University of Minnesota, Minneapolis, 2001.
