openlogos semantico-syntactic knowledge-rich bilingual dictionaries
OpenLogos Semantico-Syntactic Knowledge-Rich Bilingual Dictionaries
Anabela Barreiro1, Fernando Batista1,2, Ricardo Ribeiro1,2, Helena Moniz1,3, Isabel Trancoso1,4
1INESC-ID, 2ISCTE-IUL, 3FLUL/CLUL, 4IST
{abarreiro;fmmb;rdmr;helenam;imt}@l2f.inesc-id.pt
http://www.l2f.inesc-id.pt/
Characteristics
– Representation schema with eclectic categories
– Designed to work in concert with the lexical resources and the linguistic rules: transfer (TRAN) and semantico-syntactic (SEMTAB) rules
– Easy mapping from natural to symbolic language: meaning and structure are represented together, as a continuum in a single layer, on the premise that the semantics of a word often affects the surrounding syntax
– Extensible system, designed so that developers can expand it and add to its capabilities
– Initially developed for English, but many of its elements (mostly nouns, adjectives, and adverbs) are universal and applicable to other languages
Representation
– SAL knowledge is embedded in the dictionary in the form of numeric codes (SAL mnemonics are used for easier understanding)
  • E.g. the noun (N) "table" has two SAL representations:
    – COsurf: concrete, surface
    – INdata: information, recorded data
– Nouns have 12 supersets. The superset measure (ME) has 3 sets and 11 subsets (see the table of ME mnemonics)
  • SAL codes for nouns represent semantic groupings and are language independent, since concepts cut across languages
– Verbs are subdivided into 3 types: intransitive, weak transitive, and strong transitive. Intransitive verbs have 3 supersets: motional (INMO), operational (INOP), and existential (INEX)
  • Existential intransitive verbs include be and be-substitutes that take predicate nouns and adjectives
– Adjectives are classified into 2 types, descriptive and participial, and sub-classified according to their syntactic relationships with other words
  • Syntactic patterns for the descriptive pre-clausal good-type adjectives are listed in the pattern table
Introduction
– OpenLogos (OL) is the open-source derivative of the Logos machine translation (MT) system
– OL's strength resides in its lexical resources, the knowledge-rich bilingual dictionaries, which
  • contain semantico-syntactic knowledge and ontological relations for all lexical entries, represented at an abstract, higher level by the Semantico-Syntactic Abstraction Language (SAL)
  • present other idiosyncrasies that distinguish them from other publicly available dictionaries
Motivation
– OL resources were used successfully in the Logos commercial MT product for two to three decades
  • validated by the Logos development team and clients
– Possible applications
  • basis for new linguistic and NLP tools, especially for under-resourced languages
  • enhancement of other MT systems
Bilingual Dictionaries: EN > GE/FR/IT
– Verbs, nouns, and adjectives are clearly the most represented classes: together they account for about 77,000 (EN-IT) to 83,000 (EN-GE) entries, over 95% of each dictionary.
– Dictionaries are stored in self-contained XML files
  • easily addressed by small programs
  • supported by existing efficient XML APIs
– Example for the verb entry depart, extracted from the English-French dictionary
Semantico-Syntactic Knowledge
– Part-of-speech (POS)
– Gender (GEN)
– Number (NUM)
– Morphological paradigms (PAT) for source and target words
  • make it possible to map inflected forms across languages and improve agreement in statistical MT (SMT)
– Head word (HEAD) in multiword
  • useful to correct MT problems related to agreement within multiwords or within larger units (e.g. between nominal multiwords and the verb, or within verbal multiwords)
– Homographs (HOMO)
  • homographs are a major source of translation errors, so identifying them is crucial
– Auxiliary (AUX)
  • helps improve translation precision when the choice of auxiliary is subtle
– Alternate word (ALT)
  • nominalization (process noun), predicate adjective, etc.; useful for paraphrasing
– Causative verb (CAUS)
– Reflexive verb (REFL)
– Aspectual verb (ASP)
– Semantico-Syntactic Knowledge (SAL)
  • interlingua-style hierarchical taxonomy with over 1,000 elements, embracing all POS
  • 3 levels of representation: superset (SUPER), set (SET), and subset (SUB), embedded in the dictionary entries and in the translation system's rules, where they help with disambiguation. E.g. pipe and hose are classified as nouns > concrete > functionals > conduits
OpenLogos Data
– Three bilingual dictionaries were created
  • English-French; English-German; English-Italian
  • online and free for research purposes: http://metanet4u.l2f.inesc-id.pt/
– The resources contain semantico-syntactic knowledge concerning the conceptual formalization of things, ideas, relationships, dispositions, conditions, processes, etc.
  • valuable for MT and other NLP applications
  • stored in XML format for easy processing
– In the future, we will make available three complementary bilingual dictionaries
  • English-Portuguese; English-Spanish; German-English
Acknowledgments
– This work was supported by national funds through Fundação para a Ciência e a Tecnologia, under grants SFRH/BPD/91446/2012 and SFRH/BPD/95849/2013 and project PEst-OE/EEI/LA0021/2013
Conclusions and Future Work
Instituto de Engenharia de Sistemas e Computadores, Investigação e Desenvolvimento em Lisboa (INESC-ID)
Laboratório de Sistemas de Língua Falada (L2F)
Resulting Resources
Word class                           id    EN-GE    EN-FR    EN-IT
Noun                                  1    28266    25910    23505
Verb                                  2    33855    33354    33021
Adverb (locative)                     3      465      442      450
Adjective                             4    21219    20749    20518
Pronoun                               5      121      121      121
Adverb (manner, agency, degree)       6     2207     2167     2173
Preposition (non-locative)           11      140      140      139
Auxiliary and Modal                  12       34       34       34
Preposition (locative)               13      148      148      148
Definite Article                     14      194      194      189
Indefinite Article                   15       66       66       65
Arithmate in Apposition              16      208      208      203
Negative                             17        2        2        2
Relative and Interrogative Pronoun   18       23       23       20
Conjunction                          19      160      160      160
Punctuation                          20       30       30       30
Total                                      87138    83748    80778
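As a quick check on how the table is dominated by the open word classes, the following sketch recomputes the combined share of nouns, verbs, and adjectives for each language pair. The counts are copied from the table; the dictionary layout and the function name are illustrative choices, not part of OpenLogos.

```python
# Entry counts copied from the table above (three largest word classes plus totals).
COUNTS = {
    "EN-GE": {"noun": 28266, "verb": 33855, "adjective": 21219, "total": 87138},
    "EN-FR": {"noun": 25910, "verb": 33354, "adjective": 20749, "total": 83748},
    "EN-IT": {"noun": 23505, "verb": 33021, "adjective": 20518, "total": 80778},
}


def open_class_share(pair):
    """Return (noun + verb + adjective entries, their fraction of the dictionary)."""
    row = COUNTS[pair]
    combined = row["noun"] + row["verb"] + row["adjective"]
    return combined, combined / row["total"]


for pair in COUNTS:
    combined, share = open_class_share(pair)
    print(f"{pair}: {combined} entries, {share:.1%} of the dictionary")
```

With these figures the three classes come to 83340, 80013, and 77044 entries respectively, between 95% and 96% of each dictionary.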
[Figure: SAL hierarchy for pipe, hose. Word class: nouns; superset: concrete; set: functionals; subset: conduits (sibling subsets: barriers, containers, ...)]
<Entry source="depart" target="quitter">
  <source head_word="1" homograph="no" word_type="01">
    <pos description="Verb" wclass="02"/>
    <morphology>
      <inflection description="like walk, walked, walking" example="walk" id="1"/>
    </morphology>
    <sal code="13,98,596" description="create, etc." mnemonic="generictransitive4" set="other98"/>
  </source>
  <target aux="1" head_word="1" word_type="01">
    <pos description="Verb" wclass="02"/>
    <morphology>
      <inflection description="regular ending in -er: parler" example="parler" id="3"/>
    </morphology>
  </target>
</Entry>
<Entry source="depart" target="partir">
  <source head_word="1" homograph="no" word_type="01">
    <pos description="Verb" wclass="02"/>
    <morphology>
      <inflection description="like walk, walked, walking" example="walk" id="1"/>
    </morphology>
    <sal code="10,24,596" description="from = away from, off of, out of" set="governsawayfrom"/>
  </source>
  <target aux="2" head_word="1" word_type="01">
    <pos description="Verb" wclass="02"/>
    <morphology>
      <inflection description="Irreg. in -ir with shortened stem ..." example="partir" id="12"/>
    </morphology>
  </target>
</Entry>
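The entries above can be read with a standard XML API. The following is a minimal sketch using Python's xml.etree.ElementTree, under two assumptions that the poster does not confirm: the <Dictionary> root element is a placeholder (the real root name is not shown), and the three comma-separated numbers in the sal code attribute are taken to be superset, set, and subset, in that order. The sample is trimmed to the attributes the sketch reads.

```python
import xml.etree.ElementTree as ET

# Two entries for "depart", trimmed from the example above.
# Assumption: <Dictionary> is a placeholder root; the real root element is not shown.
SAMPLE = """
<Dictionary>
  <Entry source="depart" target="quitter">
    <source head_word="1" homograph="no" word_type="01">
      <pos description="Verb" wclass="02"/>
      <sal code="13,98,596" description="create, etc." set="other98"/>
    </source>
    <target aux="1" head_word="1" word_type="01">
      <pos description="Verb" wclass="02"/>
    </target>
  </Entry>
  <Entry source="depart" target="partir">
    <source head_word="1" homograph="no" word_type="01">
      <pos description="Verb" wclass="02"/>
      <sal code="10,24,596" description="from = away from, off of, out of" set="governsawayfrom"/>
    </source>
    <target aux="2" head_word="1" word_type="01">
      <pos description="Verb" wclass="02"/>
    </target>
  </Entry>
</Dictionary>
"""


def read_entries(xml_text):
    """Return one dict per <Entry>, holding the fields used for lookup."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.iter("Entry"):
        sal = entry.find("source/sal")
        # Assumption: the code triple is ordered superset, set, subset.
        superset, sal_set, subset = (int(n) for n in sal.get("code").split(","))
        entries.append({
            "source": entry.get("source"),
            "target": entry.get("target"),
            "pos": entry.find("source/pos").get("description"),
            "aux": entry.find("target").get("aux"),
            "sal": {"superset": superset, "set": sal_set, "subset": subset},
            "sal_description": sal.get("description"),
        })
    return entries


for e in read_entries(SAMPLE):
    print(e["source"], "->", e["target"], "| aux", e["aux"], "|", e["sal_description"])
```

Both translations share the source word depart; it is the SAL information on the source side that separates the transitive reading (quitter) from the "away from" reading (partir).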
Mnemonic           Example verbs     Example sentence
INEXbe-type        be                She was at the seashore all summer.
INEXbecome-type    become, remain    He became a doctor at a very young age.
INEXgrow-type      sound, look       Their voices sounded cheerful.
INEXseem-type      seem, appear      He seemed happy with the results.
Mnemonic       Description                      Examples
MEabs          abstract measurable concepts     humidity, length
MEdis          discrete measurable concepts     sum, increment
MEunit         units of measure                 see subsets
MEunitwt       units of weight                  ounce, pound
MEunitvel      units of velocity                mph, megahertz
MEunitvol      units of volume measure          gallon, liter
MEunittemp     units of temperature             degrees celsius
MEunitener     units of energy/force            watt, horsepower
MEunitsys      measurement systems              fahrenheit, kelvin
MEunitdur      units of duration                hour, year
MEunitspec     specialized units of measure     oersted, ohm
MEunitvalue    units of money/value             dollar, euro
MEunitlin      units of linear/area measure     inch, mile
MEundif        undifferentiated measure         degree, share
Pattern                    Example sentence
It is ADJ that             It is silly that...
It is ADJ for NP that      It is good for the employees that...
It is ADJ to VP            It is smart to exercise.
It is ADJ for NP to VP     It was silly for them to expect...
It is ADJ V'ing            It is smart doing the right thing.
NP is ADJ to VP            John is smart to exercise.
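One simple way to work with these patterns is to treat them as templates whose slots (ADJ, NP, VP, V'ing) are filled with words or phrases. The sketch below is only an illustration of that idea: the pattern strings come from the table, while the fill function and the Ving keyword are hypothetical names introduced here, not part of OpenLogos.

```python
import re

# Patterns copied from the table above.
PATTERNS = [
    "It is ADJ that",
    "It is ADJ for NP that",
    "It is ADJ to VP",
    "It is ADJ for NP to VP",
    "It is ADJ V'ing",
    "NP is ADJ to VP",
]

# Slot names used in the patterns; V'ing is listed first so it is matched whole.
SLOT = re.compile(r"V'ing|ADJ|NP|VP")


def fill(pattern, **slots):
    """Replace each slot name in the pattern with the supplied word or phrase."""
    def substitute(match):
        # "V'ing" is not a valid keyword name, so callers pass it as Ving.
        key = "Ving" if match.group(0) == "V'ing" else match.group(0)
        return slots[key]
    return SLOT.sub(substitute, pattern)


print(fill("It is ADJ to VP", ADJ="smart", VP="exercise"))
print(fill("NP is ADJ to VP", NP="John", ADJ="smart", VP="exercise"))
```

The two calls reproduce the table's example sentences for those patterns, minus the final period.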