OpenLogos Semantico-Syntactic Knowledge-Rich Bilingual Dictionaries


Upload: anabela-barreiro

Post on 05-Dec-2014


Anabela Barreiro¹, Fernando Batista¹,², Ricardo Ribeiro¹,², Helena Moniz¹,³, Isabel Trancoso¹,⁴

¹INESC-ID, ²ISCTE-IUL, ³FLUL/CLUL, ⁴IST
{abarreiro;fmmb;rdmr;helenam;imt}@l2f.inesc-id.pt

http://www.l2f.inesc-id.pt/

Characteristics
– Representation schema with eclectic categories
– Designed to work in concert with the lexical resources and linguistic rules (transfer (TRAN) and semantico-syntactic (SEMTAB) rules)
– Easy mapping from natural to symbolic language, representing both meaning and structure in a continuum, undissociated, represented in the same layer, based on the belief that the semantics of a word often affects the surrounding syntax
– Extensible system, designed so that developers would expand and add to its capabilities
– Initially developed for English, but many of its elements are universal (mostly nouns, adjectives, and adverbs) and applicable to other languages

Representation
– SAL knowledge is embedded in the dictionary in the form of numeric codes (SAL mnemonics are used for easier understanding)
  • E.g. the noun (N) "table" has two SAL representations:
    – COsurf – concrete, surface
    – INdata – information, recorded data
– Nouns have 12 supersets. The superset measure (ME) has 3 sets and 11 subsets, listed in the ME mnemonics table.
  • SAL codes for nouns represent semantic groupings and are language independent, as concepts are transverse across languages
– Verbs are subdivided into 3 types: intransitive, weak transitive and strong transitive. Intransitive verbs have 3 supersets: motional (INMO), operational (INOP), and existential (INEX)
  • Existential intransitive verbs include be and be-substitutes that take predicate nouns and adjectives
– Adjectives are classified into 2 types: descriptive and participial, sub-classified according to their syntactic relationships with other words
  • syntactic patterns for the descriptive pre-clausal good-type adjectives

 

1 Introduction
– OpenLogos (OL) is the open source derivative of the Logos machine translation (MT) system
– OL's strength resides in its lexical resources, the knowledge-rich bilingual dictionaries, which
  • contain semantico-syntactic knowledge and ontological relations for all lexical entries, represented at an abstract/higher level by the Semantico-Syntactic Abstraction Language (SAL)
  • present other idiosyncrasies that distinguish them from other publicly available dictionaries

Motivation
– OL resources were used successfully in the Logos commercial MT product for 2-3 decades
  • validated by the Logos development team and clients
– Possible applications
  • basis for new linguistic and NLP tools, especially for poorly resourced languages
  • enhancement of other MT systems

Bilingual Dictionaries: EN > GE/FR/IT
– Verbs, nouns and adjectives are clearly the most represented classes: together they make up more than 95% of each dictionary, and each dictionary holds more than 80,000 entries.
– Dictionaries stored in self-contained XML files
  • easily addressed by small programs
  • supported by existing efficient XML APIs
– Example for the verb entry depart, extracted from the English-French dictionary

2 Semantico-Syntactic Knowledge

– Part-of-speech (POS)
– Gender (GEN)
– Number (NUM)
– Morphological paradigms (PAT) for source and target words
  • make it possible to map inflected forms across languages and improve agreement in SMT
– Head word (HEAD) in multiword
  • useful to correct MT problems related to agreement within multiwords or within larger units (e.g. between nominal multiwords and verb, or agreement within verbal multiwords)
– Homographs (HOMO)
  • homographs are a major source of translation errors and their identification is crucial
– Auxiliary (AUX)
  • helps improve precision in the translation when auxiliary choice is subtle
– Alternate word (ALT)
  • nominalization (process noun), predicate adjective, etc.; useful for paraphrasing purposes
– Causative verb (CAUS)
– Reflexive verb (REFL)
– Aspectual verb (ASP)
– Semantico-Syntactic Knowledge (SAL)
  • interlingua-style hierarchical taxonomy with over 1,000 elements, embracing all POS
  • 3 levels of representation: superset (SUPER), set (SET), and subset (SUB), embedded in the dictionary entries and in the translation system's rules (help with disambiguation). E.g. pipe, hose (hierarchy shown in the SAL figure)

3 OpenLogos Data

5 Conclusions and Future Work
– Three bilingual dictionaries were created
  • English-French; English-German; English-Italian
  • online and free for research purposes
    – http://metanet4u.l2f.inesc-id.pt/
– The resources contain semantico-syntactic knowledge concerning the conceptual formalization of things, ideas, relationships, dispositions, conditions, processes, etc.
  • valuable for MT and other NLP applications
  • stored in XML format for easy processing
– In the future, we will make available three complementary bilingual dictionaries
  • English-Portuguese; English-Spanish; German-English

Acknowledgments
– This work was supported by national funds through Fundação para a Ciência e a Tecnologia, under grants SFRH/BPD/91446/2012 and SFRH/BPD/95849/2013 and project PEst-OE/EEI/LA0021/2013

Instituto de Engenharia de Sistemas e Computadores, Investigação e Desenvolvimento em Lisboa
Laboratório de Sistemas de Língua Falada

4 Resulting Resources

Word class                          id   EN-GE   EN-FR   EN-IT
Noun                                 1   28266   25910   23505
Verb                                 2   33855   33354   33021
Adverb (locative)                    3     465     442     450
Adjective                            4   21219   20749   20518
Pronoun                              5     121     121     121
Adverb (manner, agency, degree)      6    2207    2167    2173
Preposition (non-locative)          11     140     140     139
Auxiliary and Modal                 12      34      34      34
Preposition (locative)              13     148     148     148
Definite Article                    14     194     194     189
Indefinite Article                  15      66      66      65
Arithmate in Apposition             16     208     208     203
Negative                            17       2       2       2
Relative and Interrogative Pronoun  18      23      23      20
Conjunction                         19     160     160     160
Punctuation                         20      30      30      30
Total                                    87138   83748   80778
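The totals in the table, and the weight of the three largest word classes, can be checked with a few lines of Python. This is only a sketch: the counts are copied from the table above, while the names `COUNTS`, `totals` and `share_of_major_classes` are illustrative and not part of the OpenLogos distribution.

```python
# Entry counts per word class, copied from the table above.
# Tuple order: EN-GE, EN-FR, EN-IT. All names are illustrative.
PAIRS = ("EN-GE", "EN-FR", "EN-IT")
COUNTS = {
    "Noun": (28266, 25910, 23505),
    "Verb": (33855, 33354, 33021),
    "Adverb (locative)": (465, 442, 450),
    "Adjective": (21219, 20749, 20518),
    "Pronoun": (121, 121, 121),
    "Adverb (manner, agency, degree)": (2207, 2167, 2173),
    "Preposition (non-locative)": (140, 140, 139),
    "Auxiliary and Modal": (34, 34, 34),
    "Preposition (locative)": (148, 148, 148),
    "Definite Article": (194, 194, 189),
    "Indefinite Article": (66, 66, 65),
    "Arithmate in Apposition": (208, 208, 203),
    "Negative": (2, 2, 2),
    "Relative and Interrogative Pronoun": (23, 23, 20),
    "Conjunction": (160, 160, 160),
    "Punctuation": (30, 30, 30),
}
MAJOR = ("Noun", "Verb", "Adjective")


def totals():
    """Sum every word class for each language pair."""
    return tuple(sum(row[i] for row in COUNTS.values()) for i in range(len(PAIRS)))


def share_of_major_classes():
    """Fraction of each dictionary made up of nouns, verbs and adjectives."""
    tot = totals()
    return tuple(
        sum(COUNTS[name][i] for name in MAJOR) / tot[i] for i in range(len(PAIRS))
    )


if __name__ == "__main__":
    for pair, total, share in zip(PAIRS, totals(), share_of_major_classes()):
        print(f"{pair}: {total} entries, {share:.1%} nouns/verbs/adjectives")
```

The computed sums match the Total row, and nouns, verbs and adjectives together account for more than 95% of each dictionary.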

Figure: SAL hierarchy for pipe, hose. Word class: nouns > superset: concrete > set: functionals > subset: conduits (neighbouring subsets: barriers, containers, …)

<Entry source="depart" target="quitter">
    <source head_word="1" homograph="no" word_type="01">
        <pos description="Verb" wclass="02"/>
        <morphology>
            <inflection description="like walk, walked, walking" example="walk" id="1"/>
        </morphology>
        <sal code="13,98,596" description="create, etc." mnemonic="generictransitive4" set="other98"/>
    </source>
    <target aux="1" head_word="1" word_type="01">
        <pos description="Verb" wclass="02"/>
        <morphology>
            <inflection description="regular ending in -er: parler" example="parler" id="3"/>
        </morphology>
    </target>
</Entry>
<Entry source="depart" target="partir">
    <source head_word="1" homograph="no" word_type="01">
        <pos description="Verb" wclass="02"/>
        <morphology>
            <inflection description="like walk, walked, walking" example="walk" id="1"/>
        </morphology>
        <sal code="10,24,596" description="from = away from, off of, out of" set="governsawayfrom"/>
    </source>
    <target aux="2" head_word="1" word_type="01">
        <pos description="Verb" wclass="02"/>
        <morphology>
            <inflection description="Irreg. in -ir with shortened stem ..." example="partir" id="12"/>
        </morphology>
    </target>
</Entry>
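Entries of this shape can be read with any standard XML API. The sketch below uses Python's `xml.etree.ElementTree` on a trimmed copy of the two depart entries. Several things are assumptions made for illustration: the `<Dictionary>` wrapper element (the real root element is not shown on the poster), the helper name `translations`, and the decision to split the `code` attribute into three integers without interpreting them.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the two "depart" entries shown above. The <Dictionary>
# wrapper is hypothetical: the actual root element is not shown in the source.
SAMPLE = """<Dictionary>
  <Entry source="depart" target="quitter">
    <source head_word="1" homograph="no" word_type="01">
      <pos description="Verb" wclass="02"/>
      <sal code="13,98,596" description="create, etc."
           mnemonic="generictransitive4" set="other98"/>
    </source>
    <target aux="1" head_word="1" word_type="01">
      <pos description="Verb" wclass="02"/>
    </target>
  </Entry>
  <Entry source="depart" target="partir">
    <source head_word="1" homograph="no" word_type="01">
      <pos description="Verb" wclass="02"/>
      <sal code="10,24,596" description="from = away from, off of, out of"
           set="governsawayfrom"/>
    </source>
    <target aux="2" head_word="1" word_type="01">
      <pos description="Verb" wclass="02"/>
    </target>
  </Entry>
</Dictionary>"""


def translations(xml_text, word):
    """Collect target word, auxiliary flag and SAL data for one source word."""
    root = ET.fromstring(xml_text)
    results = []
    for entry in root.iter("Entry"):
        if entry.get("source") != word:
            continue
        sal = entry.find("source/sal")
        target = entry.find("target")
        results.append({
            "target": entry.get("target"),
            "aux": target.get("aux"),
            # The comma-separated code is kept as a tuple of integers;
            # what each position means is not assumed here.
            "sal_code": tuple(int(part) for part in sal.get("code").split(",")),
            "sal_set": sal.get("set"),
        })
    return results


if __name__ == "__main__":
    for item in translations(SAMPLE, "depart"):
        print(item)
```

The two readings of depart differ in both SAL code and `aux` value, which is the kind of information a disambiguation step could key on.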

Mnemonic          Example Verb     Example Sentence
INEXbe-type       be               She was at the seashore all summer.
INEXbecome-type   become, remain   He became a doctor at a very young age.
INEXgrow-type     sound, look      Their voices sounded cheerful.
INEXseem-type     seem, appear     He seemed happy with the results.

Mnemonic      Description                     Examples
MEabs         abstract measurable concepts    humidity, length
MEdis         discrete measurable concepts    sum, increment
MEunit        units of measure                See subsets
MEunitwt      units of weight                 ounce, pound
MEunitvel     units of velocity               mph, megahertz
MEunitvol     units of volume measure         gallon, liter
MEunittemp    units of temperature            degrees celsius
MEunitener    units of energy/force           watt, horsepower
MEunitsys     measurement systems             fahrenheit, kelvin
MEunitdur     units of duration               hour, year
MEunitspec    specialized units of measure    oersted, ohm
MEunitvalue   units of money/value            dollar, euro
MEunitlin     units of linear/area measure    inch, mille
MEundif       undifferentiated measure        degree, share
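In this table the mnemonics appear to mirror the hierarchy: every entry starts with the superset prefix ME, and the unit subsets extend the name MEunit. The Python sketch below groups the mnemonics on that basis. The prefix rule is an assumption read off the table, not a documented OpenLogos convention, and `parent_of` and `children_of` are hypothetical helpers.

```python
# Mnemonics copied from the ME table above.
ME_MNEMONICS = [
    "MEabs", "MEdis", "MEunit", "MEunitwt", "MEunitvel", "MEunitvol",
    "MEunittemp", "MEunitener", "MEunitsys", "MEunitdur", "MEunitspec",
    "MEunitvalue", "MEunitlin", "MEundif",
]
SUPERSET = "ME"


def parent_of(mnemonic, known=ME_MNEMONICS):
    """Return the longest known mnemonic that is a proper prefix of `mnemonic`.

    Falls back to the superset prefix when no listed mnemonic is a prefix.
    Assumption: name prefixes encode the hierarchy, as the table suggests.
    """
    candidates = [k for k in known if k != mnemonic and mnemonic.startswith(k)]
    return max(candidates, key=len) if candidates else SUPERSET


def children_of(parent, known=ME_MNEMONICS):
    """List the mnemonics whose direct parent is `parent`, in table order."""
    return [m for m in known if parent_of(m, known) == parent]


if __name__ == "__main__":
    print(children_of("ME"))
    print(children_of("MEunit"))
```

Under this reading, the ten unit mnemonics group under MEunit, while MEabs, MEdis, MEunit and MEundif attach directly to the ME superset.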

Pattern                  Example Sentence
It is ADJ that           It is silly that...
It is ADJ for NP that    It is good for the employees that...
It is ADJ to VP          It is smart to exercise.
It is ADJ for NP to VP   It was silly for them to expect...
It is ADJ V'ing          It is smart doing the right thing.
NP is ADJ to VP          John is smart to exercise.