Integrating Semantic Dictionaries for English, French and Bulgarian
into the NooJ System for the Purposes of Information Retrieval
Svetla Koeva, Max Silberztein
8th INTEX / NooJ Workshop,
30 May, 2005
Main research goals
• To provide a methodology for implementing natural-language semantic relations in the NooJ system:
  – to create specialized semantic dictionaries for English, French and Bulgarian based on WordNet semantic relations;
  – to provide a complete formalization of the inflection of the simple and compound words included in the WordNet structure.
History
• The integration of semantic relations into the INTEX system was initially proposed at the sixth INTEX workshop.
• The idea was later developed in the joint RILA research project "Information retrieval based on semantic relations" between:
  – LASELDI, Université de Franche-Comté;
  – Department of Computational Linguistics, IBL, Bulgarian Academy of Sciences.
Language resources
• Bulgarian grammatical dictionary (BGD) – over 83 000 lemmas and 1 100 000 word forms;
• English WordNet 2.0 – 115 424 synonym sets;
• Bulgarian WordNet (BalkaNet project) – 22 867 synonym sets;
• French WordNet (EuroWordNet project) – 33 512 synonym sets;
• English dictionary – over 30 000 lemmas (not inflected);
• French dictionary – extracted with INTEX.
Implementation tasks
• To transform the format of the BGD into the NooJ standard;
• To create semantic dictionaries for Bulgarian and English;
• To associate lemmas from the Bulgarian semantic dictionaries with the corresponding inflection types;
• To add missing lemmas and inflection types in BGD, if any;
• To create extensive dictionaries and corresponding inflection types for compounds.
BGD – Information structure design
• Category information – 6 classes: Noun, Verb, Adjective, Pronoun, Numeral, Others (Adverb, Preposition, Conjunction, Particle, Interjection);
• Paradigmatic information – Personal, Transitive, Perfective, Common, …;
• Grammatical information – Inflection, Conjugation, Sound alternations, ….
BGD – Grammatical subclasses
• Nouns – 22 subclasses with respect to their Type (Common, Proper, Singularia tantum, Pluralia tantum) and Gender;
• Verbs – 32 subclasses with respect to Transitivity, Perfectiveness, and Personality;
• Adjectives – 2 subclasses;
• Pronouns – 26 subclasses with respect to their Type and Possessor;
• Numerals – 6 subclasses.
BGD – Grammatical types
• Noun – Number, Definiteness, Counting form, Case, Optional forms – 266 types;
• Verb – Person, Number, Tense, Mood, Voice, Participles, Gender, Definiteness – 257 types;
• Adjective – Gender, Number, Definiteness – 30 types;
• Pronoun – Gender, Person, Number, Definiteness, Case, Clitic, Possessing – 28 types;
• Numeral – Gender, Number, Definiteness, Approximate form, Male form – 20 types.
BGD – Dictionary format
Dictionary:
а,ЧА,0
абсол`ютен, ПРИ, 7
`август, С+М, 10
авиокомп`ания, С+Ж, 1
австр`ийски, ПРИ, 3
автоб`ус, С+М, 11
автомат`ичен, ПРИ, 7
адрес`ирам, Г+Н+Т, 4
агит`ирам, Г+Н+Т, 4

Grammatical type ПРИ, 7:
sm0, Ok, ''
smh, Ok, '2RCия'
sml, Ok, '2RCият'
sf0, Ok, '2RCа'
sfd, Ok, '2RCата'
sn0, Ok, '2RCо'
snd, Ok, '2RCото'
p0, Ok, '2RCи'
pd, Ok, '2RCите'
Transforming BGD
Dictionary + Grammatical types → Perl script (transliteration of labels) → NooJ dictionary
BGD entry → NooJ dictionary entry
абсол`ютен, ПРИ, 7 → абсолютен,A+FLX=A-7
`август, С+М, 10 → август,N+M+FLX=N_M-10
авиокомп`ания, С+Ж, 1 → авиокомпания,N+F+FLX=N_F-1
австр`ийски, ПРИ, 3 → австрийски,A+FLX=A-3
автоб`ус, С+М, 11 → автобус,N+M+FLX=N_M-11
автомат`ичен, ПРИ, 7 → автоматичен,A+FLX=A-7
адрес`ирам, Г+Н+Т, 4 → адресирам,V+IT+FLX=V_IT-4
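The conversion shown above can be sketched in a few lines. The slides state that the actual transformation was done with a Perl script; the Python below is only an illustrative reconstruction, and the function name `bgd_to_nooj` and the `LABELS` table are assumptions derived from the example entries, not the project's code.

```python
# Label transliteration, read off the example entries on this slide.
# The full BGD label set is larger; only the labels shown are covered here.
LABELS = {
    "ПРИ": "A",       # adjective
    "С+М": "N+M",     # noun, masculine
    "С+Ж": "N+F",     # noun, feminine
    "Г+Н+Т": "V+IT",  # verb label as transliterated on the slide
}


def bgd_to_nooj(entry: str) -> str:
    """Convert one 'lemma, label, type' BGD entry into a NooJ-style entry."""
    lemma, label, type_no = [field.strip() for field in entry.split(",")]
    lemma = lemma.replace("`", "")            # drop the stress mark
    tags = LABELS[label]                       # e.g. "N+M"
    paradigm = tags.replace("+", "_") + "-" + type_no  # e.g. "N_M-10"
    return f"{lemma},{tags}+FLX={paradigm}"


print(bgd_to_nooj("`август, С+М, 10"))
```

A label missing from `LABELS` raises `KeyError`, which makes gaps in the transliteration table visible instead of silently producing a malformed entry.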
NooJ formal descriptions
BGD grammatical type (ПРИ, 7) → NooJ description (A-7)
sm0, Ok, '' → A-7 = <E>/sm0 +
smh, Ok, '2RCия' → <L2><S><R>ия<S1>/smh +
sml, Ok, '2RCият' → <L2><S><R>ият<S1>/sml +
sf0, Ok, '2RCа' → <L2><S><R>а<S1>/sf0 +
sfd, Ok, '2RCата' → <L2><S><R>ата<S1>/sfd +
sn0, Ok, '2RCо' → <L2><S><R>о<S1>/sn0 +
snd, Ok, '2RCото' → <L2><S><R>ото<S1>/snd +
p0, Ok, '2RCи' → <L2><S><R>и<S1>/p0 +
pd, Ok, '2RCите' → <L2><S><R>ите<S1>/pd;
WordNet semantic relations
ILR                POS/POS          English WordNet 2.0   BulNet
HYPERONYMY         N/N V/V          94 844                15 838
NEAR ANTONYMY      N/N A/A V/V      7 642                 1 847
PART MERONYMY      N/N              8 636                 1 241
MEMBER MERONYMY    N/N              12 205                841
PORTION MERONYMY   N/N              787                   107
SUBEVENT           V/V              409                   162
CAUSES             V/V              439                   104
SIMILAR TO         A/A V/V          22 196                1 479
VERB GROUP         V/V              1 748                 848
ALSO SEE           A/A V/V          3 240                 895
Other relations
ILR                POS/POS            English WordNet 2.0   BulNet
BE IN STATE        A/N                1 296                 591
BG DERIVATIVE      N/V                36 630                6 469
DERIVED            A/N                6 809                 1 071
PARTICIPLE         A/V                401                   56
REGION DOMAIN      N/N V/N A/N B/N    1 280                 4
USAGE DOMAIN       N/N V/N A/N B/N    983                   22
CATEGORY DOMAIN    N/N V/N A/N B/N    6 166                 638
Selected relations
• Synonymy (reflexive, symmetric, and transitive relation of equivalence);
• Hypernymy (inverse, asymmetric, and transitive relation between synonym sets);
• Meronymy (inverse, asymmetric, and transitive relation between synonym sets):
  – part meronymy;
  – member meronymy;
  – portion meronymy.
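Because hypernymy is transitive, a retrieval step can follow the chain of hypernym links upward from a synset. The sketch below illustrates this with a small hand-made hierarchy; the `HYPERNYM_OF` table and the function name `all_hypernyms` are hypothetical and are not taken from the WordNet data used in the project.

```python
def all_hypernyms(synset, hypernym_of):
    """Return every synset reachable from `synset` through hypernym links."""
    found = []
    stack = list(hypernym_of.get(synset, ()))
    while stack:
        current = stack.pop()
        if current not in found:          # guard against repeated visits
            found.append(current)
            stack.extend(hypernym_of.get(current, ()))
    return found


# Toy hierarchy for illustration only (direct hypernym links).
HYPERNYM_OF = {
    "factory": ["plant"],
    "plant": ["building complex"],
    "building complex": ["structure"],
}

print(all_hypernyms("factory", HYPERNYM_OF))
```

The same traversal applies to the meronymy relations, since they are described with the same properties.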
Selected relations
• Similar to (symmetric relation between similar adjectival synsets);
• Verb group (symmetric relation between semantically related verb synsets);
• Also see (symmetric relation between synsets - verbs or adjectives, that are close in meaning);
• Category domain (asymmetric extralinguistic relation between synsets denoting a concept and the sphere of knowledge it belongs to).
DELAF semantic dictionaries
• These dictionaries consist of pairs of literals defined for the corresponding semantic relation:
  – car,automobile.N
  – auto,automobile.N
• All possible combinations between literals in the given synsets are listed:
  – car,automobile.N
  – cars,automobile.N
  – auto,automobile.N
  – autos,automobile.N
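The listing of all combinations can be sketched as follows: every inflected form of every literal in the synset is paired with the target literal. The function name `delaf_pairs` and the `FORMS` table are assumptions made for this illustration; the inflected forms would in practice come from the inflectional dictionary.

```python
def delaf_pairs(forms_by_literal, target, pos):
    """Pair every inflected form of every literal with the target literal."""
    return [
        f"{form},{target}.{pos}"
        for forms in forms_by_literal.values()
        for form in forms
    ]


# Inflected forms per literal, as in the car/auto example above.
FORMS = {"car": ["car", "cars"], "auto": ["auto", "autos"]}
PAIRS = delaf_pairs(FORMS, "automobile", "N")

for pair in PAIRS:
    print(pair)
```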
NooJ Semantic dictionaries
Synonymy relation
‘a plant consisting of buildings with facilities for manufacturing’
фабрика,N+FLX=ENG20-03196165-n
предприятие,N+FLX=ENG20-03196165-n
factory,N+FLX=ENG20-03196165-n
mill,N+FLX=ENG20-03196165-n
manufacturing plant,N+FLX=ENG20-03196165-n
manufactory,N+FLX=ENG20-03196165-n
NooJ Semantic dictionaries
Hypernymy relation
‘the organized action of making of goods and services for sale’
производство,N+FLX=ENG20-00859333-n
промишленост,N+FLX=ENG20-00859333-n
индустрия,N+FLX=ENG20-00859333-n
production,N+FLX=ENG20-00859333-n
industry,N+FLX=ENG20-00859333-n
manufacture,N+FLX=ENG20-00859333-n
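In the entries above, the synset identifier is carried in the `FLX=` field, so literals from different languages that share an identifier can be grouped together. A minimal sketch of that grouping, assuming one entry per line in the `literal,CATEGORY+FLX=ID` shape shown; the function name `group_by_synset` is hypothetical, and this is plain text processing rather than anything NooJ itself does.

```python
from collections import defaultdict


def group_by_synset(entries):
    """Map each synset identifier (the FLX value) to the literals that carry it."""
    groups = defaultdict(list)
    for entry in entries:
        literal, _, info = entry.partition(",")
        for feature in info.split("+"):
            if feature.startswith("FLX="):
                groups[feature[len("FLX="):]].append(literal)
    return dict(groups)


ENTRIES = [
    "фабрика,N+FLX=ENG20-03196165-n",
    "factory,N+FLX=ENG20-03196165-n",
    "mill,N+FLX=ENG20-03196165-n",
    "production,N+FLX=ENG20-00859333-n",
    "industry,N+FLX=ENG20-00859333-n",
]
SYNSETS = group_by_synset(ENTRIES)

print(SYNSETS["ENG20-03196165-n"])
```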
Inflecting WordNet
<SYNSET>
  <ID>...</ID>
  <POS>...</POS>
  <SYNONYM>
    <LITERAL>otstranqwam (to remove)
      <SENSE>…</SENSE>
      <LNOTEGR>ГНТ12</LNOTEGR>
    </LITERAL>
  </SYNONYM>
  <ILR>...<TYPE>...</TYPE></ILR>
  <DEF>remove something concrete, as by lifting, pushing, taking off, etc. or remove something abstract</DEF>
  <BCS>...</BCS>
</SYNSET>
NooJ Semantic descriptions
‘the organized action of making of goods and services for sale’
Bulgarian:
ENG20-00859333-n = <E>/Hs0 + то/Hsd + <L1>а<S1>/Hp0 + <L1>ата<S1>/Hpd
  + <L9>мишленост<S9>/Ss0 + <L9>мишлеността<S9>/Ssd + <L9>мишлености<S9>/Sp0 + <L9>мишленостите<S9>/Spd
  + <B12>индустрия/Ss0 + <B12>индустрията/Ssd + <B12>индустрии/Sp0 + <B12>индустриите/Spd;

English:
ENG20-00859333-n = <E>/Hs + <B10>industry/Ss + <B10>industries/Sp0
  + <B10>manufacture/Ss + <B10>manufactures/Sp;
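The descriptions above produce each literal from the first one (производство, production) by editing it from the right. The sketch below interprets the operators as cursor commands: `<B n>` deletes n characters before the cursor, `<L n>` and `<R n>` move the cursor, `<S n>` deletes n characters after the cursor, `<E>` is the empty string, and plain text is inserted at the cursor. These semantics are an assumption read off the examples on the slides, the function name `apply_ops` is hypothetical, and operators not listed here (such as `<P>`) are not handled.

```python
import re

# One token is either an operator such as <B12> or a run of literal text.
TOKEN = re.compile(r"<([A-Z])(\d*)>|([^<]+)")


def apply_ops(lemma: str, ops: str) -> str:
    """Apply a sequence of editing operators to a lemma, starting at its end."""
    chars = list(lemma)
    cursor = len(chars)
    for op, count, text in TOKEN.findall(ops):
        n = int(count) if count else 1
        if text:                      # literal text: insert at the cursor
            chars[cursor:cursor] = list(text)
            cursor += len(text)
        elif op == "E":               # empty string: nothing to do
            continue
        elif op == "B":               # delete n characters before the cursor
            start = max(0, cursor - n)
            del chars[start:cursor]
            cursor = start
        elif op == "S":               # delete n characters after the cursor
            del chars[cursor:cursor + n]
        elif op == "L":               # move the cursor left
            cursor = max(0, cursor - n)
        elif op == "R":               # move the cursor right
            cursor = min(len(chars), cursor + n)
        else:
            raise ValueError(f"unsupported operator <{op}{count}>")
    return "".join(chars)


print(apply_ops("производство", "<L9>мишленост<S9>"))
print(apply_ops("production", "<B10>industry"))
```

Under these assumptions the first call rewrites производство into промишленост and the second replaces production with industry, which matches the literals listed for this synset.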
After the nice solutions
• Lemmas that are not included in the BGD:
  – classification of lemmas into existing inflection types;
  – formal description of new inflection types;
  – literals in Latin;
  – validating WordNet.
• Semantic ambiguity – literals with two inflectional descriptions in the BGD;
• Compound words:
  – formal description of inflection types;
  – classification of compounds.
NooJ Compound semantic descriptions
ENG20-04182583-n = <E>/Ss0 + <P>та/Ssd + <B>и<P><B>(и/p0 +ите/pd) + <B7>завод<P><B2>ен/Ss0 + <B7>завод<P><B2>ния/Ssh + <B7>завод<P><B2>ният/Ssl + <B7>заводи<P><B2>ни/Sа0 + <B7>заводи<P><B2>ните/Sа0 + <B7>рафинерия/Ss0 + <B7>рафинерия<P>та/Ssd + <B7>рафинерии<P><B>и/Sp0 + <B7>рафинерии<P><B>ите/Spd;
Applications of the Semantic Dictionaries
• Information retrieval by means of semantic equivalence with synonymy dictionaries;
• Information retrieval by means of semantic specification with hyperonymy and meronymy dictionaries;
• Information retrieval by means of similarity;
• Information retrieval by means of thematic domain affiliation;
• Validation of the WordNet structure with respect to its completeness and consistency.
Future directions
• Extensions and enhancements of the semantic dictionaries by means of:
  – extension of the dictionaries' coverage;
  – addition of other semantic relations;
  – inclusion of additional information in the entries.
• Integration of multilingual semantic extraction with NooJ using the Inter-Lingual-Index relation.