Constructing a Romanian Electronic Dictionary Andrei Filip Universitat Autònoma de Barcelona

Post on 13-Jan-2016

1. The Format of the Romanian Electronic Dictionary

1.1. The Macrostructure

The macrostructure is composed of the lexical units that make up the dictionary (in our case, about 30,738 entries).

What makes it different from paper dictionaries?

In what we call traditional dictionaries, each entry generally corresponds to a basic form. This implies a separation between syntax (the structures in which units can be combined) and lexicon (the inventory of forms associated with one or more meanings).

1. The Format of the Romanian Electronic Dictionary

1.1. The Macrostructure

1.2. The Microstructure

2. The Noun Inflection System. NooJ Graphs Implementation

2.1. The Gender and Determination Issues

2.2. The Grammatical Category of Number

2.3. The Grammatical Category of Case

At least two major problems arise from this treatment as far as natural language processing is concerned:

a) polysemy

b) idiomatic expressions

An entry in a traditional dictionary therefore describes either only part of a lexical unit or several lexical units at once.

The strategy we adopt is to treat the entry not as a form but as a lexical unit, which is made up of a form a, a meaning ‘a’ and a combinatory ∑a.

e.g. Este o veste însemnată. (Fr. une nouvelle importante, “an important piece of news”)

Vaca care este însemnată îi aparţine. (Fr. marquée, “marked”)

Este un om însemnat. (Fr. personne estropiée, “a crippled person”)

In the sentences above, each use of the adjective “însemnat” is characterized by its own combinatory and a single meaning, and therefore corresponds to an independent lexical unit.

Moreover, as we have already seen, each lexical unit corresponds to a different translation unit in the target language.

If we define the lexical units as such, lexical ambiguity is no longer a problem as each form corresponds to a single meaning.
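
The idea that one form can yield several lexical units can be sketched as a small data structure. This is only an illustration: the class name `LexicalUnit`, the English meaning labels and the free-text combinatory notes are assumptions made for the example, not the author's actual database layout; the French equivalents are taken from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalUnit:
    form: str         # the form a
    meaning: str      # the meaning 'a' (English label, assumed for illustration)
    combinatory: str  # the combinatory, reduced here to a free-text note (assumption)
    french: str       # French equivalent as given in the example sentences


# One entry per lexical unit, not one entry per form.
UNITS = [
    LexicalUnit("însemnat", "important", "said of news (veste)", "importante"),
    LexicalUnit("însemnat", "marked", "said of an animal (vaca)", "marquée"),
    LexicalUnit("însemnat", "crippled", "said of a person (om)", "estropiée"),
]


def units_for(form: str) -> list:
    """Return every lexical unit that shares the given form."""
    return [unit for unit in UNITS if unit.form == form]
```

Looking up the form returns three units, each with exactly one meaning, so the ambiguity sits between entries rather than inside one entry.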

We should also distinguish between simple and compound lexical units. For the time being we concentrate only on the Romanian dictionary of simple forms and leave the dictionary of compound lexical forms for further research.

Spelling variants have been treated separately: each variant is given its own entry and description in the dictionary.

e.g. atunci/atuncea; acum/acuma; flutur/fluture

We have also adopted a different approach to gender. We give separate entries to masculine and feminine nouns (which could also be termed correlative nouns):

e.g. bunic – bunică; copil – copilă; cuscru – cuscră; cumnat – cumnată; profesor – profesoară; italian – italiancă; leu – leoaică; ţăran – ţărancă; doctor – doctoriţă; cârciumar – cârciumăreasă; păun – păuniţă etc.

The feminine nouns therefore have their own inflection graphs and are not generated as inflected forms of the corresponding masculine noun. This also facilitates the lexicographical treatment of natural gender.

1.2. The Microstructure

The microstructure of an electronic dictionary consists of the lexicographic information recorded for each entry: information on the lemma, on its possible arguments, and on lexical units semantically related to the lemma (i.e. lexical restrictions and translation equivalents).

All this information is distributed among the descriptive fields of the database.

Each entry is characterised first of all by its morphological description (G field). This field refers to the inflection graphs that characterise the parts of speech: N, A, V, ADV, PREP, DET, PRO and Residual. The inflection code attached to each entry also lets us recover further information, such as gender.

The next field, T, provides the syntactico-semantic features of each entry. These features concern mainly nouns. We distinguish between Hum, Inc, Anl, Veg, Loc, Tps and Abs (the last of which is further subdivided into states, actions and events).

The fourth field, C, is reserved for the “classes d’objets”, or object classes (Gross, G. 1994; Le Pesant and Mathieu-Colas, 1998). They are established from the syntactic characteristics of the lexical units. A class of elementary arguments is defined by the predicates that select arguments belonging to the same object class. Higher-order predicates, which accept other predicates in their argument domain, are likewise grouped into object classes. For the time being, 59 classes have been implemented in our dictionary.

e.g. cântăreaţă: C: artist

For greater precision, we also include the D field, which corresponds to the domains (about 91) described by the Laboratoire de Linguistique Informatique of Paris 13.

“un ensemble d’expressions dénommant dans une langue naturelle des notions relevant d’un domaine de connaissance thématisé” (“a set of expressions naming, in a natural language, notions that belong to a thematized field of knowledge”) (Lerat, 1995)

This kind of description will allow us to disambiguate polysemous lexical units.

For further precision, the field SD (subdomain) has been introduced.

e.g. cineast: D: cinema-photography; SD: cinema
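
The descriptive fields introduced so far (G, T, C, D, SD) can be pictured as one record per entry. The sketch below is an assumption-laden illustration: the class `Entry`, the placeholder code `N1` and the value `T=Hum` for "cineast" are invented for the example (the slides give only D and SD for that noun), and the `lemma,CODE+Field=value` line is merely modelled on NooJ-style dictionary lines, not copied from the author's dictionary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    lemma: str
    G: str                     # morphological code pointing to an inflection graph
    T: Optional[str] = None    # syntactico-semantic feature (Hum, Inc, Anl, ...)
    C: Optional[str] = None    # object class
    D: Optional[str] = None    # domain
    SD: Optional[str] = None   # subdomain

    def to_line(self) -> str:
        """Serialize the entry as a single 'lemma,G+Field=value' line."""
        parts = [f"{self.lemma},{self.G}"]
        for name in ("T", "C", "D", "SD"):
            value = getattr(self, name)
            if value is not None:
                parts.append(f"{name}={value}")
        return "+".join(parts)


# Hypothetical entry: only D and SD come from the slides.
cineast = Entry(lemma="cineast", G="N1", T="Hum",
                D="cinema-photography", SD="cinema")
```

Empty fields are simply omitted from the line, which keeps entries short when a field such as C has not been filled in yet.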

The next field holds the translation equivalent (Fr/Es). It is important to stress that we do not treat this field as metalinguistic information about the lexical unit, but as a pointer to another lexical unit that has its own linguistic description in the target-language dictionary.

Our aim is to create coordinated monolingual electronic dictionaries (cf. Blanco 2001), since in most cases the morphological and syntactic descriptions differ from one language to the other.
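
A translation field that points to another entry, rather than storing a gloss, might look like the following sketch. The identifier scheme (`lemma#n`), the dictionary layout and the French entry are all hypothetical; only the Romanian noun "cântăreaţă" with C: artist appears in the slides.

```python
# Two monolingual dictionaries, each keyed by a hypothetical unit identifier.
ROMANIAN = {
    "cântăreaţă#1": {"lemma": "cântăreaţă", "C": "artist", "Fr": "chanteuse#1"},
}
FRENCH = {
    # The French unit carries its own description, independent of the Romanian one.
    "chanteuse#1": {"lemma": "chanteuse", "C": "artist"},
}


def french_equivalent(unit_id: str) -> dict:
    """Follow the Fr pointer from a Romanian unit to its French lexical unit."""
    pointer = ROMANIAN[unit_id]["Fr"]
    return FRENCH[pointer]
```

Because the Romanian entry stores only a key, the French description can change without touching the Romanian dictionary.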

We have also introduced a further field, P (cf. Garrigues 1997), to account for the use a speaker would make of a given lexical entry. Two criteria are taken into consideration for this field:

- we consider the (non)existence of a mental image of a given word in the mental lexicon of a person;

- we consider how often a given word would occur in everyday speech (we refer here not to the form but to the association form/meaning).

A final field remains to be introduced. It has to do with what Hausmann (1989) calls “diasystematics”.

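Diasystem (diastème)   Marks (marques)                             English gloss
diastratique           soutenu, familier, vulgaire                 elevated, colloquial, vulgar
diatopique             américanisme, dialectal                     Americanism, dialectal
diachronique           vieilli, néologisme                         dated, neologism
diaintégratif          latinisme, argot                            Latinism, slang
dianominatif           incorrect                                   incorrect
diaconnotatif          péjoratif, enfantin                         pejorative, childish
diamétrique            oral, écrit                                 spoken, written
diaphasique            formel, informel                            formal, informal
diatextuel             journalistique, administratif, littéraire   journalistic, administrative, literary
diatechnique           langue spécialisée                          specialized language
diafréquence           rare                                        rare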

2. The Noun Inflection System. NooJ Graphs Implementation

2.1. The Gender and Determination Issues

As far as gender is concerned, we distinguish three main classes in Romanian:

• Masculine: un frate – doi fraţi

• Feminine: o colegă – două colege

• Neuter: un drum – două drumuri

From a morphological point of view, neuter nouns behave like masculine nouns in the singular and like feminine nouns in the plural. They therefore select different operators according to number.
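
The behaviour of the neuter can be written as a small mapping from lexical gender and number to the gender that governs agreement. This is a minimal sketch under stated assumptions: the function names and the restriction to the numerals shown in the three examples (un/o, doi/două) are choices made for the illustration.

```python
def agreement_gender(gender: str, number: str) -> str:
    """Gender that governs agreement: neuter is masculine in sg., feminine in pl."""
    if gender == "neuter":
        return "masculine" if number == "sg" else "feminine"
    return gender


# Numeral/indefinite forms taken from the three examples in the list above.
NUMERAL = {
    ("masculine", "sg"): "un",
    ("masculine", "pl"): "doi",
    ("feminine", "sg"): "o",
    ("feminine", "pl"): "două",
}


def with_numeral(noun_form: str, gender: str, number: str) -> str:
    """Prefix an already inflected noun form with the agreeing numeral."""
    key = (agreement_gender(gender, number), number)
    return f"{NUMERAL[key]} {noun_form}"
```

For instance, `with_numeral("drumuri", "neuter", "pl")` selects the feminine numeral even though the singular "drum" takes the masculine one.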

From a semantic point of view, the neuter is a fairly homogeneous class: it includes mostly Inc nouns (e.g. ciocan – ciocane), HumColl nouns (e.g. popor, trib, grup, colectiv etc.) and Anl nouns that denote species (e.g. mamifer, gasteropod, dobitoc).

As far as the grammatical category of determination is concerned, we concentrate here only on the definite article. All the other Det have their own inflection systems, which depend on case and on whether or not they precede the NG.

The definite article in Romanian is an enclitic morpheme attached to the noun, so it needs to be described in the inflection graph:

e.g. studentul, steaua, cartea, regele, codrul; ţară – ţara, popă – popa, poezie – poezia etc.

As far as the plural nouns are concerned, the definite article morpheme depends only on the gender of the corresponding noun:

• “i” for the masculine nouns:

e.g. studenţi – studenţii; fraţi – fraţii; copaci – copacii

• “le” for the feminine and neuter nouns:

e.g. studente – studentele; poezii – poeziile; popoare – popoarele; sigilii – sigiliile.
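
The plural rule above is simple enough to state as a function. This is a sketch that assumes the unarticulated plural form is already known and covers only the Nominative/Accusative; the function name is illustrative.

```python
def definite_plural(plural_form: str, gender: str) -> str:
    """Attach the enclitic definite article to an unarticulated N/Ac plural."""
    if gender == "masculine":
        return plural_form + "i"
    if gender in ("feminine", "neuter"):
        return plural_form + "le"
    raise ValueError(f"unknown gender: {gender!r}")
```

Note that masculine plurals already ending in "-i" simply gain one more "i" (fraţi gives fraţii), which is why some nouns end up with a double or triple "i" in writing.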

2.2. The Grammatical Category of Number

When it comes to inflectional morphemes that designate the opposition singular-plural, we could distinguish three main classes of nouns in Romanian:

a) Variable nouns with a regular inflection paradigm:

e.g. casă – case; şcolar – şcolari; drum – drumuri;

b) Variable nouns with an irregular inflection paradigm:

e.g. om – oameni; soră – surori;

c) Invariable nouns:

e.g. tei – tei; învăţătoare; pronume

So far we have created 11 different inflection graphs for masculine nouns, 16 for the feminine and 9 for neuter nouns.
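
What an inflection graph does to a lemma can be approximated with a tiny command interpreter. The sketch below supports only a backspace operator written `<B>` or `<Bn>` (delete the last one or n characters), in the style of NooJ's inflectional operators, followed by literal letters. The command strings are assumptions chosen to reproduce the examples above; they are not the author's actual graphs.

```python
import re

# A token is either a backspace operator (<B> or <Bn>) or a run of literal letters.
TOKEN = re.compile(r"<B(\d*)>|([^<>]+)")
VALID = re.compile(r"(?:<B\d*>|[^<>]+)*")


def apply_command(lemma: str, command: str) -> str:
    """Apply a simplified suffix command to a lemma and return the inflected form."""
    if not VALID.fullmatch(command):
        raise ValueError(f"unsupported command: {command!r}")
    form = lemma
    for count, literal in TOKEN.findall(command):
        if literal:
            form += literal                 # append literal letters
        else:
            n = int(count) if count else 1  # <B> means one character
            if n:
                form = form[:-n]            # delete the last n characters
    return form
```

With this reading, a regular paradigm is one short command shared by many lemmas, whereas an irregular noun such as soră needs a command that rewrites most of the stem.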

Several nouns, especially feminine and neuter ones, have two plural forms:

e.g. coală – coli/coale; vreme – vremi/vremuri; chibrit – chibrituri/chibrite; hotel – hoteluri/hotele.

In some cases, however, the two plurals call for different lexico-semantic descriptions: the singular form is the same, but the meaning and the combinatory differ. Such nouns are therefore treated under separate entries in our dictionary.

e.g. corn – coarne vs. corn – cornuri; mâncare – mâncări vs. mâncare – mâncăruri

Special attention should be paid to Singularia Tantum and Pluralia Tantum nouns. Our strategy is to indicate, in the code that labels the entry in the G field, that the noun lacks this inflection feature.

e.g. ochelari: N11P; moaşte: N23P
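
Codes such as N11P can be split mechanically into their parts. The reading used below (category letters, graph number, optional number restriction) is an assumption inferred from the two examples; in particular, the suffix "S" for Singularia Tantum is hypothetical, since the slides show only "P".

```python
import re

CODE = re.compile(r"(?P<category>[A-Z]+)(?P<graph>\d+)(?P<restriction>[SP]?)")

RESTRICTIONS = {"P": "plural only", "S": "singular only", "": None}


def parse_g_code(code: str) -> dict:
    """Split a G-field code into category, graph number and number restriction."""
    match = CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a recognised G code: {code!r}")
    return {
        "category": match.group("category"),
        "graph": int(match.group("graph")),
        "restriction": RESTRICTIONS[match.group("restriction")],
    }
```

Keeping the restriction inside the code means a defective noun can reuse an existing graph instead of needing a graph of its own.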

2.3. The Grammatical Category of Case

A third main factor we have to consider when building our inflection graphs is case. From the point of view of their internal structure, nouns fall into the same three main classes as for the number opposition:

• Variable nouns with a regular inflection pattern;

• Variable nouns with an irregular inflection pattern;

• Invariable nouns.

Let us first consider nouns in the Nominative and the Accusative. They may or may not carry the enclitic definite article (om – omul, oameni – oamenii, casă – casa, case – casele etc.).

A noun without the enclitic article may be accompanied by the indefinite article or by another determiner, which then carries the inflection.

The main noun forms in the Nominative/Accusative are:

a) With the definite article:

Sg.  Masc.   omul       fratele    codrul    fiul
     Neuter  dealul     teatrul    fluviul   paiul    numele
     Fem.    casa       basmaua    ziua      vulpea   câmpia
Pl.  Masc.   oamenii    fraţii     codrii    fiii
     Neuter  dealurile  teatrele   fluviile  paiele   numele
     Fem.    casele     basmalele  zilele    vulpile  câmpiile

b) With the indefinite article:

Sg.  Masc.   un     om       frate    codru   fiu
     Neuter  un     deal     teatru   fluviu  pai    nume
     Fem.    o      casă     basma    zi      vulpe  câmpie
Pl.  Masc.   nişte  oameni   fraţi    codri   fii
     Neuter  nişte  dealuri  teatre   fluvii  paie   nume
     Fem.    nişte  case     basmale  zile    vulpi  câmpii

There were many orthographic constraints that we had to consider when designing our inflection graphs, but for the sake of concision we do not go into detail here.

N/Ac sg. with the indefinite article   N/Ac pl. with the indefinite article   N/Ac pl. with the definite article
(un) fiu                               (nişte) fii                            fiii
(un) surugiu                           (nişte) surugii                        surugiii
(un) uliu                              (nişte) ulii                           uliii
(un) copil                             (nişte) copii                          copiii

For the Genitive and the Dative, we distinguish the following main forms:

a) Articulated forms:

Sg.  Masc.   omului      fratelui    codrului   fiului
     Neuter  dealului    teatrului   fluviului  numelui
     Fem.    casei       basmalei    vulpii     câmpiei
Pl.  Masc.   oamenilor   fraţilor    codrilor   fiilor
     Neuter  dealurilor  teatrelor   fluviilor  numelor
     Fem.    caselor     basmalelor  vulpilor   câmpiilor

b) Unarticulated Forms

Sg.  Masc.   unui  om       frate    codru   fiu
     Neuter  unui  deal     teatru   fluviu  nume
     Fem.    unei  case     basmale  vulpi   câmpii
Pl.  Masc.   unor  oameni   fraţi    codri   fii
     Neuter  unor  dealuri  teatre   fluvii  nume
     Fem.    unor  case     basmale  vulpi   câmpii

We have to note that the only nouns that change their forms in the Dative and Genitive are feminine nouns in the singular:

e.g. casă – casei / unei case

basma – basmalei / unei basmale

vulpe – vulpii / unei vulpi.

In this case, the Dative and the Genitive singular are indicated both by the form of the noun itself and by the form of the article (the definite article “-i” or the indefinite article “unei”).

As with nouns in the Nominative/Accusative, we also have to deal here with orthographic exceptions. We can identify four main types, but we give only one example here.

Feminine nouns whose Nominative singular ends in a vowel or diphthong are written with final “-ei” or “-ii” when they are inflected with the definite article in the Dative/Genitive singular. To avoid confusion, we take the unarticulated Nominative plural form as the base:

e.g.  N. pl. unart.   D/G sg. unart.   D/G sg. art.
      (nişte) case    (unei) case      casei
      vulpi           vulpi            vulpii
      femei           femei            femeii
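
For the three nouns in this example, the articulated Dative/Genitive singular is the unarticulated Nominative plural plus "-i". The sketch below encodes just that observation. It is deliberately narrow: it does not cover nouns such as câmpie, whose articulated form in the tables is câmpiei while the plural is câmpii, and the function name is illustrative.

```python
def feminine_dg_sg_articulated(nom_pl_unarticulated: str) -> str:
    """Articulated D/G singular of a feminine noun, built from its N plural.

    Covers only the pattern shown in the example (case, vulpi, femei);
    nouns like 'câmpie' (D/G sg. art. 'câmpiei') need a different rule.
    """
    return nom_pl_unarticulated + "i"


def feminine_dg_sg_unarticulated(nom_pl_unarticulated: str) -> str:
    """Unarticulated D/G singular: identical to the N plural in the tables."""
    return nom_pl_unarticulated
```

Starting from the plural rather than the singular avoids having to decide between "-ei" and "-ii" from the singular ending alone.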

Finally, when it comes to the nouns in the Vocative, we can distinguish four different cases:

1. There are some nouns which have specific forms for the Vocative:

e.g. bărbate; cumetre; bunicule (masculine)

bunico; cuscro (feminine)

2. Some nouns can have specific Vocative forms, but they also accept an alternative form which is identical with that in the Nominative/Accusative inflected form:

e.g. bunico - bunica

3. The majority of nouns have specific forms for the Vocative case, but when speakers want to emphasize the appellative function, the form of the uninflected Nominative/Accusative noun is used:

e.g. frate; tată; mamă

4. For the masculine and feminine plural nouns in the Vocative we use the same forms as for the Nominative/Accusative uninflected nouns or Genitive/Dative inflected forms:

e.g. Veniţi, fraţi!

Staţi, fraţilor/vecinilor/fetelor!
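
The fourth case can be sketched by reusing the plural forms already described: in every plural row of the Genitive/Dative tables, the articulated form is the unarticulated plural plus "-lor". The function names below are illustrative, and the rule is only checked against the nouns that appear in the slides.

```python
def genitive_dative_plural_articulated(nom_pl_unarticulated: str) -> str:
    """Articulated G/D plural: unarticulated N/Ac plural + 'lor'."""
    return nom_pl_unarticulated + "lor"


def vocative_plural_forms(nom_pl_unarticulated: str) -> list:
    """Both forms usable as a plural Vocative, per case 4 above."""
    return [
        nom_pl_unarticulated,
        genitive_dative_plural_articulated(nom_pl_unarticulated),
    ]
```

So the plural Vocative needs no endings of its own in the graph; it can be tagged onto forms that the Nominative and Genitive/Dative paths already produce.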

With the support of the Universitat Autònoma de Barcelona
