Processing Frozen Sentences in Portuguese: Automatic Rule and Example Generation from a Lexicon-Grammar
Ana Isabel Silva Galvão
Dissertation for obtaining the Master Degree in
Information Systems and Computer Engineering
Supervisor(s): Prof. Nuno João Neves Mamede
Prof. Jorge Manuel Baptista
Examination Committee
Chairperson: Prof. Paolo Romano
Supervisor: Prof. Nuno João Neves Mamede
Member of the Committee: Profa. Helena Moniz
May 2019
Acknowledgements
I want to start by thanking Professors Nuno Mamede and Jorge Baptista for their tireless help, advice
and useful critiques of this research work. I especially thank Professor Nuno for his crucial (and
very frequent) advice to "simplify things", an ability that I frequently lose. Reconciling my working
schedule with the development of this work was not always an easy task, but both Professors always
gave their best to ease the situation.
I deeply thank my mother Isabel and my father Luís for teaching me the value of hard work, and for al-
ways supporting me unconditionally, for being my safety net and for boosting my confidence whenever
I had a hard time finding it. I thank my boyfriend Henrique for being so incredibly patient even when
I was tired and mean, and for being there for me in any situation - even when it required being enclosed
inside the house on a lovely sunny day. Finally, a heartfelt thank you to my brother João, who endured
all the thesis journey with me, day and night, facing the worst days with me, all my tantrums and stress
during this period, always making sure that I never felt alone.
Without them I would not have been able to do this.
Lisbon, 10th of May, 2019
Ana Isabel Silva Galvão
Resumo
Expressões fixas são expressões multi-palavra que constituem um grande conjunto da léxico-gramática
de muitas línguas, embora a sua frequência em textos seja, muitas vezes, baixa. Analisar expressões fixas
é uma tarefa desafiante porque estas são conjuntos de palavras sintaticamente analisáveis, mas cujo
significado é não-composicional. Dado um sistema de Processamento de Língua Natural para Português
Europeu, o principal objetivo deste projeto é usar a matriz que contém a mais recente descrição linguís-
tica de forma a conseguir traduzi-la para regras Xerox Incremental Parser (XIP), permitindo ao sistema
não só identificar as frases manualmente produzidas que podem ser encontradas na matriz, mas tam-
bém as automaticamente geradas a partir destas, através da aplicação das transformações permitidas
por cada construção.
De forma a atingir esse objetivo, o gerador de regras foi reconstruído de tal forma que as regras geradas
incluam não apenas a estrutura básica do idioma mas também as várias transformações ou redução de
certos elementos a pronomes que podem ser aplicados a cada frase. Um módulo que gera automatica-
mente este tipo de frases a partir das frases base foi também desenvolvido.
Também foi implementada validação automática de forma a verificar o desempenho do sistema, que foi
globalmente melhorado quando comparado com o sistema anterior, permitindo uma identificação mais
correta e abrangente de expressões fixas.
Palavras-Chave
Processamento de Língua Natural
Expressões Fixas
Idiomas verbais
Expressões multipalavra
Categoria gramatical
Abstract
Frozen sentences are multi-word expressions that constitute a large set of the Lexicon-Grammar
of many languages, though their frequency in texts is often very low. Parsing frozen sentences is a
challenging task because they are syntactically analyzable strings whose meaning is non-compositional.
Given an existing Natural Language Processing (NLP) system for European Portuguese, the main goal
of this project is to use the matrix containing the most recent linguistic description in order to be able to
correctly translate it to XIP rules, allowing for it to identify not only manually produced sentences, but
also automatically generated ones from the base sentences by applying the transformations authorised
by each construction.
In order to achieve that goal, the rule generator was rebuilt so that the generated rules include not only
the basic structure of the idiom, but also the several transformations or reduction of certain elements
to pronouns that may be applied to each sentence. A module that automatically generates this type of
sentences from the base sentences was also developed.
Automatic validation was also implemented in order to verify the performance of the system, which
was overall improved when compared to the previously existent system, allowing for a more correct
and inclusive identification of frozen expressions.
Keywords
Natural Language Processing
Frozen Sentences
Verbal Idioms
Parsing Multiword Expressions
Part of Speech
Table of Contents
Acknowledgements i
Abstract v
List of Figures ix
List of Tables xii
List of Acronyms xiii
1 Introduction 1
1.1 Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Frozen Sentences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Statistical and Rule-Based Natural Language Processing Chain (STRING) . . . . . . . . . 8
1.5 XIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Related work 17
2.1 Representing Frozen Expressions on an XLSX file . . . . . . . . . . . . . . . . . . . . . . 17
2.1.1 Converting XLSX to CSV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.2 Validating the CSV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.3 Xipificator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Previous Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.1 Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 Solution 25
3.1 Lexicon-Syntactic Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Xipificator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.1 Converter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.2 Rule Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.3 Example Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.4 Example Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Evaluation 43
4.1 Analysing the corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Evaluation method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3.1 Base sentences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3.2 Artificially generated sentences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.4 Previous solution vs. Developed solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5 Conclusions 57
5.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
References 59
A Conversion to XIP rules 61
B Readme of the program 71
List of Figures
1.1 STRING architecture [11] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Output tree following pre-processing, disambiguation, and chunking [2]. . . . . . . . . . 13
2.1 General aspect of the matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Modules of the validator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Scheme representing the XIP rules generation; the input is the XLSX file, converted to a
CSV file, which is validated and, in parallel, used for generating XIP rules. . . . . . . . . . 22
3.1 Comparing the two systems: orange represents what was re-written, green what was added. 26
3.2 Structure of the xipificator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 A schematic representation of the process of generating rules. . . . . . . . . . . . . . . . . 29
3.4 A frozen sentence and the heads of its constituents. . . . . . . . . . . . . . . . . . . . . . . 31
3.5 General mechanism for generating example sentences . . . . . . . . . . . . . . . . . . . . . 36
3.6 Example validation criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1 Comparing the performance (recall) of the developed system against the performance of
the previous one for the manually produced sentences. . . . . . . . . . . . . . . . . . . . . 52
4.2 Comparing the performance of the developed system against the performance of the pre-
vious one for the artificially generated sentences. . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 Comparing the performance of the developed system against the performance of the pre-
vious one for the artificially and manually generated sentences. . . . . . . . . . . . . . . . 54
List of Tables
1.1 Summarized Class Structure, where N represents a free noun phrase, while C is a frozen
constituent; the indices 0, 1, 2 and 3 correspond to the subject and to the first, second, and third
complements. Prep is a preposition; w is any sequence of complements (possibly none). . . . 7
1.2 Operators and their functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1 XIP syntax for POS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 XIP translation for each column . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Restrictions to be added to the rule of the base sentence . . . . . . . . . . . . . . . . . . . . 34
4.1 Sentence distribution per class. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Manually produced sentences correctly identified as frozen. . . . . . . . . . . . . . . . . . 46
4.3 Artificially generated sentences for [PronA] correctly identified as frozen . . . . . . . . . 48
4.4 Artificially generated sentences for [PronR] correctly identified as frozen . . . . . . . . . 48
4.5 Artificially generated sentences for [PronPos] correctly identified as frozen . . . . . . . . 49
4.6 Artificially generated sentences for [PronD] correctly identified as frozen . . . . . . . . . 49
4.7 Artificially generated sentences for [RDat] correctly identified as frozen . . . . . . . . . . 50
4.8 Artificially generated sentences for [PassSer] correctly identified as frozen . . . . . . . . 50
4.9 Artificially generated sentences for [PassEstar] correctly identified as frozen . . . . . . 50
4.10 Number of manually produced sentences identified as frozen according to the defined
criteria, for both systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.11 Number of artificially generated sentences identified as frozen according to the defined
criteria, for both systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.12 Number of sentences (manually and artificially generated) identified as frozen according
to the defined criteria, for both systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
A.1 XIP Rule restrictions and instantiation for the class C1 and the example O João abanou o
capacete . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
A.2 XIP Rule restrictions and instantiation for the class CDN and the example O Rui sondou a
opinião da Inês . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
A.3 XIP Rule restrictions and instantiation for the class CAN and the example O João matou a
fome do Pedro. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
A.4 XIP Rule restrictions and instantiation for the class CNP2 and the example O Rui cortou o
problema pela base. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
A.5 XIP Rule restrictions and instantiation for the class C1PN and the example A Rita afiou os
dentes ao dinheiro. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
A.6 XIP Rule restrictions and instantiation for the class C1P2 and the example O casaco custou
os olhos da cara do Rui. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
A.7 XIP Rule restrictions and instantiation for the class CPPN and the example O João comprou
gato por lebre ao Pedro. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
A.8 XIP Rule restrictions and instantiation for the class CPP and the example O Zé bate com
o nariz na porta lit: ‘Zé hit with his nose on the door’. . . . . . . . . . . . . . . . . . . . . . 66
A.9 XIP Rule restrictions and instantiation for the class CP1 and the example O Zé bateu em
retirada. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
A.10 XIP Rule restrictions and instantiation for the class CPN and the example O Zé desceu na
consideração da Ana. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
A.11 XIP Rule restrictions and instantiation for the class C0 and the example A sorte bateu à
porta do Pedro. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
A.12 XIP Rule restrictions and instantiation for the class C0E and the example Vai pentear
macacos!. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A.13 XIP Rule restrictions and instantiation for the class CADV and the example O Pedro não
nasceu ontem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A.14 XIP Rule restrictions and instantiation for the class CV and the example A resposta não se
fez esperar. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
List of Acronyms
L2F Spoken Language Systems Laboratory.
MWE Multiword Expressions.
NLP Natural Language Processing.
PARSEME PARSing and Multi-word Expressions.
POS Part-of-Speech.
STRING Statistical and Rule-Based Natural Language Processing Chain.
TRE Tree Regular Expression.
XIP Xerox Incremental Parser.
Chapter 1
Introduction
Verbal idioms are idiomatic (semantically non-compositional) Multiword Expressions (MWE)
consisting of a verb and at least one constrained argument slot [5]. Therefore, they are consid-
ered frozen sentences because the verb and at least one of its arguments are frozen together,
that is, they present idiosyncratic and semantically unpredictable distributional constraints. This means
that, unlike free sentences, their meaning cannot be calculated from each individual component, but
rather from the sentence as a whole [5]. Removing any element and replacing it with something else
would turn the sentence to its literal meaning or result in an unacceptable utterance. However, usually,
one or more of the argument noun phrases are distributionally free, which means that they can vary
(within generic distributional constraints) without affecting the global meaning of the sentence. On the
other hand, this type of sentences also differs from free sentences because they block transformations
that should otherwise be possible, given the syntactic properties of the verb and its arguments [3]. One
example of this type of sentences is: O João abriu os cordões à bolsa, lit: ‘João opened the laces to the bag’, ‘to pay for something’.
In order for the sentence to maintain its meaning:
• None of the complements may have distributional variations (except for the subject);
• The combination abrir-cordões is frozen;
• cordões cannot be replaced by any other expression, nor be modified by adjectives;
• Replacing à bolsa by any other expression would turn the sentence to its literal sense.
Finally, frozen sentences represent a problem for many NLP systems because they cannot simply be
treated as a single block [4]; on the contrary, they have a syntactic structure that is amenable to analysis, unlike
compound lexical items (nouns, adverbs, conjunctions, etc.). Besides, their elements can appear discon-
tinuously and they may also present some formal variations, often being ambiguous - the same sequence
may have a literal and a figurative meaning - and in that case only an extended context can disambiguate
them [3].
Given these facts, it is possible to conclude that the integration of this specific type of expressions in
NLP systems, in order to obtain an accurate semantic parsing, is a challenging task. A great amount of
work has been done in this area, such as a European Portuguese annotated corpus built in the scope of
the project PARSing and Multi-word Expressions (PARSEME)1, an interdisciplinary scientific network
devoted to the role of MWE in parsing2. For the purpose of this project, a previously built lexicon-
syntactic matrix was used, which encodes the linguistic information, using the framework of Gross
[9]. Its information will then be integrated into a fully-fledged NLP system built for Portuguese, the
STRING [11]. The STRING system uses the XIP [5] parser to segment sentences into chunks and extract
dependency relations among chunks’ heads [12]. Considering that most idioms have a “normal” syn-
tactic structure, which follows the ordinary word combinatory rules of the general grammar, STRING’s
strategy consists in parsing them first as ordinary sentences and only then identifying specific word
combinations, whose meaning should not be calculated in a compositional way. The idiomatic word
combinations are identified by a dependency, FIXED, which takes as arguments the verb and the frozen
elements of the idiomatic expression (the number of arguments depends on the type of verbal idiom
involved) [5].
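STRING's two-stage strategy described above (parse the idiom first as an ordinary sentence, then identify the specific word combination and mark it with a FIXED dependency) can be illustrated with a minimal sketch. The lexicon layout, the `match_fixed` helper and the tuple representation below are hypothetical simplifications for illustration only, not XIP's actual data structures or API.

```python
# Minimal sketch of the second stage of the strategy: the sentence is assumed
# to be already parsed into a verb and its dependent lemmas; we then check
# whether the verb together with its frozen arguments matches a lexicon entry
# and, if so, emit a FIXED dependency. All names here are illustrative.

# Hypothetical lexicon entries: (verb lemma, frozen argument lemmas)
FROZEN_LEXICON = [
    ("abrir", ("cordão", "bolsa")),  # O João abriu os cordões à bolsa
]

def match_fixed(verb, dependents):
    """Return a FIXED(...) tuple if the verb plus frozen dependents are listed."""
    for lex_verb, lex_args in FROZEN_LEXICON:
        if verb == lex_verb and all(arg in dependents for arg in lex_args):
            return ("FIXED", verb) + lex_args
    return None

# Dependents as the parser might have produced them (lemmatized):
print(match_fixed("abrir", ["João", "cordão", "bolsa"]))
```

Note that the free subject (João) plays no role in the match: only the verb and the frozen elements are checked, mirroring the idea that free argument slots may vary without affecting the idiomatic reading.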
1.1 Goal
The main goal of this dissertation project is to use the matrix containing the most recent linguistic
description in order to be able to correctly translate it to XIP rules, allowing for it to identify not only
manually produced sentences, but also automatically generated ones from the base sentences by apply-
ing the transformations authorised by each construction. In order to do so, three essential tasks were
considered:
• To rebuild the rule generator so that the generated rules include not only the basic structure of the
idiom, but also the several transformations or reduction of certain elements to pronouns that may
be applied to each sentence;
• To create a module that automatically generates sentences resulting from applying the foremen-
tioned transformations to the base sentences;
• To create an automatic validator that compares the expected results to the obtained ones, after run-
ning both manually produced and automatically generated sentences in STRING. This validates
not only the correctness of the generated rules, but also of the generated examples.
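The third task, the automatic validator, amounts to comparing expected analyses against the ones actually obtained over both kinds of sentences. A rough sketch, with an invented data layout (triples of sentence, expected result, obtained result), might look like:

```python
# Hedged sketch of the automatic validator: count how many test sentences were
# analysed as expected. The triple layout and the FIXED(...) strings are
# illustrative, not the actual format used by STRING.
def validate(results):
    """Return (correct, total) over (sentence, expected, obtained) triples."""
    correct = sum(1 for _, expected, obtained in results if expected == obtained)
    return correct, len(results)

results = [
    ("O João abriu os cordões à bolsa", "FIXED(abrir,cordões,bolsa)",
     "FIXED(abrir,cordões,bolsa)"),
    ("A Maria amarrou o burro", "FIXED(amarrar,burro)", None),  # missed
]
correct, total = validate(results)
print(f"{correct}/{total} correctly identified")
```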
1 https://typo.uni-konstanz.de/parseme/
2 For more information on this type of research please refer to [6].
1.2 Thesis Structure
The remainder of this document is structured as follows:
• Chapter 2 briefly describes the work developed so far, namely the representation of the frozen
expressions on a lexicon-grammar matrix, saved in an XLSX file, and the previous implementation
for automatic rule generation;
• Chapter 3 presents the changes made to the lexicon-grammar description, as well as the solution
developed during this project;
• Chapter 4 presents the evaluation methods of the developed solution, the results and their
analysis. A comparison between the previous implementation and the developed one is also per-
formed;
• Chapter 5 presents the conclusions drawn from this project, as well as the perspectives of future
work.
1.3 Frozen Sentences
Frozen sentences are elementary sentences where the main verb and at least one of its argument
noun-phrases are distributionally constrained, and usually the global meaning of the expression cannot
be calculated from the individual meaning of its constituents when used independently. Therefore, the
expression should be taken as a complex, multiword, lexical unit [3].
To date, a set of 2,561 European Portuguese verbal idioms has been classified into 15 formal
classes according to their structure and distributional constraints, as well as their syntactic properties.
Table 1.1 shows the breakdown of frozen sentences per class. The theoretical and methodological frame-
work of M. Gross [9, 10] was used to classify this type of expressions. This framework bases its classi-
fication on the structure of the sentence, as well as the number and type of arguments of the main verb
[3]. Ten classes were already considerably developed during a previous development, but the remaining
four are still at an early stage. These are the classes C0, C0E, CADV and CV, which are not very numerous.
In a distributionally free sentence, the overall meaning is determined from the individual meaning
of the elements in the construction, but the meaning of a frozen sentence cannot be directly calculated
from the meaning that the component elements may present when used separately [4]. In Chapter 3,
a step-by-step description of a sentence will be presented. Take as an example the following sentence:
O João abriu os cordões à bolsa, lit: ‘João opened the laces to the bag’, ‘to pay for something’. Here,
no element can be substituted while keeping the overall meaning of the sentence, where the verb-object
combination abrir-cordões is frozen, as well as the combination with the instrumental à bolsa, lit: ‘to the
bag’. One can neither replace cordões, lit: ‘laces’ with another word, nor modify it using a free adjective.
Also, removing the instrumental complement à bolsa, lit: ‘to the bag’ and replacing it with something
else would turn the sentence to its literal meaning.
A step-by-step generation of a rule for the sentence O João virou o bico ao prego, as represented in
Figure 3.4, may be found in Chapter 3.
However, frozen sentences usually present some, often highly constrained, formal variation. For ex-
ample, in the sentence O João entregou a alma a Deus, lit: ‘João delivered the soul to God’, ‘to die’,
the noun Deus, lit: ‘God’, could be replaced by Senhor, lit: ‘Lord’. This would not change the
meaning of the sentence, though the variation paradigm is rather short and often unpredictable. The
frozen verb-noun combination is responsible for this distributional constraint, which can be consider-
ably different from the constraints imposed by the verb when functioning as an independent lexical
unit. For example, the verb vender, ‘to sell’, admits both human and non-human (animal and abstract)
nouns for its subject when its object is alma, lit: ‘soul’, but, in the frozen sentence, only human nouns are
allowed [3].
Another example on how this type of sentences differs from free sentences is the blockage of transfor-
mations that should otherwise be possible, given the syntactic properties of the verb and its arguments.
As a free sentence, the passive transformation with the auxiliary verb ser, lit: ‘to be’, is applicable to the
example O João abriu o programa com chave de ouro lit: ‘João opened the program with a golden key’.
It becomes O programa foi aberto com chave de ouro pelo João lit: ‘The program was opened with a
golden key by João’.
Direct transitive constructions (without prepositional complements)
C1 This class represents sentences with a fixed direct complement (without any free determina-
tive complements, see class CDN below): A Maria amarrou o burro, lit: ‘Maria tied the donkey’,
‘to pout’. Sentences belonging to this class may suffer the transformations [Pass-ser] and
[Pass-estar]: O burro foi amarrado pela Maria, O burro está amarrado pela Maria, lit: ‘The
donkey was tied by Maria’.
CDN The sentences belonging to this class also feature a frozen direct complement, but its head
contains a free determinative complement (de N, of N); this determinative complement can-
not undergo a dative restructuring [RDat]: O João salvou a pele do Presidente, lit: ‘João
saved the skin of the President’, meaning ‘to save someone’. Sentences belonging to this class
may suffer the transformation [PronPos].
CAN This class is similar to CDN, but its free determinative complement might undergo a dative
restructuring, [RDat]. This is a syntactic transformation that splits a complex noun phrase,
where a metonymic (part-whole) relation is observable between N1 and N2 (N1 de N2, N1 of
N2). This originates two constituents, and the second phrase assumes the syntactic function
of indirect (dative) complement: O Manuel quebrou o coração da Maria, lit: ‘Manuel broke
Maria’s heart’, which becomes O Manuel quebrou o coração à Maria, lit: ‘Manuel broke the
heart to Maria’; the new dative complement can then undergo the dative pronouning, i.e.
a reduction to a dative pronoun, e.g. O Manuel quebrou-lhe o coração, lit: ‘Manuel broke to
her the heart’, meaning ‘Manuel broke her heart’. Sentences belonging to this class may also
suffer the transformation [PronPos].
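As a very rough illustration of the dative restructuring [RDat] just described, the contracted preposition de + article introducing the determinative complement can be rewritten as its dative counterpart a + article. The sketch below operates on raw strings for simplicity; the actual system works on the syntactic analysis, and the helper and contraction table are assumptions made for this example.

```python
# Toy illustration of [RDat]: rewrite the last de+article contraction as the
# corresponding dative contraction ("da Maria" -> "à Maria",
# "do Pedro" -> "ao Pedro"). A real implementation would operate on parses,
# not on raw strings.
CONTRACTIONS = [(" da ", " à "), (" do ", " ao "), (" das ", " às "), (" dos ", " aos ")]

def rdat(sentence):
    """Apply the dative restructuring to the last determinative complement."""
    for de_form, a_form in CONTRACTIONS:
        idx = sentence.rfind(de_form)
        if idx != -1:
            return sentence[:idx] + a_form + sentence[idx + len(de_form):]
    return sentence

print(rdat("O Manuel quebrou o coração da Maria"))
# -> O Manuel quebrou o coração à Maria
```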
Direct transitive constructions (containing one prepositional complement)
CNP2 This class has a free direct complement and a fixed prepositional complement: O Eduardo
levantou o Pedro da lama, lit: ‘Eduardo took Pedro out of the mud’ ‘to help someone get out
of a complicated situation’. Sentences belonging to this class may suffer the transformations
[PronR], [PronA], [Pass-ser] and [Pass-estar].
C1PN This class has a fixed direct complement and a free prepositional complement. O Pedro ac-
ertou as agulhas com a Rita, lit: ‘Pedro got the needles straight with Rita’ ‘to get things right
with someone’. Sentences belonging to this class may suffer the transformations [PronD],
[RDat], [PronPos], [Pass-ser] and [Pass-estar].
C1P2 The sentences belonging to this class have a fixed direct complement and a fixed prepositional
complement: O Pedro cortou o problema pela raiz, lit: ‘Pedro cut the problem at its
root’, ‘solve a problem by addressing its causes’. No transformations can be applied to the
sentences from this class.
Prepositional constructions
CP1 This class contains sentences with only one prepositional complement: O Pedro meteu-se
num trinta e um, lit: ‘Pedro got himself into a thirty one’, ‘to get himself into a complicated
situation’. No transformations can be applied to the sentences from this class.
CPN This class is defined by having a prepositional phrase where the head-noun C is frozen with
the verb, while its determinative complement is free [3]: O Manuel foi ao pelo do Pedro, lit:
‘Manuel went to Pedro’s fur’ ‘Manuel hit Pedro’. Sentences belonging to this class may suffer
the transformations [RDat] and [PronPos].
CPP This class contains sentences with two prepositional complements: O Zé bateu com o nariz
na porta, lit: ‘Zé hit with his nose on the door’ ‘finding a place to be closed or not achieving
something’. Sentences belonging to this class may suffer the transformations [PronD] and
[PronPos].
CPPN This class is defined by containing three essential complements where at least one is frozen
with the verb3 O Pedro apanhou o Filipe com a boca na botija, lit: ‘Pedro caught Filipe with
3 Because the number of sentences is small, no further sub-classification was established, as was done for other structures.
Notice that this class may admit direct complements as well.
his mouth on the cannister’, which means ‘to find someone red-handed’. Sentences belonging
to this class may suffer the transformations [PronR], [PronA], [PronD], [RDat], [PronPos],
[Pass-ser] and [Pass-estar].
Other constructions
C0 In this type of constructions, the subject is frozen together with the verb (which might also
accept other complements, either free or fixed). An example of this is: A sorte sorriu ao Pedro,
lit: ‘Luck smiled at Pedro’, ‘Peter was lucky’. Sentences belonging to this class may suffer the
transformations [PronA], [PronD], [RDat] and [PronPos].
C0E This class is constituted by frozen sentences mandatorily in the imperative or exclamative
mood; the subject is often a second person, i.e. the addressee, which is zeroed: Vai pentear
macacos!, lit: ‘Go comb monkeys!’ ‘do not bother me/anyone anymore’. No transformations
can be applied to the sentences from this class.
CADV In these constructions, the verb is frozen together with an adverb (and usually there are no
other complements): O Pedro não nasceu ontem, lit: ‘Pedro was not born yesterday’ ‘is not
dumb’. No transformations can be applied to the sentences from this class.
CV This class includes constructions involving two verbs, usually with a preposition connecting
the first verb V to the second verb Vc. The first verb should not be analyzed as an auxiliary
for the second verb: Ainda está para nascer alguém assim, lit: ‘It is yet to be born someone
like this’, meaning ‘there is no one like this person’. No transformations can be applied to the
sentences from this class.
Table 1.1: Summarized Class Structure, where N represents a free noun phrase, while C is a frozen constituent; the
indices 0, 1, 2 and 3 correspond to the subject and to the first, second, and third complements. Prep is a preposition;
w is any sequence of complements (possibly none).

Class  Structure                   Example                                                                  Count
C1     N0 V C1                     O João não abriu a boca, ‘be silent’                                       500
CDN    N0 V (C of N)1              O João atraiu os olhares da Ana, ‘draw someone’s eye’                       44
CAN    N0 V (C of N)1 = C1 to N2   O João calou a boca da Ana, ‘shut up someone’                              182
CNP2   N0 V N1 Prep2 C2            O Rui chamou a Inês à razão, ‘call to reason’                              172
C1PN   N0 V C1 Prep2 N2            A Maria desligou os aparelhos ao moribundo, ‘switch off the machines’      255
C1P2   N0 V C1 Prep C2             O João retomou o fio à meada, ‘resume the thread’                          291
CPPN   N0 V C1 Prep C2 Prep C3     O João vendeu gato por lebre à Maria, ‘sell cat for hare’                   46
CPP    N0 V Prep C1 Prep C2        O Zé não morre de amores pela Ana, ‘is not fond of’                        181
CP1    N0 V Prep C1                O Zé voltou à carga, ‘charge again onto something’                         662
CPN    N0 V Prep (C of N)1         O Zé caiu nas garras da Ana, ‘fall in the claws of’                        103
C0     C0 V w                      A sorte bafejou o Pedro, ‘luck blew over someone’                           21
C0E    V w                         Vai pentear macacos!, ‘go comb monkeys’, ‘get lost’                          1
CADV   N0 V Adv                    O Pedro não nasceu ontem, ‘was not born yesterday’                          70
CV     N0 V (Prep) Vc w            A resposta não se fez esperar, ‘did not have to wait much for something’    13
Total                                                                                                       2,542
1.4 STRING
STRING [11] is a hybrid statistical and rule-based NLP chain for Portuguese, which has been developed
by the Spoken Language Systems Laboratory (L2F) at INESC-ID Lisboa. STRING has a modular structure
and performs all the basic NLP tasks. The system’s architecture is shown in Figure 1.1.
LexMan [16] is a morphological tagger. It receives as input the text to be processed and starts by
tokenizing it, splitting the text into segments; it then associates all possible Part-of-Speech (POS)
tags to each segment. Besides this, the module is also responsible for the identification at the earliest
possible stage of certain special types of tokens, namely:
email addresses, ordinal numbers (e.g. 3o, 42a), numbers with thousands and fractional separators (in
Portuguese these are the dot . and the comma , respectively, e.g. 12.345,67), IP and HTTP addresses,
integers (e.g. 12345), several abbreviations with a dot . (e.g. a.C., ‘before Christ’; V.Exa., ‘Your Excellency’), numbers written
in full (e.g. duzentos e trinta e cinco, ‘two hundred and thirty-five’), sequences of interrogation and
exclamation marks, as well as ellipsis (e.g. ???, !!!, ?!?!, ...), punctuation marks (e.g. !, ?, ., ,, :, ;, (, ), [, ]),
symbols (e.g. <, >, #, $, %, &, +, -, *, <, >, =, @), and Roman numerals (e.g. LI, MMM, XIV). Naturally,
besides these special textual elements, the tokenizer identifies ordinary simple words, such as alface,
‘lettuce’. It also tokenizes as a single element sequences of words connected by hyphen, most of them
compound words, like fim-de-semana, ‘weekend’ [11]. Next, this module splits the text into sentences.
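These special token types can be approximated with ordinary regular expressions. The sketch below illustrates the idea for a handful of them; the patterns are simplified stand-ins written for this illustration, not LexMan's actual ones:

```python
import re

# Ordered patterns, first match wins. These are simplified stand-ins for
# a few of the special token types listed above, not LexMan's patterns.
PATTERNS = [
    ("email",   re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")),
    ("ordinal", re.compile(r"\d+[oa]")),                 # e.g. 3o, 42a
    ("decimal", re.compile(r"\d{1,3}(\.\d{3})*,\d+")),   # e.g. 12.345,67
    ("integer", re.compile(r"\d+")),                     # e.g. 12345
    ("roman",   re.compile(r"[IVXLCDM]+")),              # e.g. XIV, MMM
    ("punct",   re.compile(r"[!?.,:;()\[\]]+")),         # e.g. ???, ?!?!
]

def classify(token):
    """Return the special-token type of `token`, or 'word' by default."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return name
    return "word"
```

Ordering matters: the decimal pattern must be tried before the integer one, and uppercase-only tokens fall through to "roman" only when nothing more specific matched.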
Then, RuDriCo2 [7, 8] is applied. This module performs rule-based morphological disambiguation
and it also makes segmentation changes to the input, like joining segments (compound words) or
splitting them (contractions). MARv4 is a stochastic morphological disambiguator. It receives the result
of RuDriCo2 and selects the best POS tag for each segment, given its context. Finally, the
last module to be applied is XIP [13], a finite-state incremental parser developed by the Xerox Research
Centre Europe (XRCE), which uses a Portuguese rule-based grammar and is responsible for the syntactic
analysis4. This module is also responsible for parsing verbal idioms, so a more detailed description is
provided in Section 1.5.
Figure 1.1: STRING architecture [11]
4The Portuguese grammar for XIP was initially developed, starting in 2004, under a collaboration between L2F and the
Xerox Research Centre Europe [11]. Since then, the effort has been invested mainly by the L2F team.
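The sequence just described, LexMan, RuDriCo2, MARv4 and then XIP, amounts to a pipeline in which each module consumes the previous one's output. The sketch below shows only that data flow; every body is a placeholder standing in for the real component, not STRING's actual code:

```python
def lexman(text):
    """Tokenize and attach the set of all possible POS tags (placeholder)."""
    return [(token, {"noun", "verb"}) for token in text.split()]

def rudrico2(segments):
    """Rule-based disambiguation and resegmentation (placeholder)."""
    return segments

def marv4(segments):
    """Statistically pick one POS tag per segment (placeholder: just the
    alphabetically first candidate, for illustration)."""
    return [(token, sorted(tags)[0]) for token, tags in segments]

def xip(tagged):
    """Syntactic analysis: chunks and dependencies (placeholder)."""
    return {"tokens": tagged, "dependencies": []}

def string_pipeline(text):
    # Each module consumes the previous module's output, as in Figure 1.1.
    return xip(marv4(rudrico2(lexman(text))))
```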
1.5 XIP
This module is briefly described based on the documents [14, 15]. The parser allows for the
introduction of lexical, syntactic, and semantic information to the output of the previous modules, as
well as performing the syntactic analysis of the text through the following processes:
• Lexicons: allow for the information to be added to the different tokens. In XIP there is a pre-
existing lexicon, which can be enriched by adding lexical entries or changing the existing ones;
• Chunking Rules: perform a shallow parsing or basic syntactic analysis of the text. For each phrase
type (e.g. NP, PP, VP, etc.) a sequence of categories is grouped into elementary syntactic structures,
called chunks. The chunk types depend on the POS of their head element, usually the last element
of the chunk;
• Dependency Rules: dependencies are syntactic dependency relations between different chunks,
chunk heads, or elements inside chunks and they allow a deeper and richer knowledge about the
text’s information and content. Major dependencies correspond to the so-called deep parsing syn-
tactic functions, such as SUBJECT, DIRECT COMPLEMENT, etc. Other dependencies are just auxiliary
relations, mostly used to calculate the deeper syntactic dependencies. For example, the CLINK de-
pendency links each argument of a coordination to the coordinative conjunction it depends on.
A given dependency can be percolated from one argument to the next when the sentence contains
coordinated phrases.
Verbal idioms are identified by STRING using a dependency FIXED linking the key elements of the
structure (the main verb and frozen head nouns). The lexicon-grammar of verbal idioms was integrated
in the rule-based parsing module of the NLP chain in the form of parsing rules. Since frozen sentences are
syntactically well-formed structures, complying with the general word-combination rules of grammar,
the following strategy was adopted to parse them. First, general parsing rules can be applied, as to any
other structure. Then, another set of rules extracts the FIXED dependency based on the previous parse,
and groups together the frozen elements of the idiom, while keeping intact the syntactic structure of
the dependency. Finally, the FIXED dependency is the one used to further calculate the semantics of the
sentence [4].
The fundamental data representation unit in XIP is the node. It has a category, feature-value pairs and
brother nodes. Taking as an example the following node:
Pedro: noun[human, individual, proper, first name, people, sg, masc, maj]
This node represents the noun Pedro and it has several features, used to express its properties: Pedro
is a noun that represents a human, a masculine individual (feature masc); the node also has features
to describe its number (singular, sg) and the fact that it is spelled with an upper-case initial letter (feature
maj). Moreover, features can be instantiated (operator =), tested (operator :), or deleted (operator =~)
within all types of rules. While instantiation and deletion are all about setting/removing values to/from
features, testing consists of checking whether a specific value is set for a specific feature, as shown in
Table 1.2:
Lexicons
XIP allows the definition of custom lexicons (lexicon files), which add new features that are not stored
in the standard lexicon. Having a rich vocabulary in the system can be very beneficial for improving its
recall. In XIP, a lexicon file begins by simply stating Vocabulary:, which tells the XIP engine that the
file contains a custom lexicon. Only afterwards come the actual additions to the vocabulary. The lexical
rules attempt to provide a more precise interpretation of the tokens associated with a node. They have
the following syntax (the parts of the rule contained in parentheses are optional):
lemma(: POS([features])) (+)= (POS)[features].
Examples of lexical rules:
$US = noun[meas=+, curr=+].
eleitor: noun += [human=+].
acenar += verb[vdat=+].
The first two examples show how to add new features to existing words. In the first case, the features
meas (measure) and curr (currency) are added to $US, which is POS-tagged as a noun; in the second case,
the human feature is added to the noun eleitor ('elector'). In the third case, the word acenar ('to wave'),
irrespective of its former POS, is given the additional reading of verb.
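The rule syntax above is regular enough that a single pattern can split a rule into its parts. A sketch of such a reader (illustrative only; this is not XIP's own rule parser):

```python
import re

# Sketch of a parser for the lexical-rule syntax
#   lemma(: POS([features])) (+)= (POS)[features].
# Illustrative only -- this is not XIP's own rule reader.
RULE = re.compile(
    r"(?P<lemma>\S+?)"                # lemma, e.g. eleitor or $US
    r"(?:\s*:\s*(?P<lpos>\w+))?"      # optional ': POS' on the left side
    r"\s*(?P<op>\+?=)\s*"             # '=' (replace) or '+=' (add)
    r"(?P<rpos>\w+)?"                 # optional POS on the right side
    r"\s*(?:\[(?P<feats>[^\]]*)\])?"  # optional [feature list]
    r"\s*\.\s*$"
)

def parse_lexical_rule(rule: str) -> dict:
    m = RULE.match(rule)
    if m is None:
        raise ValueError(f"not a lexical rule: {rule!r}")
    return m.groupdict()
```

Applied to the three examples above, the parser separates the lemma, the operator (= vs. +=), the POS on either side, and the feature list.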
Table 1.2: Operators and their functions.
Type Example Explanation
Instantiated [gender=fem] The value fem is set to the feature gender
Deleted [acc=∼] The feature acc is cleared of all values on the node
Tested [gender:fem] Does the feature gender have the value fem?
[gender:∼] The feature gender should not be instantiated on the node
[gender:∼fem] The feature gender should not have the value fem
Chunking Rules
Chunking is the process by which sequences of categories are grouped into structures; this is done
using chunking rules. There are two types of chunking rules:
• Immediate dependency and linear precedence rules (ID/LP rules);
• Sequence rules.
In order to illustrate the syntax of the chunking rules, a few examples will be used. The first impor-
tant aspect to be taken into account is that each rule must be defined in a specific layer. This layer is
represented by an integer number, ranging from 1 to 300. Below is an example of how to define two
rules in two different layers:
1 > NP = (art;?[dem]), ?[indef1]. // layer 1
2 > NP = (art;?[dem]), ?[poss]. // layer 2
Layers are processed sequentially from the first one to the last. Each layer can contain only one
type of chunking rule. ID/LP rules are significantly different from sequence rules. ID rules describe
unordered sets of nodes and their syntax is the following:
layer> node-name -> list-of-lexical-nodes.
An example of an ID rule is:
1 > NP -> det, noun, adj.
Assuming that det, noun and adj are categories that have already been declared, this rule can be
interpreted as follows: whenever there is a sequence of a determiner, noun and adjective, regardless of
the order in which they appear, create a Noun Phrase (NP) node. Obviously, this rule applies to more
expressions than those desirable, e.g. o carro preto, lit: ‘the car black’, o preto carro, lit: ‘the black car’,
preto carro o, lit: ‘black car the’ and carro preto o lit: ‘car black the’. This is where LP rules come into
play: these rules work with ID rules to establish some order between the categories, while sequence
rules describe an ordered sequence of nodes. By being associated with ID rules, LP rules can apply to
a particular layer or be treated as a general constraint throughout the XIP grammar. LP rules have the
following syntax:
layer> [set-of-features] < [set-of-features].
Considering the following example:
1> [det:+] < [noun:+].
1> [noun:+] < [adj:+].
This illustration of chunking rules states that a determiner must precede a noun on layer one, and
that a noun must precede an adjective on the same layer (the actual grammatical rules governing the
relative position of adjectives and nouns are much more complex). This means that expressions such as
o preto carro ('the black car') will no longer be allowed, while o carro preto, lit: 'the car black', still will. It
is also possible to use parentheses to express optional categories, and a Kleene star to indicate that zero
or more instances of a category are accepted. The following rule states that the determiner is optional
and that zero or more adjectives are accepted, to form a NP chunk:
1> NP -> (det), adj*, noun.
Considering both LP rules established above, the following expressions are accepted: carro, lit: ‘car’,
carro preto, lit: ‘car black’, o carro preto, lit: ‘the car black’, o carro preto bonito, lit: ‘the car black
beautiful’.
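Under the two LP constraints, the ID rule NP -> (det), adj*, noun admits exactly one order: an optional determiner, the noun, then any number of adjectives. That admissible order can be checked with an ordinary regular expression over POS-tag sequences; the sketch below illustrates the combined effect of the rules, not XIP's matching engine:

```python
import re

def np_chunk(pos_tags):
    """Return True if the POS sequence forms an NP under the ID rule
    NP -> (det), adj*, noun constrained by the LP rules det < noun and
    noun < adj, i.e. an optional determiner, the noun, then adjectives."""
    sequence = " ".join(pos_tags)
    return re.fullmatch(r"(det )?noun( adj)*", sequence) is not None
```

The accepted sequences mirror the examples in the text: carro, carro preto, o carro preto, o carro preto bonito; while o preto carro (determiner, adjective, noun) is rejected.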
Finally, it is worth mentioning that these rules can be further constrained with right and/or left con-
texts. For example:
1> NP -> |conj| adj, noun |verb|.
This rule states that a conjunction appears at the left of the sequence of categories, and that a verb must
appear at the right side of that sequence. By applying this rule to a sentence such as E carros pretos há
muitos na estrada, lit: 'and cars black there are many on the road', the following chunk will be obtained:
NP[carros pretos].
Despite helping to constrain a rule even further, contexts are not saved inside a node.
The other kind of chunking rules, sequence rules, though conceptually different because they describe
an ordered sequence of nodes, are almost identical to the ID/LP rules as far as their syntax is concerned.
There are, however, some differences and additions:
• Sequence rules do not use the -> operator. Instead, they use the = operator, which matches the
shortest possible sequence. In order to match the longest possible sequence, the @= operator is
used instead;
• There is an operator for applying negation (˜) and another for applying disjunction (;);
• Unlike ID/LP rules, the question mark (?) can be used to represent any category on the right side
of a rule;
• Sequence rules can use variables.
The following sequence rule matches expressions like alguns rapazes/uns rapazes, lit: ‘some boys’,
nenhum rapaz, lit: 'no boy', muitos rapazes, lit: 'many boys' or cinco rapazes, lit: 'five boys'; [indef2]
and [q3] are features of lexical items:
1> NP @= ?[indef2];?[q3];num, (AP;adj;pastpart), noun.
Finally, consider the example O Zé bateu em retirada, lit: 'Zé beat in retreat', 'to run away'. At this
stage, after the pre-processing and disambiguation, and also after applying the chunking rules, the sys-
tem presents the chunking output tree illustrated on Figure 1.2.
Dependency Rules
This step is crucial for a richer understanding of texts. Dependency rules take the sequences of con-
stituent nodes, identified by the chunking rules, and identify syntactic dependency relations between
them. A dependency rule presents the following syntax:
|pattern| if <condition> <dependency_terms>.
In order to understand the pattern, it is first essential to understand what a Tree Regular
Expression (TRE) is. A TRE is a special type of regular expression that is used in XIP in order to establish
connections between distant nodes. In particular, TREs explore the inner structure of subnodes through
the use of braces ({}). The following example states that a NP node’s inner structure must be examined
in order to see if it is made of a determiner and a noun:
NP{det,noun}.
TREs support the use of several operators, namely:
• The semicolon (;) operator is used to indicate disjunction;
• The Kleene star (*) operator is used to indicate ’zero or more’;
Figure 1.2: Output tree following pre-processing, disambiguation, and chunking [2].
• The question mark (?) operator is used to indicate ’any’;
• The circumflex (ˆ) operator is used to explore subnodes for a category.
Hence, and returning to the dependency rules, the pattern contains a TRE that describes the structural
properties of parts of the input tree. The condition is any Boolean expression supported by XIP (with
the appropriate syntax), and the dependency_terms are the consequent of the rule.
The first dependency rules to be executed are the ones that establish the dependencies between the
nodes, as seen in the next example:
|NP#1?*, #2[last] |
HEAD(#2, #1)
This rule identifies HEAD relations (see below) in noun phrases. For example, in the NP a bela rapariga
(‘the beautiful girl’), the rule extracts a HEAD dependency between the head noun rapariga (‘girl’) and the
whole noun phrase — HEAD(rapariga, a bela rapariga).
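The effect of this rule can be mimicked in a few lines: for each NP chunk, link its last element to the whole chunk as a HEAD dependency, mirroring the #2[last] / #1 pairing. The chunk representation below is hypothetical, chosen only for this illustration:

```python
def extract_heads(chunks):
    """chunks: list of (label, [tokens]) pairs (hypothetical representation).
    For each NP, emit a HEAD dependency between its last token and the
    whole chunk, mirroring the rule's #2[last] / #1 pairing."""
    deps = []
    for label, tokens in chunks:
        if label == "NP" and tokens:
            deps.append(("HEAD", tokens[-1], " ".join(tokens)))
    return deps
```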
As already stated, the main goal of the dependency rules is to establish dependencies between the
nodes. The following output is the current result of applying these rules to the sentence O Zé bateu em
retirada, lit: ‘Zé beat in retreat’ ‘to run away’:
MAIN(bateu)
DETD(Zé,O)
VDOMAIN(bateu,bateu)
MOD_POST(bateu,retirada)
SUBJ_PRE(bateu,Zé)
FIXED(bateu,retirada)
NE_PEOPLE_INDIVIDUAL(Zé)
0>TOP{NP{O Zé} VF{bateu} PP{em retirada}}
The last two lines indicate that one named entity (NE) has been captured and classified in this sentence:
Zé has been identified as a HUMAN INDIVIDUAL, that is, a PERSON. The tag NE_PEOPLE_INDIVIDUAL signals
that the NE has been classified. The other dependencies listed above cover a wide range of binary
dependencies such as:
• The relation between a nominal head and a definite determiner (DETD);
• The verb (MAIN);
• The relation between a modifier and, in this case, the verb it modifies (MOD_POST);
• The subject of the verb (SUBJ_PRE);
• The fixed dependency identified between the verb and the noun (FIXED).
To see a complete list and a detailed description of all syntactic dependency relations as of May
2016, please refer to [2]. XIP’s syntax for these conditional statements also allows the operators & for
conjunction and | for disjunction. Parentheses are also used to group statements and establish a clearer
precedence.
Chapter 2
Related work
This chapter aims to describe both the architecture and previous behaviour of the system, as well as
the work done so far in the linguistic description.
2.1 Representing Frozen Expressions in an XLSX file
The lexicon-syntactic description of frozen expressions is represented in a matrix, as shown in Figure
2.1, contained in an XLSX file. This matrix is composed of a header and a set of properties for frozen
sentences; the description of each sentence occupies one line of the matrix. The first
column refers to the class, represented by a conventional code, defined based on M. Gross' criteria
[9] for describing frozen sentences. The possible values for this column are the ones defined in
Subchapter 1.3.
Figure 2.1: General aspect of the matrix.
The first few columns refer to how the rule should be generated:
Exotic No rule is generated because the structure of the sentence is atypical, or its use is deemed too
rare;
Fail Used to mark the cause for the validation error. If it is empty, it is assumed that there is no
error;
Ignore Determines what should be ignored when generating a rule;
AllManual If this column is checked, the content of the cell Manual will contain the XIP rule for this ex-
pression;
Manual Reserved cell, where the manual rule is inserted. This type of rule describes patterns that
cannot be automatically generated by the system;
Example A sentence to be used for testing with the validator;
Observations Remarks regarding the rule;
Other example Second example to be tested with the generated XIP rule;
Expected This cell contains the expected result to be produced by XIP's dependency list for the
expression. It is used only when there are problems, making it possible to compare
what should be produced with what was, in fact, obtained.
Distributional and verb-related properties
N0 = Nhum The head of the subject NP is a human noun, e.g. Maria;
N0 = N-hum Head of the subject NP of the sentence is not a human noun, e.g. casaco, ‘coat’;
Vse The verb in this expression presents an intrinsically pronominal (reflexive) construction, e.g. fazer-
se de Lucas, 'pretend to ignore something';
NegObrig The expression presents a construction containing an obligatory negation modifier, e.g. não
dar para as encomendas, 'someone who is unable to correspond to the requests';
V Main verb of the frozen construction;
PrepLink Preposition that links the first verb of the construction to a second verb, both frozen together
(class CV, see Chapter 1.3), e.g. Ainda está para nascer quem me há de ganhar nisto, lit: ‘It is
yet to be born the one who will beat me on this’;
Vc The second verb of a construction with two fixed verbs (class CV) e.g. Este caminho vai dar à
praia, lit: ‘This path leads to the beach’.
Constituent’s common components
C0 The lexical element that is the head of the constituent 0;
Det0 The (fixed) determiner of the constituent;
Modif0-E The (fixed) modifier, to the left of the constituent;
Modif0-D The (fixed) modifier, to the right of the constituent;
C0Manual XIP's manual rule for all the modifiers. It overrides the automatically generated rule.
This is useful when some exceptional rule representation is required.
On the other hand, constituents 1 to 4 contain, besides the aforementioned components, the following
ones1:
Prep1 Preposition that introduces C1;
AttachV1 By default, the N1 noun depends on the verb, unless it is introduced by the preposition
de (in which case it depends on the previous chunk). By checking this cell with a "+", a dependency
on the verb is created instead of the default dependency on the previous chunk2;
[PronR1] The (free) noun phrase N1 can be reduced to a reflexive pronoun; e.g. besides O Pedro entregou
tudo nas mãos de Deus, 'Pedro put everything in the hands of God', one could also find
O Pedro entregou-se nas mãos de Deus, 'Pedro put himself in the hands of God';
[PronD1] The complement N$ is distributionally free, and it can be reduced to a dative pronoun3; e.g.
O Pedro tirou o chapéu ao João, lit: 'Pedro took off the hat to João'; after a dative restructuring
(see [Rdat] below), it would become O Pedro tirou-lhe o chapéu, lit: 'Pedro took off
his hat';
[PronPos1] The (free) prepositional phrase "de N$" can be reduced to a possessive pronoun; e.g. O Zé
fala nas costas da Ana, ‘Zé speaks behind Ana’s back’ becomes O Zé fala nas suas costas,
‘Zé speaks behind her back’;
1The description is made for index 1, but it is the same for all constituents.
2The rules are generated considering STRING's operating behaviour.
3In this pronominalization process, the preposition a, 'to' (rarely para, 'to'), is also reduced.
[Pass-ser] The auxiliary copulative verb accepted for the passive of this construction is ser, ‘to be’; e.g.
A imprensa abafou um escândalo, ‘The press smothered a scandal’ becomes Um escândalo
foi abafado pela imprensa, ‘A scandal was smothered by the press’;
[Pass-estar] The copulative verb accepted for the passive can be any copulative verb except ser, 'to be';
the agentive subject is zeroed in the passive form; e.g. A imprensa abafou um escândalo,
lit: 'The press smothered a scandal', becomes Um escândalo está abafado pela imprensa, 'A
scandal is smothered by the press';
[Pass-se] This construction admits the pronominal passive form; it is currently not used because it does
not occur very often in verbal idioms;
[Neutra] This construction admits the neutral passive form; it is currently not used;
Normalized A particular set of predicates is paired with a generic verb; e.g. bater as botas, lit: 'kick the
boots', or ir para o maneta, lit: 'go to the one-handed man', are labeled as morrer, 'to die'.
Constituent 1 has two extra components:
ADV1 Adverbial complement (fixed), usually an adverb (for class CADV only), e.g. O Pedro foi
embora, 'Pedro went away';
[PronA1] The (free) noun phrase N$ can be reduced to an accusative pronoun, e.g. O João viu a Inês
pelo canto do olho, lit: 'João saw Inês from the corner of his eye', becomes O João viu-a
pelo canto do olho, lit: 'João saw her from the corner of his eye' (for classes CNP2,
with a free CDIR);
And components 2 to 4 contain two other exclusive components4:
[Rdat2] If selected, the sentence will allow for a dative restructuring operation, where a determinative
complement de N, 'of N', becomes a dative complement a N, 'to N', more closely attached to the
verb. This new dative complement is then often reduced to a dative pronoun5, e.g. O João
come as papas na cabeça do Pedro, lit: 'João eats the mash on Pedro's head', 'to make a
fool out of someone', becomes O João come-lhe as papas na cabeça, lit: 'João eats to him
the mash on head';
4The description is made for index 3, but it is the same for index 4.
5[Rdat$] always implies that the constituent can be reduced to a dative pronoun. Hence, whenever [PronD$] is marked
as +, [Rdat$] is -, and vice-versa.
[Sim2] This property is known as symmetry: two constituents of this construction can be coordinated
in a given syntactic position (either symmetric subjects or symmetric complements) and can
trade places without changing the global meaning of the sentence [1]; e.g. A Isabel juntou
os trapinhos com o Luís, lit: 'Isabel gathered her rags with Luís', has the same meaning as
O Luís juntou os trapinhos com a Isabel, lit: 'Luís gathered his rags with Isabel', which
is 'to get together/married'; hence, the two constituents can be coordinated. A pronominal
copy, also known as echo complement [1], such as um com o outro, can be added to the
sentence with coordinated constituents, but this copy is optional. The sentence then becomes
O Luís e a Isabel juntaram os trapinhos (um com o outro), lit: 'Luís and Isabel gathered
the rags (with one another)'. Symmetric constructions, i.e., the coordinated forms of these
frozen sentences, have been described and formalized in a separate work [1] and will not be
considered in this project.
2.1.1 Converting XLSX to CSV
A converter from XLSX to CSV has been previously developed. This converter is a simple script that
receives as its argument the XLSX file and transforms each of its cells into a value, separated by a comma.
This CSV file is then used to generate the XIP rules, so it is very important that this conversion does not
fail. For this purpose, the CSV file needs to be validated. This is the subject of Subchapter 2.1.2.
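The conversion itself is straightforward once the cell values are available; what matters for the downstream rule generator is that commas inside cells are escaped correctly. A sketch using Python's csv module (the real converter is a separate script; reading the XLSX, e.g. with a library such as openpyxl, is assumed to have happened already):

```python
import csv
import io

def rows_to_csv(rows):
    """rows: list of lists of cell values, assumed already read from the
    XLSX (e.g. with a library such as openpyxl). Returns CSV text.
    csv.writer quotes any cell that itself contains a comma, so a cell
    like 'a, b' survives the round trip instead of corrupting the matrix."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        # Empty cells become empty strings rather than the text 'None'.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buf.getvalue()
```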
2.1.2 Validating the CSV
The validation of the CSV is broken down into smaller, but fundamental, subtasks, shown in
Figure 2.2:
1. Validating whether the data conforms to what was expected (consistency of each element). Here
the vectors of classes and the possible fields to be ignored are defined. This step also
defines the validation matrix, with the following structure: [Column name, Validation Type,
Possible Values];
2. Asserting the consistency between each element, respecting the property of each sentence and
the values of each column, e.g. whether the values are consistent or inconsistent for nouns and
prepositions;
3. Validating class consistency by checking whether each class contains the expected arguments, in-
cluding the symmetry property;
4. Checking the consistency between column values; validation of whether all the restrictions are
being respected and there are no impossible combinations.
This validator was left as-is and was not subject to any alterations. It was used for validating the
matrix in this project.
Figure 2.2: Modules of the validator
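Subtask 1 revolves around a validation matrix of [Column name, Validation Type, Possible Values] triples. A minimal sketch of how such a table can drive per-cell checks; the column names, types, and value sets below are illustrative, not the validator's actual ones:

```python
# Each entry of the validation matrix: (column name, validation type,
# possible values). Concrete columns and value sets here are illustrative.
VALIDATION_MATRIX = [
    ("Class",    "enum",     {"CNP2", "C1PN", "C1P2", "CP1", "CPN",
                              "C0", "CADV", "CV"}),
    ("NegObrig", "enum",     {"+", "-"}),
    ("V",        "nonempty", None),
]

def validate_row(row):
    """row: dict mapping column name -> cell value; returns error messages."""
    errors = []
    for column, vtype, allowed in VALIDATION_MATRIX:
        value = row.get(column, "")
        if vtype == "enum" and value not in allowed:
            errors.append(f"{column}: unexpected value {value!r}")
        elif vtype == "nonempty" and not value:
            errors.append(f"{column}: must not be empty")
    return errors
```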
2.1.3 Xipificator
The process starts with converting the XLSX file to a CSV file, if necessary. The previous system is
a Perl application that generated, in an automatic way, XIP rules that allow for the extraction
of the FIXED dependency. The input was an XLSX file containing a matrix with the lexical, syntactic
and semantic description of 2,520 manually produced frozen expressions, as represented in Figure 2.1.
Its output is a file containing the set of generated XIP rules, which are then included in STRING. The
conventions used in the matrix are represented in Table 3.1. The notation used for representing the
number/index of the constituent is $. Given that the index ranges from 0 to 4, N$=Nhum can become
N0=Nhum, N1=Nhum and so on. However, the $ will be replaced by 0 in the constituents common to all
dependencies (such as determiners, prepositions, modifiers...). This way it is not necessary to enumerate
the same constituents for every dependency.
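The $ convention can be expanded mechanically into its concrete variants. A small sketch of the substitution (illustrative only):

```python
def expand_indices(prop, indices=range(5)):
    """Expand a $-indexed property name into its concrete variants,
    e.g. 'N$=Nhum' -> ['N0=Nhum', 'N1=Nhum', ..., 'N4=Nhum'].
    Properties without a $ are returned unchanged (sketch only)."""
    if "$" not in prop:
        return [prop]
    return [prop.replace("$", str(i)) for i in indices]
```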
The system already includes a validation script for the generated rules. This script instantiates the
XIP rules for each generated example and runs them on STRING, as represented in Figure 2.3, later checking
whether the dependencies for the frozen expression are correctly extracted. The pipeline of procedures
allows for a correct identification of a frozen sentence.
2.2 Previous Implementation
The previous implementation for the automatic generation of XIP rules is provided in three files. The
main one is named xipificator.pl. It also uses a file named xipificator_aux_functions.pl and a file
named xipificator_validate.pl. These contain both the auxiliary functions necessary for the intermediate
tasks as well as a validator of the generated rules.
Figure 2.3: Scheme representing the XIP rules generation; the input is the XLSX file, converted to a CSV file, which is
validated and, in parallel, used for generating XIP rules.
The process starts with fetching the necessary arguments, such as the input file, the pattern name
and the XLSX sheet name. Then it converts the XLSX file to a CSV file, if necessary. After this it proceeds
with searching for the corresponding patterns. A pattern defines a correspondence between a column
and an element. If none exists or it is marked as AUTO, the system will guess the pattern based on the
names of the filled columns. At last, it writes the rule and a comment containing an example. The
method for writing a rule is the following:
1. Prints the restriction for the verb;
2. Prints the restriction for the negative form of the verb;
3. Prints the restriction for the clitic;
4. Searches for the elements of a dependency and prints their dependency links, and then a function
makes recursive calls until it reaches the last dependency.
The final output for each line of the matrix is a XIP rule, which consists of a set of restrictions that must
be obeyed so that the system extracts the FIXED dependency, thus identifying a construction as frozen.
If the restrictions are not well-formed, that is, if some restrictions are missing or misplaced, the system
may incorrectly extract the dependency or extract it containing the wrong arguments. The validation
performed by this system verified only whether the FIXED dependency had been extracted, ignoring the
correctness of its arguments.
2.2.1 Issues
Given that the sentences are described in a matrix, the system was built around a static number
of columns and attributes. After it was built, the matrix suffered a number of changes, including the
addition of transformations to be applied to the rules. These changes caused the system to stop being
able to generate XIP rules. Besides, neither transformations nor pronominalization had been foreseen
in the previous system. This required adopting a new strategy for generating rules that are able to
recognize expressions whose elements have undergone these formal changes. However,
there are almost no manually produced sentences that contain transformations. Therefore, these need
to be automatically generated from the manually produced ones, found in the matrix.
Another problem is that the previous system ran one sentence at a time, initializing the system
each time. This resulted in a long delay: around 18 hours to process the sentences
in STRING and obtain results. Finally, in this implementation, the only validation method used
by the system was to verify whether the FIXED dependency had been extracted. No further verification
concerning the arguments of the dependency was done, and these may not be correct.
Chapter 3
Solution
This chapter aims at describing the architecture and implementation of the proposed solution.
The main goal of this project is to use the matrix containing the most recent linguistic description
and to correctly translate it into XIP rules, allowing the system to identify not
only the manually produced sentences but also their variants, derived automatically from the examples encoded
in the matrix by applying the transformations authorised by each construction. In order to do so, the
rule generator was rebuilt so that the generated rules capture not only the basic structure of the idiom,
but also the several transformations, or the reduction of certain elements to pronouns, that may be applied
to each sentence.
A module was created for generating examples in an automatic way, by applying several possible
transformations, such as pronominalization and the passive form, to the base sentences found in the
matrix. These examples were to be run on STRING, alongside the base sentences.
Finally, an automatic validator was developed. This validator receives as input the results of processing
all the sentences, manually produced and artificially generated, and compares them against what was
expected.
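At its core, that comparison reduces to set operations over the extracted dependency triples: the FIXED dependency must be present and its arguments must match. A sketch, using a hypothetical (name, arg1, arg2) encoding of STRING's output:

```python
def check_fixed(extracted, expected):
    """extracted, expected: sets of (dependency, arg1, arg2) triples, a
    hypothetical encoding of STRING's output. Checks that the FIXED
    dependencies match exactly, arguments included -- unlike the previous
    validator, which only checked that FIXED had been extracted at all."""
    fixed_out = {d for d in extracted if d[0] == "FIXED"}
    fixed_exp = {d for d in expected if d[0] == "FIXED"}
    missing = fixed_exp - fixed_out
    unexpected = fixed_out - fixed_exp
    return (not missing and not unexpected, missing, unexpected)
```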
The main differences between the previous and current systems are represented in Figure 3.1, where the
green modules are the ones that were created from scratch, and the orange modules are the ones that
were restructured or suffered some form of modification. The blue ones were left untouched, but they
were integrated in the system. The inputs and outputs represented in this image will be detailed as each
module is observed in detail.
This chapter will start by describing the structure of the lexicon-syntactic matrix. Following this, the
architecture and implementation of the new modules and the changes performed on the existing ones
will be detailed.
3.1 Lexicon-Syntactic Matrix
To develop this project, a manually produced set of 2,561 European Portuguese verbal idioms was
used. This set is grouped in 15 formal classes according to their structure and distributional constraints.
These are described in a lexicon-syntactic matrix, an XLSX file, which will be used for both rule and
example generation, as shown in Figure 2.1.
Figure 3.1: Comparing the two systems: orange represents what was re-written, green what was added.
The version of the lexicon-syntactic description used in this project is version 13 from April 2019. This
version was meanwhile updated with corrections for the problems found during the development of
this work.
The matrix file starts with a header - containing the names of the columns - and it is followed by a set
of properties identifying a specific frozen sentence, one sentence per line. Each column contains an
element of the rule, or a restriction on it. The meaning of each column may be consulted in Chapter 2,
since it has not been changed during this project. During the development of this work, several column
values were found to be incoherent or even incorrect. Some of these problems were detected while generating
the rules, while others were detected using the previously existing validator of the matrix. Besides
this, the values of the matrix have meanwhile evolved, and some columns
became more inclusive regarding the values they accept. An example of this is Modif-E, which only
accepted -, <E> or the explicit value of the modifier. Now, it also accepts +, which means that the complement can
have any optional modifier to its left. Whenever mandatory, that modifier needs to
be explicit in the matrix.
Each column represents a restriction on a rule and, by default, each column has a type of dependency
or a pre-defined POS. The POS of the word connected by that dependency may be altered using a prefix
on that word, as shown in Table 3.1.
The information provided by these POS may be enriched using flags, appended to the POS in the fol-
lowing way: <POS:FLAGS>. For example, a possessive pronoun determinant, feminine and plural,
is written as <DET+POS:fp>. The currently defined flags are m/f for masculine/feminine gender, s/p
for singular/plural number, and O for oblique (personal pronouns).
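The `<POS[:FLAGS]>` notation described above is regular enough to parse mechanically. The following is a minimal sketch of such a parser (a hypothetical helper, not part of STRING), assuming only the tag shape and the five flags described in the text:

```python
import re

# Flags defined in the matrix notation: m/f gender, s/p number, O oblique.
FLAG_NAMES = {"m": "masculine", "f": "feminine",
              "s": "singular", "p": "plural", "O": "oblique"}

def parse_pos_tag(tag: str):
    """Split a <POS[+SUBPOS][:FLAGS]> tag, e.g. <DET+POS:fp>,
    into its POS parts and its expanded flags."""
    m = re.fullmatch(r"<([A-Z+]+)(?::([mfspO]+))?>", tag)
    if m is None:
        raise ValueError(f"not a POS tag: {tag!r}")
    pos_parts = m.group(1).split("+")
    flags = [FLAG_NAMES[c] for c in (m.group(2) or "")]
    return pos_parts, flags

parts, flags = parse_pos_tag("<DET+POS:fp>")
# parts == ['DET', 'POS'], flags == ['feminine', 'plural']
```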
Next, prefixes allow changing the type of dependency connection, or adding special features.
One example of this happens with the word retomar. It should be indicated as the word (verb)
Table 3.1: XIP syntax for POS

General
  No lexical element: <E>
  Surface: prince
  Compound expression: “prince charming”
  Lemma: <prince>
  Two options of a lemma for the same entry: (<prince> + <princess>)
  Two options of a surface for the same entry: (prince + princess)

Parts-of-speech and inflection tags
  One determinant: <DET>
  A possessive pronoun determinant: <DET+POS>
  A possessive pronoun determinant, feminine and plural, followed by a word (e.g. próprias): <DET+POS:fp> “próprias”
  A demonstrative pronoun determinant: <DET+DEM>
  A possessive pronoun: <PRON+POS>
  A personal pronoun: <PRON+PES>

Conventions used for the POS recognized at the moment
  Determinant: DET
  Adjective: A
  Adverb: ADV
  Ordinal, cardinal or quantity: Q
  Preposition: PREP
tomar with a prefix (re-). In order to force the existence of that prefix, the entry should be written as
PFX:<tomar>.
By default, the system contains the following features:
• MOD: modifier
• CDIR: direct complement
• CIND: indirect complement
• PREDSUBJ: subject’s predicate
• PFX: prefixed word
Before serving as input to the entire system, this matrix is validated, as described in Section 2.2. One
important aspect of this implementation is that the names of the columns are pre-defined, so that the
developed program can automatically identify the matrix pattern, that is, associate a column to its value
using the column’s name, regardless of its position. This allows the columns to appear in any order.
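This order-independent lookup amounts to reading the header row once and indexing every cell by its column name. A minimal sketch of the idea, using hypothetical column names from the matrix:

```python
import csv
import io

def rows_by_column_name(csv_text: str):
    """Yield each matrix line as a dict keyed by column name, so a value
    is always found by its column's name, regardless of column order."""
    yield from csv.DictReader(io.StringIO(csv_text))

# The same data with shuffled columns resolves to the same values.
matrix = "V,Det1,C1\nvirar,o,bico\n"
shuffled = "C1,V,Det1\nbico,virar,o\n"
row_a = next(rows_by_column_name(matrix))
row_b = next(rows_by_column_name(shuffled))
assert row_a["V"] == row_b["V"] == "virar"
```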
3.2 Xipificator
The Xipificator is the generic designation for the container of a set of three internal modules, as
seen in Figure 3.2. It takes as input the lexicon-syntactic matrix, an XLSX file, and starts by converting
it to a CSV file. This file then serves as input to two modules developed in the scope of this solution:
Rule Generation and Example Generation.
The first module processes the CSV file and outputs a set of XIP rules to serve as input to the
STRING system, and a text file containing each sentence, either manually produced or automatically
generated, the class it belongs to, and its expected output. The second module uses the CSV file to
output a text file containing a set of sentences: the ones manually inserted in the matrix and some
that were artificially generated from those, containing the transformations each construction allows.
After STRING processes these examples, the output is written into a text file. A final module, the
example validator, compares the output against what was expected, showing the percentage of
correctly identified frozen sentences. Each module is described in the following subsections, starting
with the external converter module.
3.2.1 Converter
This module converts XLSX files to CSV files, readable by the Rule Generation and Example Gener-
ation modules inside the Xipificator. The converter was re-written from the existing one into a Python
module, for integration purposes. It takes as input the XLSX file containing the lexical description ma-
trix, opens it, and transforms each of its cells into a comma-separated value. The resulting CSV file is
then used by the remaining two internal modules in order to generate the rules and the examples,
which are written into two different files.
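The cell-to-value conversion can be sketched as follows, assuming the workbook rows have already been read into lists of cell values (a library such as openpyxl would provide them). This is a hypothetical simplification, not the actual converter; it shows the one detail a naive comma join would get wrong, namely quoting cells that themselves contain commas:

```python
import csv
import io

def cells_to_csv(rows):
    """Serialize rows (lists of cell values, as read from the workbook)
    into CSV text, quoting cells containing commas so that values such
    as 'virar, voltar' survive the conversion intact."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        # Empty cells come back as None from most XLSX readers.
        writer.writerow(["" if cell is None else str(cell) for cell in row])
    return buf.getvalue()

csv_text = cells_to_csv([["V", "C1"], ["virar", "bico"]])
```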
Figure 3.2: Structure of the xipificator
3.2.2 Rule Generation
This module receives as input the CSV file and outputs an XIP file containing the XIP rules generated
from that CSV, represented in Figure 3.2 as ’dependencyFPhrase.xip’, as well as a TXT file, represented in
Figure 3.2 as ’expected.txt’, containing each sentence (manually produced and automatically generated),
the class it belongs to, and its expected output. The latter will later be used by the example validator.
The CSV file is read into the module, which creates an internal structured representation of each of its
lines and their corresponding values. This encapsulates the information inside the program, so that it
is no longer necessary to access external files.
A simplified schema of how the rule generation is performed is shown in Figure 3.3.
Figure 3.3: A schematic representation of the process of generating rules.
The process of generating the rules is complex, given that each line of the matrix is associated with a
corresponding XIP rule, and each possible column value contributes a restriction to that rule. The
module starts by writing the example corresponding to the line just read. Then, it verifies whether the
rule is manually produced. If so, the rule is read from the column Manual. If not, the rule for that
sentence is generated according to the values of the lexicon-syntactic matrix. In case any transformation
can be applied to that sentence, the restrictions associated with that transformation are added to the
rule. If no transformation is applicable, the module writes the rule and the expected value to be
extracted by XIP for that sentence.
The translation of each property takes the form of the corresponding dependency, where each vari-
able corresponds to the name of the column. Regardless of the canonical number of the constituent (0,
1, 2, 3 or 4), all properties, binary or lexical, are translated the same way. Table 3.2 presents the
translation of each column to a XIP restriction. ?V represents the verb of the frozen construction, C1 is
the fixed noun of the first complement, and so on.
Table 3.2: XIP translation for each column
General
N0=Nhum SUBJ(?V,?[UMB-Human])
N0=N-hum SUBJ(?V,?[UMB-Human:∼])
N1=Nhum MOD[post](?V,?[UMB-Human]) or CDIR[post](?V,?[UMB-Human])
N1=N-hum MOD[post](?V,?[UMB-Human:∼]) or CDIR[post](?V,?[UMB-Human:∼])
Vse CLITIC(?V,[ref])
Vc VLINK(?V,?Vc)
NegObrig MOD[neg](?V,?)
V VDOMAIN(?V,?)
Adv1 MOD[post](?V,[adv])
Modif1-E MOD[pre](?V,?C1)
Modif1-D MOD[post](?V,?C1)
Prep1 PREPD(?C1,?Prep1)
Det1 DETD(?C1,?Det1)
C1 MOD[post](?V,?C1)
PronA1 CLITIC(?V,?[acc])
PronR1 CLITIC(?V,?[ref])
PronD1 CINDIR(?V,?) & CLITIC(?V,?[dat])
PronPos2 POSS(?C2,?)
Pass-ser VDOMAIN(lema[pass-ser],?V)
Pass-estar VDOMAIN(lema[pass-ser:∼],?V)
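Table 3.2 amounts to a mapping from column names to restriction templates, instantiated with the rule's variables and the column's value. A simplified sketch (templates abbreviated to a few columns; the template strings and helper are hypothetical, not the generator's actual data structures):

```python
# Each matrix column maps to a XIP restriction template; {v} stands for
# the verb variable and {c1} for the first fixed complement's variable.
TEMPLATES = {
    "N0=Nhum": "SUBJ({v},?[UMB-Human])",
    "Vse":     "CLITIC({v},[ref])",
    "Det1":    "DETD({c1},?[surface:{value}])",
    "Prep1":   "PREPD({c1},?[surface:{value}])",
}

def restriction(column: str, value: str = "") -> str:
    """Instantiate the XIP restriction for one column of the matrix."""
    return TEMPLATES[column].format(v="?V", c1="?C1", value=value)

r = restriction("Det1", "o")
# → 'DETD(?C1,?[surface:o])'
```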
The dependency representation in the form of variables is organized as follows:
• #? – Free variable
• #1 – Subject
• #2 – Verb
• #3 – First complement
• #4 – Second complement
• #5 – Third complement
• #6 – Fourth complement
The subject’s representation is SUBJ(?V,?[UMB-Human]), where the feature UMB-Human determines
whether the subject is human or not (SUBJ(?V,?[UMB-Human:∼])). The verb is defined by the de-
pendency VDOMAIN. This dependency captures the first and the last verb of a verb chain consisting of
one or several auxiliary verbs and a main verb (the last in the chain). Each complement is marked as
CDIR if it does not have a preposition, or as MOD if it does [1]. Determinants and pre- or post-modifiers
are both connected to the constituent’s head. In case one of the modifier columns contains the value -,
the corresponding dependency is accepted but optional. However, if it contains the <E> entry, it is
considered that there is no dependency of that type. Below, an example of the step-by-step generation
process of a rule is presented, for the frozen sentence O João virou o bico ao prego, lit: ‘João turned the
tip to the nail’, ‘to betray’ (class C1P2), which is depicted with its constituents in Figure 3.4.
Figure 3.4: A frozen sentence and the heads of its constituents.
After each dependency tag is associated to a column/column value, the system starts generating
the if() structure of the rule. First, it prints a structure of a dependency link for a verb, and searches
recursively for the elements of a dependency, generating their dependency links, until it reaches the last
dependency.
1. The V column is converted into the restriction VDOMAIN(#?,#2[lemma:virar]);
2. The first complement, column C1, is encoded as CDIR[post] (the post flag refers to the post-verbal
position), because there is no preposition associated with this complement:
CDIR[post](#2,#3[surface:bico]). So the XIP rule evolves into:
if ( VDOMAIN(#?,#2[lemma:virar]) &
CDIR[post](#2,#3[surface:bico]) &
...
)
[1] At this current stage of parsing no distinction is made yet between essential (argument) complements
and adjuncts, so the MOD dependency functions as an umbrella for both cases.
3. The next restriction to be encoded is Det1, which produces the restriction DETD(#3,?[surface:o]).
The rule evolves to:
if ( VDOMAIN(#?,#2[lemma:virar]) &
CDIR[post](#2,#3[surface:bico]) &
DETD(#3,?[surface:o]) &
...
)
4. The second complement, C2, is then translated as MOD[post](#2,#4[surface:prego]). It is im-
portant to notice that its head is connected to the verb, instead of to the previous complement, which
is explicitly marked in the matrix by the property AttachV.
if (VDOMAIN(#?,#2[lemma:virar]) &
CDIR[post](#2,#3[surface:bico]) &
DETD(#3,?[surface:o]) &
MOD[post](#2,#4[surface:prego]) &
...
)
5. Next, Prep2 is encoded as PREPD(#4,?[surface:a]).
The XIP rule now becomes:
if ( VDOMAIN(#?,#2[lemma:virar]) &
CDIR[post](#2,#3[surface:bico]) &
DETD(#3,?[surface:o]) &
MOD[post](#2,#4[surface:prego]) &
PREPD(#4,?[surface:a]) &
...
)
6. Finally, the last column to be encoded as a restriction is Det2, producing DETD(#4,?[surface:o]).
This results in the rule:
if (VDOMAIN(#?,#2[lemma:virar]) &
CDIR[post](#2,#3[surface:bico]) &
DETD(#3,?[surface:o]) &
MOD[post](#2,#4[surface:prego]) &
PREPD(#4,?[surface:a]) &
DETD(#4,?[surface:o])
)
Finally, to allow for easier reading and correction, the rule is represented in the rules file as:
//========================================================
// Example: O João virou o bico ao prego
//========================================================
if ( VDOMAIN(#?,#2[lemma:virar]) &
CDIR[post](#2,#3[surface:bico]) &
DETD(#3,?[surface:o]) &
MOD[post](#2,#4[surface:prego]) &
PREPD(#4,?[surface:a]) &
DETD(#4,?[surface:o])
)
FIXED(#2, #3, #4)
////ORIGINAL O João virou o bico ao prego
////EXPECTED FIXED(virar, bico, prego)
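The step-by-step assembly above reduces to joining the accumulated restrictions with `&` and wrapping them in the `if (…) FIXED(…)` skeleton. A minimal formatting sketch (hypothetical helper; it renders a rule in the layout shown, without the generator's column logic):

```python
def build_rule(example: str, restrictions: list[str], args: list[str]) -> str:
    """Render one XIP rule: a comment header with the example sentence,
    the restrictions joined by '&', and the FIXED dependency to extract."""
    bar = "//" + "=" * 56
    header = f"{bar}\n// Example: {example}\n{bar}"
    body = " &\n     ".join(restrictions)
    fixed = "FIXED(" + ", ".join(args) + ")"
    return f"{header}\nif ( {body}\n)\n{fixed}"

rule = build_rule(
    "O João virou o bico ao prego",
    ["VDOMAIN(#?,#2[lemma:virar])",
     "CDIR[post](#2,#3[surface:bico])",
     "DETD(#3,?[surface:o])"],
    ["#2", "#3", "#4"],
)
```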
The following step is to integrate the rules in the XIP dependencies file. When running the afore-
mentioned example on STRING, each variable of the rule is instantiated according to the performed
analysis of the elements of the sentence, obtaining the following dependencies:
VDOMAIN(virou,virou)
CDIR[post](virou,bico)
DETD(bico,o)
MOD[post](virou,prego)
PREPD(prego,a)
DETD(prego,o)
These dependency rules will then be compared against those found in the output provided by XIP:
MAIN(virou)
DETD(João,O)
DETD(bico,o)
DETD(prego,o)
VDOMAIN(virou,virou)
MOD_POST(virou,prego)
SUBJ_PRE(virou,João)
CDIR_POST(virou,bico)
Given that the elements of the generated rule are present in the output, and therefore the restrictions
are satisfied, the FIXED dependency is extracted as FIXED(virou,bico,prego).
A simplified rule generation example per class is presented in Appendix A.
Whenever a transformation may be applied to a sentence, two things happen:
• The Example Generation module automatically generates a sentence containing the transforma-
tion(s);
• The restrictions relative to that transformation are added to the rule or, in case the transformation
is either [Pass-ser] or [Pass-estar], no restrictions are added to the base sentence rule and,
instead, a new rule is generated for the sentence after the transformation has been performed.
The general restrictions for each transformation are described in Table 3.3; the passive form, however,
requires more work. First, the verb itself has to be encoded in a different way. Then, a conversion of
the constituents from the base sentence to the passive form is also performed: whenever a sentence
contains either a direct complement or a post-modifier, it becomes the subject in the new rule,
generated to represent the passive form of that sentence. In the example O Rui deixou a Inês em paz,
lit: ‘Rui left Inês in peace’, ‘to leave someone alone’, Inês plays the role of direct complement,
CDIR[post](#2,#3[UMB-Human,UMB-Human:∼]). However, when transforming the sentence to the
passive form, this constituent becomes the subject, SUBJ(#2,?). The conversion is performed using a
table that maps each element of the active form to its passive-form counterpart, containing, for now,
only the elements CDIR and MOD[post].
Table 3.3: Restrictions to be added to the rule of the base sentence
[PronA]      ( CDIR[post](?V,?C1[UMB-Human]) || CLITIC(?V,?[acc]) || CLITIC(?V,?[ref]) )
[PronR]      ( CDIR[post](?V,?C1[UMB-Human,UMB-Human:∼]) || CLITIC(?V,?[acc]) || CLITIC(?V,?[ref]) )
[PronD]      ( ( MOD[post](?V,?C1[UMB-Human]) & PREPD(?C1,?) ) || ( CINDIR(?V,?) & CLITIC(?V,?[dat]) ) )
[PronPos]    ( ( MOD[post](?C2,?C3[UMB-Human]) & PREPD(?C3,?) ) || POSS(?C1,?) )
[RDat]       ( ( MOD[post](?C2,?C3[UMB-Human]) & PREPD(?C3,?) ) || CLITIC(?V,?[dat]) )
[Pass-Ser]   VDOMAIN(#?,#2[pass-ser,?V])
[Pass-Estar] VDOMAIN(#?,#2[pass-ser:∼,?V])
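The active-to-passive constituent conversion described above can be sketched as a small mapping applied to the dependency name of each restriction. This is a hypothetical helper, covering only the two elements the text says are converted so far (CDIR and MOD[post] become the subject):

```python
# Active-form dependency -> its role in the passive-form rule.
ACTIVE_TO_PASSIVE = {
    "CDIR[post]": "SUBJ",
    "MOD[post]":  "SUBJ",
}

def passivize(restriction: str) -> str:
    """Rewrite one restriction for the passive-form rule, leaving
    dependencies without a passive counterpart untouched."""
    name = restriction.split("(", 1)[0]
    return ACTIVE_TO_PASSIVE.get(name, name) + restriction[len(name):]

p = passivize("CDIR[post](#2,#3[UMB-Human])")
# → 'SUBJ(#2,#3[UMB-Human])'
```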
A configuration file has been created to make it possible to determine which restrictions are to be
applied to the generated rule. The controllable restrictions are determinants, prepositions and
modifiers, both to the left and to the right of the frozen head noun, and any distributional constraints
on any of the free complements/subject. This file also contains the column numbers that correspond to
each element to be encoded in the rules. This gives the user total freedom to choose how restrictively
the rules are to be applied, which, in turn, makes the system more flexible.
The output of this module consists of two files: one containing the original sentence as well as the
rule that describes it, which will be used by STRING to extract the FIXED dependency; and another,
containing the sentence, the class it belongs to, and the expected output for that sentence.
3.2.3 Example Generation
Using the data structures created from the CSV file, this module outputs a file containing the arti-
ficially generated sentences, represented in Figure 3.2 as ’examples.txt’. These sentences are generated
from the information encoded for the corresponding manually produced sentence, which was produced
by linguists trying to capture the basic structure and distribution of the frozen sentences. The artificially
generated sentences correspond to the forms produced by applying the transformations accepted by a
given construction, as encoded in the matrix.
The generation starts by verifying, for each sentence, whether its description contains a positive value
for any of the columns [PronR1], [PronA1], [PronPos2], [Rdat1], [Rdat2], [PronD1], [PronD2],
[Pass-estar] or [Pass-ser]. If so, the system reads, from the description of that sentence, each of its
constituents. Using this, it generates a new sentence (one sentence per transformation), containing the
mandatory complements after the changes required by each transformation have been applied.
Although, in an initial phase, these new sentences were written alongside the corresponding base
sentence in a text file, it was later decided that they should be written into separate files, according to
the type of transformation applied to them. This allowed for an easier manual verification and
validation of the obtained sentences for each transformation. They are run on STRING and later
validated separately from the base sentences, which allows for a clearer distinction of the system’s
performance for each type of sentence, manually produced or automatically generated.
The common mechanism for generating each sentence is described in Figure 3.5. Next, the generation
process for each transformation is detailed.
The verbs in the active form are read from the column V, in the infinitive form. Their third person,
singular, present tense conjugation is read from a file, ’Verb3s.txt’, previously generated by ViPEr. The
verbs in the passive form are handled differently, because this form requires the auxiliary ser or estar
to be added before the main verb. So the main verb is read from the column V, in the infinitive form,
and its past participle form is read from a file, ’VerbVpp.txt’, previously generated by ViPEr.
In order to generate the subject, and any complement not explicit in a C column, the columns N$=Nhum
and N$=N-hum are verified, in order to determine whether that constituent is human or non-human. If
it is human, a name is chosen randomly from a list of names. If it is not human, a generic noun, Isso,
lit: ‘that’, is used in the generation of the sentence. A description of how the sentences were generated
for each transformation follows:
Generating reflexive pronoun sentences
After the subject is set and the verb is read, the latter is rewritten by adding the suffix ’-se’
to it. The following complements, determinants and prepositions are written after the verb. So,
using the description of the sentence O Pedro reduziu a Ana à sua insignificância, lit: ‘Pedro
reduced Ana to her insignificance’, the system generates the sentence: O Pedro reduziu-se à sua
insignificância, lit: ‘Pedro reduced himself to his insignificance’.
Figure 3.5: General mechanism for generating example sentences
Generating accusative pronoun sentences
After verifying the gender of the first complement, to be replaced by the accusative pronoun (and
therefore not written in the sentence), the suffix ’-a(s)’ or ’-o(s)’, ‘him/her’, is added to the verb, if
regular, re-writing it. However, irregular verbs, namely those ending in ’z’, had to be processed in a
specific way in order to obtain the correct third person, singular, present tense conjugation. There are
two situations:
• The verb trazer, lit: ‘to bring’, whose required conjugation is traz, lit: ‘brings’, has its last two
letters replaced by á-lo, so traz is transformed into trá-lo. If this had not been implemented, the
transformation would turn the verb into traz-o, which is incorrect.
• Any other verb ending in ’z’ has this letter replaced by -lo. So, for example, conduz
is transformed into condu-lo. If this had not been implemented, the transformation would
re-write the verb into conduz-o, which is incorrect.
The following complements, determinants and prepositions are written after the verb. So from the
description of the sentence O João tirou o Pedro da lama, lit: ‘João took Pedro out of the mud’, ‘to help
someone get out of a complicated situation’, the system generates O João tirou-o da lama, lit: ‘João took
him from the mud’.
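The clitic spelling rules above (regular verbs take '-o/-a'; traz becomes trá-lo; other z-final verbs drop the 'z' before '-lo') can be sketched as a small function. This is a hypothetical helper restricted to the cases discussed in the text, not a general Portuguese clitic attacher:

```python
def attach_accusative(verb: str, gender: str) -> str:
    """Attach the accusative clitic to a 3rd-person-singular verb form,
    following the spelling rules for z-final verbs described above."""
    if verb == "traz":            # special case: traz -> trá-lo / trá-la
        return "trá-lo" if gender == "m" else "trá-la"
    if verb.endswith("z"):        # conduz -> condu-lo, not *conduz-o
        return verb[:-1] + ("-lo" if gender == "m" else "-la")
    return verb + ("-o" if gender == "m" else "-a")

assert attach_accusative("tirou", "m") == "tirou-o"
assert attach_accusative("traz", "m") == "trá-lo"
assert attach_accusative("conduz", "m") == "condu-lo"
```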
Generating dative pronoun sentences
There are two types of dative pronoun transformations. The first one is relative to the first com-
plement, and it happens whenever [PronD1] is marked positive. After the subject is set and the
verb is read, the suffix ’-lhe’, ‘to him/her’, is added to the verb, re-writing it. Because this replaces
the first complement, the latter is not written in the generated sentence. This way, from the description
of the sentence A sorte bateu ao Pedro, lit: ‘Luck hit to Pedro’, ‘Pedro was lucky’, the system generates
A sorte bateu-lhe, ‘Luck hit him’.
The second type is relative to the second complement, and it occurs whenever the entry [PronD2]
is marked as positive. The suffix ’-lhe’ added to the verb here replaces the second complement.
So from the description of the sentence O João deve favores ao Pedro, lit: ‘João owes favours to
Pedro’, the system generates the sentence O João deve-lhe favores, lit: ‘João owes him favours’. The
following complements, determinants and prepositions are written after the verb.
Generating possessive sentences
In case [PronPos2] is marked as positive, the word seu(s) or sua(s), lit: ‘his’ or ‘her’, according
to the second complement’s gender, is added ahead of the first complement. Because this replaces
the second complement, the latter is not written in the generated sentence. For example, when reading
the description of the sentence A sorte bateu à porta do Pedro, lit: ‘Luck hit on the door of Pedro’,
‘Pedro was lucky’, the complement do Pedro, lit: ‘of Pedro’, is ignored and the generation process
replaces it with sua, lit: ‘his’. Therefore the generated sentence is A sorte bateu à sua porta, lit:
‘Luck hit on his door’.
Generating dative restructured sentences
There are two types of dative restructuring transformations, but both transform a de_Nhum into
a_Nhum. The first one is relative to the second complement, and it happens whenever [Rdat2] is
marked positive. The suffix ’-lhe’ here replaces the second complement, while keeping the first.
Therefore, using the description of the sentence A sorte bateu à porta do Pedro, lit: ‘Luck hit to the
door of Pedro’, ‘Pedro was lucky’, the system generates A sorte bateu-lhe à porta, lit: ‘Luck hit on
his door’. The second type is relative to the third complement, and it occurs whenever the entry
[Rdat3] is marked as positive. The suffix ’-lhe’ here replaces the third complement, generating,
from the description of the sentence O João entregou o livro em mãos ao Pedro, lit: ‘João delivered
the book in hands to Pedro’, the following: O João entregou-lhe o livro em mãos, lit: ‘João delivered
him the book in hands’. This transformation is mutually exclusive with [PronD], so whenever one
is marked as positive the other cannot be positive as well.
Generating passive sentences
Two types of passive have been considered, namely the one with the auxiliary verb ser and the
one with the verb estar (and its variants, especially ficar, ‘to stay’, and continuar, ‘to continue’; all
correspond to English ‘to be’, the difference being only aspectual). As for the passive transformation,
a positive value in the column [Pass-ser] causes the system, using the description of the example
O Rui arrastou o nome da Rita pela lama, lit: ‘Rui dragged the name of Rita through the mud’, to
generate Isso foi arrastado pela lama, lit: ‘This was dragged through the mud’. The reason for replacing
the constituent nome da Rita, lit: ‘name of Rita’, with isso, lit: ‘this’, is that the description only demands
a non-human noun in that position; therefore nome, lit: ‘name’, can be replaced by a generic
non-human noun, and da Rita, lit: ‘of Rita’, becomes unnecessary in the generated sentence.
As for the passive transformation using the verb estar, ‘to be’, [Pass-estar], the description of
the example O João controla o Pedro com rédea curta, lit: ‘João controls Pedro with short reins’,
‘to very rigorously control someone’, generates O Fernando está controlado com rédea curta, lit:
‘Fernando is controlled with short reins’.
The automatic generation of these sentences went through several iterations, as the results were
manually verified by a linguist. The criteria for evaluating them always considered the characteristics
of XIP and its restrictions.
The generation process proved very important for the manual validation, by a linguist, of the values
present in the matrix. It also allows for the detection of a set of restrictions that might not be
represented in the matrix, given that their properties may not have been studied yet (for example, the
tense and mood of the verbs).
3.2.4 Example Validation
The example validator receives as input the output generated by STRING, written in an XML file.
From it, the validator extracts what was effectively obtained and builds a text file with this informa-
tion: the sentence, what was expected, and the obtained result, represented in Figure 3.2 as ’output.txt’.
Given that the rule generator already provides the system with what is expected, the next step is to
compare the two: what was expected and what was obtained.
The validator considers three criteria in order to evaluate a success, and "how much" of a success the
detection was, as seen in Figure 3.6. These criteria range from the least specific to the most specific, in
the following order:
1. Checking whether the FIXED dependency was extracted;
2. Checking whether the number of arguments of that dependency matches the expected number of
arguments;
3. Asserting that the arguments of that dependency match the expected arguments.
The result of processing each sentence through STRING is an XML file with each sentence represented
as an LUNIT. Each of these LUNITs is parsed until the FIXED dependency is found. When it is found,
the arguments of this dependency are parsed, and their lemmas extracted. The element with index 0
is the verb, and all the other indexes correspond to the remaining constituents that are part of that
dependency. The obtained FIXED dependency is then rebuilt from the output, until something such as
FIXED(0, 1, 2, 3...) is obtained (with 0 being the verb, and 1, 2 and 3 the remaining arguments).
In case no FIXED dependency is extracted, the parser returns "FAILED".
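The three criteria can be checked in order of increasing specificity. A minimal sketch of the comparison (hypothetical helper; expected and obtained dependencies are represented as (verb, *arguments) tuples, with None meaning no FIXED was extracted):

```python
def validate(expected, obtained):
    """Compare an expected FIXED dependency against the obtained one,
    applying the three criteria from least to most specific."""
    if obtained is None:                    # 1. was FIXED extracted at all?
        return "FAILED"
    if len(obtained) != len(expected):      # 2. right number of arguments?
        return "WRONG_ARG_COUNT"
    if tuple(obtained) != tuple(expected):  # 3. exactly the expected arguments?
        return "WRONG_ARGS"
    return "OK"

assert validate(("abafar", "escândalo"), ("abafar", "escândalo")) == "OK"
assert validate(("dizer", "missa"), ("dizer",)) == "WRONG_ARG_COUNT"
assert validate(("combater", "moinhos de vento"), None) == "FAILED"
```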
After the example validation is concluded, an output report is generated by the system, containing
each sentence, its expected value and its result value. In case the FIXED dependency is not extracted at
all, an "X" is written at the beginning of the sentence. This allows for an easier regex isolation of the
failed sentences, and for more efficient problem solving:
Figure 3.6: Example validation criteria
X - Sentence: O João combate moinhos de vento.
Expected value: FIXED(combater, moinhos de vento)
Result value: FAILED
In case the dependency FIXED is extracted but the number of arguments is not the same, the follow-
ing is written to the output file:
Sentence: O Padre disse a missa.
Detected FIXED with wrong number of arguments.
Expected value: FIXED(dizer, missa)
Result value: FIXED(dizer)
As for the case where the dependency FIXED was extracted, the number of arguments is correct, and
the arguments are exactly the ones expected, this is the output:
Sentence: A imprensa abafou um escândalo.
The arguments are the same.
Expected value: FIXED(abafar, escândalo)
Result value: FIXED(abafar, escândalo)
This means that the expected extracted dependency is FIXED(abafar, escândalo), and that was
exactly what was extracted. Therefore, the validator considers this a successful detection under all
three criteria, and adds it to the number of correctly identified frozen sentences. Although initially
the extraction of the FIXED dependency was enough to consider a case successful, the system evolved
to counting the number of arguments of the FIXED dependency and checking whether it matched the
expected number, and finally to evaluating whether the arguments are exactly the same.
The output file is a report that shows the percentage of correctly identified sentences separated per
class, as well as the global percentage for all sentences.
The statistics for each class are presented as a header of all the sentences belonging to that class.
Below, the statistics for the manually produced sentences belonging to class C1 are presented:
--------------------------
IDENTIFIED 497 OUT OF 500 SENTENCES, 3 MISSING
STATS FOR CLASS C1: 0.994; 0.006000000000000005 MISSING
IDENTIFIED 495 OUT OF 500 SENTENCES, 5 MISSING
STATS FOR CLASS - NUMBER OF ARGUMENTS: 0.99; 0.010000000000000009 MISSING
IDENTIFIED 486 OUT OF 500 SENTENCES, 14 MISSING
STATS FOR CLASS - ARGUMENTS: 0.972; 0.028000000000000025 MISSING
--------------------------
The global percentage of identified sentences, for all three criteria, is presented at the bottom of each
file:
--------------------------
IDENTIFIED AS FIXED 2430 OUT OF 2542 SENTENCES, 112 MISSING
TOTAL STATS FOR FIXED: 0.955940204563336; 0.04405979543666405 MISSING
IDENTIFIED 2401 OUT OF 2542 SENTENCES, 141 MISSING
TOTAL STATS FOR NUMBER OF ARGUMENTS: 0.9445318646734855; 0.05546813532651451 MISSING
IDENTIFIED 2337 OUT OF 2542 SENTENCES, 205 MISSING
TOTAL STATS FOR ARGUMENTS: 0.9193548387096774; 0.08064516129032262 MISSING
--------------------------
The system was automated using a makefile that runs all the modules. It starts with the rule gener-
ator module, replacing the previous rule set on XIP with the generated ones. It also runs the Example
Generator, and all the examples are run through STRING. The results are then retrieved and put through
the validator, which then outputs the report. It takes six and a half minutes for the system to perform
all these tasks.
This script may be found in Appendix B.
3.3 Improvements
When comparing the developed solution with the previously existing one, mainly by observing Fig-
ure 3.1, several important aspects may be pointed out:
1. The implementation of automatic generation of examples for the passive form and pronominal-
ization is a very important feature, because it allows a variation of the same sentence to be
recognized;
2. The fact that the sentences are not run one at a time, but in a single file instead, allows for a
significant reduction of the time it takes to obtain results. In the previous system, processing
2,542 sentences would take around 18 hours, while the developed one takes six and a half minutes;
3. The new example validator allows for a more detailed detection of frozen sentences and errors. In
the previous system, the only factor taken into account was whether the FIXED dependency had
been extracted or not. Now, the three different criteria for validation, combined with the output
report, allow for a more precise detection of errors in the generation.
Chapter 4
Evaluation
This chapter describes the evaluation process and methods used in this work. It starts by de-
scribing the structure of the corpus to be evaluated, and the methods used to evaluate it. Then,
the results are presented, followed by an analysis of the results obtained after processing
this corpus. Finally, a comparison between the new solution and the previously existing one, in
terms of the number of frozen sentences detected, is performed.
Despite the multiple iterations the system went through in order to further improve its results, the
system had to be frozen at some point, so that the results could be registered. This was done on
version 23, May 2019.
The system is initialized by running a makefile, which takes around six and a half minutes to finish
its execution. This time includes generating the rules and examples, running all the examples through
STRING, and validating the obtained results.
4.1 Analysing the corpus
The corpus to be evaluated was divided into two parts. The first one contains all the manually produced
sentences, or base sentences, that is, 2,542 sentences extracted from the matrix; the second one is a set
of 1,173 sentences artificially generated from the description of the base sentences, considering the
transformations encoded in the lexicon-grammar matrix. The distribution of generated sentences per
class only considers the entries accepting these transformations. These sentences were manually veri-
fied by a linguist, and they played a big part in the correction and improvement of the lexicon-grammar,
because it is required that their corresponding base sentences’ lexical description is clear and correct.
Each transformation’s distribution per class, as well as the distribution of the manually produced sen-
tences per class can be observed in Table 4.1. The set of artificially generated sentences was broken down
by transformation, so that each could be evaluated separately. This allows a system’s performance eval-
uation per transformation, rather than evaluating the performance of all the generated sentences.
The joint set corresponds to a total of 3,715 frozen sentences, grouped into classes, as described in Chapter
2, and the evaluation was performed not only globally, but also per class.
Table 4.1: Sentence distribution per class.
Class # Manual # [PronR] # [PronA] # [PronD] # [RDat] # [PronPos] # [PassSer] # [PassEstar]
C1 500 0 0 0 0 0 3 3
C0-E 1 0 0 0 0 0 0 0
CDN 45 0 0 0 0 34 0 0
CAN 182 0 0 0 181 178 0 0
CNP2 172 18 172 0 0 0 169 73
C1PN 259 0 0 138 4 3 3 3
C1P2 291 0 0 0 0 0 0 0
CPPN 46 4 15 6 9 3 10 4
CPP 181 0 0 26 0 4 0 0
CP1 662 0 0 0 0 0 0 0
CPN 103 0 0 0 2 96 0 0
C0 21 0 2 5 2 2 0 0
CADV 70 0 0 0 0 0 0 0
CV 13 0 0 0 0 0 0 0
TOTAL 2,542 22 189 176 198 320 185 83
Observing Table 4.1, it should be noted that these classes do not have the same degree of lexical
coverage, as the collection for some of them is still ongoing or has only recently started1. Despite this,
the results for all the classes will be shown.
Class CP1 is the most significant when considering the set of base sentences, representing around 26% of
the total amount of sentences. It is followed by C1, which represents around 20% of this set. The least
representative class is C0-E, containing only one entry. Classes C0, CADV, CDN, CPPN and CV are not very
numerous, each of them containing less than 100 sentences.
The transformation with the broadest distribution within the lexicon-grammar matrix is the possessive
pronominalization ([PronPos]), corresponding to 27% of the total number of generated sentences.
At the other end, the reflexive pronominalization ([PronR]) corresponds to only 1,8% of the generated
sentences.
[PronR] and [PronA] occur more regularly in class CNP2, due to the pronominalization of its fixed
direct complement. [PronD] occurs mainly in class C1PN, by pronominalizing its free prepositional
complement. [RDat] and [PronPos] occur more frequently in class CAN, because its free determinative
complement may either undergo a dative restructuring or be reduced to a possessive pronoun.
[PassSer] and [PassEstar] occur more regularly in class CNP2, whose free direct complement becomes
the subject. Although the number of artificially generated sentences is half the number of manually
produced sentences, these examples are of significant importance because they are variations of the
base sentences, and they may appear in texts replacing the base sentences.
1 These are mainly classes C0, C0-E, CADV and CV.
4.2 Evaluation method
The evaluation was performed following three criteria:
1. Checking whether the FIXED dependency was extracted;
2. Checking whether the number of arguments of that dependency matches the expected number of
arguments;
3. Asserting that the arguments of that dependency match the expected arguments.
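For illustration only, the three criteria can be sketched as successive checks on an extracted dependency. The function name and data shapes below are hypothetical, not the actual validator's API; a dependency is represented here simply as the tuple of its arguments, with None meaning no FIXED dependency was extracted.

```python
# Hypothetical sketch of the three validation criteria. An extracted FIXED
# dependency is modelled as a tuple of its arguments, e.g. ("ir", "cara"),
# and None when no dependency was extracted at all.

def validate(extracted, expected):
    """Return a (criterion1, criterion2, criterion3) triple of booleans."""
    # Criterion 1: was the FIXED dependency extracted at all?
    dependency_extracted = extracted is not None
    # Criterion 2: does the number of arguments match the expected one?
    same_arity = dependency_extracted and len(extracted) == len(expected)
    # Criterion 3: are the arguments exactly the expected ones?
    same_args = same_arity and tuple(extracted) == tuple(expected)
    return dependency_extracted, same_arity, same_args

# Example discussed later in the text: the rule expects FIXED(ir, cara) but
# the validator builds FIXED(ter, cara) -> extracted, same arity, wrong args.
print(validate(("ter", "cara"), ("ir", "cara")))  # (True, True, False)
```

Each criterion subsumes the previous one, which is why the counts in the result tables can only decrease from left to right.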
These three criteria were defined so that there is a notion of how exact the extraction of the FIXED
dependency was, answering the question 'was the dependency extracted with the arguments that were
expected?'. It is crucial to interpret these results according to their lexical representativity, that is,
the number of expressions in the lexicon, as seen in Table 4.1. Bearing this in mind, and because the
lexical representativity is not the same for all classes, the total result is not calculated as an average
of the results of the classes: it is calculated for each class, and for the total amount of sentences. This
means that some classes may have a very low recall; however, this is not critical for the overall picture,
especially if they contain a low number of sentences. Therefore, an intrinsic evaluation is performed, by
measuring the recall:

Recall = TruePositives / (TruePositives + FalseNegatives)

which, in this situation, translates into the number of frozen sentences detected amongst the entire set of
frozen sentences, that is, the proportion of actual positives that were identified correctly. One important
highlight is that there are no false positives, because all the sentences in the corpus are assumed to be
actual frozen sentences, so any detection is a correct one. For the generated sentences, that implies
performing a thorough manual verification.
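As a minimal sketch (the function name is ours, not part of the system): since the corpus contains only frozen sentences, every undetected sentence counts as a false negative, and the denominator of the recall is simply the corpus size. The figures below reproduce the totals reported later in Table 4.2.

```python
# Recall = TruePositives / (TruePositives + FalseNegatives).
# Here the denominator is just the total number of frozen sentences,
# since all sentences in the corpus are actual positives.

def recall(true_positives, total_sentences):
    return true_positives / total_sentences

total = 2542              # base sentences in the matrix (Table 4.1)
extracted_fixed = 2430    # FIXED dependency extracted (Table 4.2)
exact_arguments = 2337    # exact arguments matched (Table 4.2)

print(f"{recall(extracted_fixed, total):.1%}")  # 95.6%
print(f"{recall(exact_arguments, total):.1%}")  # 91.9%
```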
4.3 Results
4.3.1 Base sentences
Table 4.2 presents the results obtained for the base sentences, by class and by the total number of
sentences, according to different criteria and different ways to interpret the results.
Table 4.2: Manually produced sentences correctly identified as frozen.
Class  # Total  # Extracted FIXED  % Extracted FIXED  # Correct number of arguments  % Correct number of arguments  # Correct arguments  % Correct arguments
C0 21 20 95,2% 18 85,7% 15 71,4%
C0-E 1 0 0,0% 0 0,0% 0 0,0%
C1 500 497 99,4% 495 99,0% 486 97,2%
C1P2 291 287 98,6% 280 96,2% 265 91,1%
C1PN 259 251 98,4% 245 96,1% 242 94,9%
CADV 70 66 94,3% 66 94,3% 63 90,0%
CAN 182 173 95,1% 173 95,1% 171 94,0%
CDN 45 44 97,8% 44 97,8% 43 95,6%
CNP2 172 170 98,8% 170 98,8% 167 97,1%
CP1 662 618 93,4% 614 92,7% 600 90,1%
CPN 103 82 79,2% 79 76,7% 72 70,0%
CPP 181 167 92,3% 165 91,2% 161 89,0%
CPPN 46 45 97,8% 45 97,8% 45 97,8%
CV 13 10 76,9% 7 53,8% 7 53,8%
TOTAL 2,542 2,430 95,6% 2,401 94,5% 2,337 91,9%
Each line of Table 4.2 refers to a class, indicated in the first column; the last line refers to the totals.
The second column, named # Total, contains the total number of sentences for that class. The third and
fourth columns contain, respectively, the number and the percentage of sentences from which the FIXED
dependency was extracted. The fifth and sixth contain, respectively, the number and the percentage
of sentences from which the FIXED dependency was extracted with the expected number of arguments.
The seventh and eighth columns contain, respectively, the number and the percentage of sentences from
which the FIXED dependency was extracted with the exact arguments expected. The percentages are
calculated by dividing the value of the cell by the total number of sentences of that class.
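As a concrete instance of this calculation, using the C1 row of Table 4.2:

```python
# Each percentage cell is the count divided by the class total.
# Values taken from the C1 row of Table 4.2.
class_total = 500         # C1 base sentences
extracted_fixed = 497     # C1 sentences with FIXED extracted

percentage = extracted_fixed / class_total * 100
print(f"{percentage:.1f}%")  # 99.4%
```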
In an overall observation, it is possible to see that the task of recognizing manually produced sentences
was rather successful, with the FIXED dependency being extracted for 95,6% of these sentences. Of the
total, 94,5% had the expected number of arguments and 91,9% had the exact arguments expected.
Therefore, there is only a difference of 3,7% between the number of sentences from which the FIXED
dependency was extracted and the number of sentences with the actual correct arguments, the most
specific criterion. This means that whenever the FIXED dependency is detected, it is very likely to contain
at least the correct number of arguments. Some errors are related to constructions using past participles,
such as O Zé tinha ido com a cara da Ana, lit: 'Zé had gone with Ana's face', which means to like
someone. The rule for this sentence expects the output FIXED(ir, cara), but the validator builds the
extracted output as FIXED(ter, cara). The error is in the validator itself, which is taking as the main
verb ter, lit: 'to have', instead of ir, lit: 'to go'. So, according to the validator, the number of arguments
is the same but the arguments do not match, even though STRING is extracting the dependency correctly.
For several other unusual constructions there are also mismatches between expected and obtained
outputs, probably because the rules generated by the system do not accommodate these constructions.
This might be a problem when extrinsically evaluating the system, since the occurrence of such
constructions in texts may be high. Other errors are mainly due to STRING's wrong POS tagging and
disambiguation.
The development of the rule generation was done through several iterations. The result of each iteration
required manual validation of the rules, and several problems were detected in STRING, in the
developed system and in the lexicon-grammar description. For instance, STRING did not interpret
compound adverbial expressions as a compound, but rather as individual components, therefore
failing to identify as FIXED many of the sentences belonging to class CADV. However, in a final phase of
this project, a detailed manual validation of each problem was performed, and corrections were applied
to both STRING and the developed system. This greatly improved not only the detection of FIXED
sentences, but also of FIXED sentences containing the correct arguments. Before this manual validation,
the values ranged from 79% for the most specific criterion to 86,6% for the least specific one.
4.3.2 Artificially generated sentences
Tables 4.3 to 4.9 present the results obtained for the artificially generated sentences, by class and in
total, split by transformation. Each line of these tables refers to a class, represented in the first column;
the last line refers to the totals. The second column, named # Total, contains the total number of
sentences belonging to that class. The third and fourth columns contain, respectively, the number and
the percentage of sentences from which the FIXED dependency was extracted. The fifth and sixth contain,
respectively, the number and the percentage of sentences from which the FIXED dependency was
extracted with the expected number of arguments. The seventh and eighth columns contain, respectively,
the number and the percentage of sentences from which the FIXED dependency was extracted with the
exact arguments expected. The percentages are calculated by dividing the value of the cell by the total
number of sentences of that class to which the transformation in question may be applied.
Following these tables, an overall analysis of the obtained results is performed.
Table 4.3: Artificially generated sentences for [PronA] correctly identified as frozen
Class  # Total  # Extracted FIXED  % Extracted FIXED  # Correct number of arguments  % Correct number of arguments  # Correct arguments  % Correct arguments
C0 2 2 100,0% 2 100,0% 2 100,0%
CNP2 172 156 90,1% 156 90,1% 156 90,1%
CPPN 15 12 80,0% 12 80,0% 12 80,0%
TOTAL 189 170 90,0% 170 90,0% 170 90,0%
For the [PronA] transformation, described in Table 4.3, the obtained results were very satisfactory.
Only 19 sentences were not identified as frozen, and every sentence detected as FIXED contained the
expected arguments. The failures are probably due to faulty functioning of STRING. One example is
the sentence O Filipe conhece-o de nome, lit: 'Filipe knows him by name', and several similar sentences
containing the preposition de, lit: 'by'. The chain is not able to extract the FIXED dependency for this
type of sentences.
Table 4.4: Artificially generated sentences for [PronR] correctly identified as frozen
Class  # Total  # Extracted FIXED  % Extracted FIXED  # Correct number of arguments  % Correct number of arguments  # Correct arguments  % Correct arguments
CNP2 18 17 94,4% 17 94,4% 17 94,4%
CPPN 4 4 100% 4 100% 4 100%
TOTAL 22 21 95,5% 21 95,5% 21 95,5%
The frozen sentence identification process was very successful for the [PronR] transformation, as shown
in Table 4.4, with 95,5% identification on all three criteria, and only one sentence failing. However, this
transformation is also the one with the smallest number of elements, 22, being, therefore, the least
representative of all transformations, so each failure takes a toll on the calculations.
The only sentence for which the FIXED dependency fails to be extracted is O Fernando vê-se ao perto.
The rule for this expression expects a MOD[post](se,"ao perto"), which is not extracted by STRING.
Table 4.5: Artificially generated sentences for [PronPos] correctly identified as frozen
Class  # Total  # Extracted FIXED  % Extracted FIXED  # Correct number of arguments  % Correct number of arguments  # Correct arguments  % Correct arguments
C0 2 2 100% 2 100% 2 100%
C1PN 3 2 66,7% 2 66,7% 2 66,7%
CAN 178 176 98,9% 176 98,9% 176 98,9%
CDN 34 33 97,1% 33 97,1% 33 97,1%
CPN 96 78 81,3% 77 80,2% 77 80,2%
CPP 4 4 100,0% 4 100,0% 4 100,0%
CPPN 3 3 100,0% 3 100% 3 100%
TOTAL 320 298 93,1% 297 92,8% 297 92,8%
The sentences generated for the [PronPos] transformation achieved, overall, very good results, as shown
in Table 4.5. One example of a problem in the rule generation is the sentence O Pedro quis mal à Maria,
lit: 'Pedro wanted harm to Maria', i.e. wishing bad things to happen to someone. STRING extracts a
CDIR_POST(quer,seu), while the rule expects a MOD_POST(quer,seu). Another example, this time due
to errors in the chain, is the sentence O Henrique acaba com a sua raça, lit: 'Henrique ends with
someone's race', i.e. to kill someone. Here, all the obtained restrictions are expected by the rule, but
the dependency is not extracted, probably due to disambiguation issues with the word raça.
Table 4.6: Artificially generated sentences for [PronD] correctly identified as frozen
Class  # Total  # Extracted FIXED  % Extracted FIXED  # Correct number of arguments  % Correct number of arguments  # Correct arguments  % Correct arguments
C0 5 5 100% 5 100% 5 100%
C1PN 138 131 94,9% 131 94,9% 131 94,9%
CPP 26 21 80,8% 21 80,8% 21 80,8%
CPPN 7 3 42,9% 3 42,9% 3 42,9%
TOTAL 176 160 90,9% 160 90,9% 160 90,9%
The sentences generated by the [PronD] transformation were, for the most part, adequately parsed,
achieving very good results, as shown in Table 4.6. Every time the FIXED dependency is extracted, it
contains the correct arguments.
Some failures are related to the fact that the generated rules are missing some components. One example
is the sentence generated from the description of O João entregou em mãos o livro ao Pedro, lit: 'João
handed in hands the book to Pedro'. What is being generated is the sentence O João entrega-lhe, lit:
'João delivers to him', while it should be O João entrega-lhe algo em mãos, lit: 'João delivers to him
something in hands'.
Table 4.7: Artificially generated sentences for [RDat] correctly identified as frozen
Class  # Total  # Extracted FIXED  % Extracted FIXED  # Correct number of arguments  % Correct number of arguments  # Correct arguments  % Correct arguments
C0 2 2 100,0% 2 100% 2 100%
C1PN 4 4 100% 4 100,0% 4 100,0%
CAN 181 178 98,3% 178 98,3% 178 98,3%
CPN 2 2 100,0% 2 100,0% 2 100,0%
CPPN 9 9 100,0% 9 100,0% 9 100,0%
TOTAL 198 195 98,0% 195 98,0% 195 98,0%
The [RDat] transformation obtained great results, as shown in Table 4.7. Every time the FIXED
dependency is extracted, it contains the expected arguments. One example of a failure due to STRING's
disambiguation issues is O João corta-lhe as vazas, which has no literal translation to English but means
to make someone's plans more difficult; here vazas is labeled as a verb, but in this context it is a noun.
Another STRING-related problem happens in the sentences O João não lhe largava a braguilha, lit: 'João
would not release his fly', and O João não lhe largava a porta, lit: 'João would not release his door'.
Their rules expect to find a CDIR[post](largava,braguilha), but instead a
MOD[post](largava,braguilha) is found.
Table 4.8: Artificially generated sentences for [PassSer] correctly identified as frozen
Class  # Total  # Extracted FIXED  % Extracted FIXED  # Correct number of arguments  % Correct number of arguments  # Correct arguments  % Correct arguments
C1 3 2 66,7% 2 66,7% 2 66,7%
C1PN 3 3 100,0% 3 100,0% 3 100,0%
CNP2 169 165 97,6% 164 97,0% 164 97,0%
CPPN 10 9 90,0% 8 80,0% 8 80,0%
TOTAL 185 179 96,8% 177 95,7% 177 95,7%
Table 4.9: Artificially generated sentences for [PassEstar] correctly identified as frozen
Class  # Total  # Extracted FIXED  % Extracted FIXED  # Correct number of arguments  % Correct number of arguments  # Correct arguments  % Correct arguments
C1 3 2 66,7% 2 66,7% 2 66,7%
C1PN 3 3 100,0% 3 100,0% 3 100,0%
CNP2 73 63 86,3% 62 84,9% 62 84,9%
CPPN 4 4 100,0% 3 75,0% 3 75,0%
TOTAL 83 72 86,7% 70 84,3% 70 84,3%
The passive transformation with both verbs ser and estar, 'to be', presented very good results, as
shown in Tables 4.8 and 4.9. Most issues are common to both types of passives.
Some errors are related to sentences containing the preposition por, 'by', such as Isso foi cortado pela
raiz, lit: 'That was cut off by the root', whose rule expects COMPL[post](#2,#3[surface:raiz]) and
instead finds MOD[post](#2,#3[surface:raiz]).
Other errors are related to the POS tagging performed by STRING. The rule for the sentence Isso foi
reduzido à expressão mais simples, lit: 'This was reduced to the simplest expression', expects a
compound adverbial expression as a modifier, MOD[post](reduzido,expressão mais simples).
However, STRING breaks the expression down into two different modifiers, MOD[post](reduzido,expressão)
and MOD[post](expressão,simples). This prevents the FIXED dependency extraction for this rule.
It is possible to observe that the system was very successful in detecting the sentences automatically
generated from the base sentences' description by applying the transformations authorised by each
construction, having achieved above 93% recall. The difference between criteria for these sentences
is much smaller than for the manually produced sentences. After the final round of manual verification,
the sentences knowingly left unrecognized often have unsolvable problems, related to word
disambiguation and POS tagging. One important remark is that this is the first time such a number of
artificially generated sentences has been evaluated, and the obtained results were very satisfactory:
there are no small recall values, and the average recall for this type of sentences is 93%.
As a final experiment, a set of non-fixed sentences was manually produced. This was done by randomly
selecting fixed sentences and deforming them, removing some of their fixed complements, in order to
check whether the system would identify them as non-fixed. From a set of 513 sentences, 434 were
detected as non-fixed. The remaining 79 failures are probably due to the deformed sentence still being
too similar to the fixed one, or to the fact that the rules do not contain enough restrictions for a
complement or determinant.
4.4 Previous solution vs. Developed solution
After obtaining all the results from this system, it was deemed interesting to compare them against
the results that would be obtained for the same corpus using the previously existing system. Notice
that the previous system also produced the XIP rules from the lexicon-grammar matrix, even if it had
been developed at an earlier stage of the linguistic description, namely for a slightly smaller (yet
similar) set of frozen sentences. That set contained 2,520 frozen sentences, against the 2,542 current
base sentences, and 3,715 sentences when the base sentences are joined with the automatically generated
ones. Although there was no significant increase in the number of sentences, the criteria for belonging
to a certain class became more and more specific, and the description of each class was perfected over
time.
Doing so yielded the results seen in Tables 4.10 and 4.11. Due to the low percentage of identified
transformed sentences in the previous system, the sentences were merely grouped into manually
produced sentences and artificially generated ones, instead of detailing each transformation separately.
The total number of sentences used to calculate the recall is 2,542 for the sentences present in the
matrix, and 1,173 for the artificially generated sentences.
Figures 4.1, 4.2 and 4.3 compare the two systems in several ways, described below. The blue line
corresponds to the developed system, and the green line to the previously existing one.
Starting with Figure 4.1, it shows a comparison between the percentage of manually produced sentences
identified by each system, according to the three criteria presented above.
Table 4.10: Number of manually produced sentences identified as frozen according to the defined criteria, for both
systems.
Criteria  # Previous system  % Previous system  # Developed system  % Developed system  # Difference  % Difference
Fixed dependency 2,149 84% 2,430 96% 281 12%
Same number of arguments 1,892 75% 2,401 94% 501 19%
Exact arguments 1,796 74% 2,337 92% 541 18%
Figure 4.1: Comparing the performance (recall) of the developed system against the performance of the previous
one for the manually produced sentences.
There are two details that can be observed immediately:
• As the criteria grow more specific, there is a small decrease in recall for the developed system,
and a larger decrease for the previous one;
• The developed system maintains higher values on all criteria.
Reading the data in Figure 4.1, it can be observed that the two systems start at very different thresholds,
with a 12% difference between them and the developed system in the lead. The developed system then
keeps a comfortable margin when verifying the number of arguments, as well as their correspondence
to what was expected. This shows a consistent system, with small variations between criteria: the
difference between the recall of sentences from which the FIXED dependency was extracted with the
expected number of arguments and the recall of sentences from which it was extracted with the exact
arguments is around 4%, whilst for the previous system it is 10%. This means that, for the developed
system, not only is there a higher probability for the FIXED dependency to be extracted, but it is also
very likely to contain the correct arguments, or at least the correct number of arguments.
One of the most important things to take into consideration here is that not only was the system
extended to identify a wider range of sentences and their transformations, but its performance also
improved greatly when compared to the previous one. The fact that the correct identification of
arguments is also evaluated makes the results more trustworthy.
As for Figure 4.2, it shows a comparison between the percentage of artificially generated sentences
identified by both systems.
Table 4.11: Number of artificially generated sentences identified as frozen according to the defined criteria, for
both systems.
Criteria  # Previous system  % Previous system  # Developed system  % Developed system  # Difference  % Difference
Fixed dependency 257 22% 1095 93% 838 71%
Same number of arguments 241 21% 1090 93% 849 72%
Exact arguments 240 20% 1090 93% 850 72%
Figure 4.2: Comparing the performance of the developed system against the performance of the previous one for
the artificially generated sentences.
Analyzing this figure, it is possible to observe two important aspects. The first is that the difference
between criteria is not as accentuated as for the manually produced sentences; still, the two systems
react in different ways to the tightening of the criteria. For the developed system, there is no variation
between the percentage of extracted FIXED dependencies and the percentage of FIXED dependencies with
correct arguments. As for the previous system, its percentage of detected frozen sentences decreases
as the criteria grow more specific, although not very significantly. The second, and most important,
aspect is that the previous system clearly had a handicap in detecting sentences resulting from
transformations, with a 71% difference between the two systems on the criterion the previous system
was able to evaluate when it was developed. This happens because, while the previous system detects
some transformations, others had not yet been treated. Therefore, the developed system not only
improved the detection of sentences containing transformations, but also expanded the types of
transformations treated by the system.
One final experiment consisted in joining the two sets for both systems, and the obtained results are
shown in Table 4.12:
Table 4.12: Number of sentences (manually and artificially generated) identified as frozen according to the defined
criteria, for both systems.
Criteria  # Previous system  % Previous system  # Developed system  % Developed system  # Difference  % Difference
Fixed dependency 2,389 64% 3,518 95% 1129 31%
Same number of arguments 2,147 58% 3,484 94% 1337 36%
Exact arguments 2,121 57% 3,420 92% 1299 35%
The differences between the systems can be visualized in Figure 4.3.
Figure 4.3: Comparing the performance of the developed system against the performance of the previous one for
the artificially and manually generated sentences.
The difference is clearly visible in this last graph. The gap between the systems is quite noticeable,
reaching a 35% difference on the criterion "Exact arguments", the most specific one, with the developed
system maintaining a clear advantage. Although both systems lose performance as the criteria become
more specific, that decrease is not very substantial in either of them. It is very important to underline
that the developed system presents new ways to evaluate its performance, not only by finding the FIXED
dependency but also by asserting the correct number of arguments and the correct arguments.
As a final note, it should be underlined that a very careful analysis of the failures found during the
development of this system was performed, as had been done for the previous one. This allowed for
the correction of multiple errors in both STRING and the system, and allowed its performance to
improve greatly, which in turn makes the system more reliable.
Chapter 5
Conclusions
This project aimed at improving the processing of frozen sentences, that is, multiword verbal idioms,
in the STRING system. The XIP module, responsible for detecting them, uses rules created by an
existing system, which presented some weaknesses, particularly when detecting sentences resulting
from transformations of the sentences' base form. This work contributed to improving this detection in
the following manner:
• A new module was built that automatically generates sentences by applying a set of transformations
to the base sentences;
• The rule generator was rewritten in order to accommodate the transformations that can be applied
to the sentences;
• A new module was built that automatically validates the output of the examples, comparing it
against what was expected.
Generally speaking, this work contributed to improving the overall performance of the STRING system.
It did so by greatly improving the detection of sentences with transformations, as well as by introducing
a more thorough way of evaluating every sentence. However, a more in-depth manual validation of
both the generated sentences and the generated rules is still to be performed.
Another factor that was improved was the system's speed. The main contributor to this is the pipeline
script that automated the process of generating the rules, integrating them into the STRING system and
validating the results of running both the manually and the artificially generated sentences. Finally,
some errors were detected in the matrix while developing the rule generation. Therefore, this work also
helped improve the consistency of the lexicon-grammar, clarifying the meaning of some properties
encoded there, as well as validating the values present there.
5.1 Future work
As for future work, the following items are suggested, in order to continue improving the system, as
well as further evaluating its performance:
• Generate other types of transformations from the matrix description, like [Pass-se] or the sym-
metric construction;
• Build a golden collection from the corpus LE-PAROLE;
• Calculate both precision and recall on that same corpus, containing frozen and non-frozen
expressions. Precision corresponds to the proportion of positive identifications that are actually correct,
that is, the proportion of sentences identified as frozen that are actually frozen. Recall is the proportion
of actual positives that were identified correctly. Recall is already being calculated on the set of frozen
sentences used in this work, where there can be no false positives; on a more diverse corpus there
would be false positives as well, precision would become meaningful, and the results would be
different;
• Indicate, in the final report containing the results of the validation, the reason for the failure, when
there is one;
• Write, in the matrix, the result of the evaluation and the cause for a failure, when there is one, in
order to automate the error detection, and avoiding manual validation. This will ease the error
correction process;
• During the next evaluation iteration, compare its results with the results from the current version,
and underline differences;
• Generate, in an automatic way, sentences in which one of the frozen elements is missing and
which are, therefore, not fixed.
Processing a corpus such as the European Portuguese annotated corpus built in the scope of the
PARSEME project, which contains both frozen and non-frozen expressions, would be challenging for
the system. Frozen sentences may present themselves in the most varied ways, mixed with other
expressions, and it would be interesting to extrinsically evaluate the system on such data, especially
given that, up until now, the system has only been tested with texts containing only frozen sentences,
and has therefore only been intrinsically evaluated. This would allow for a more comprehensive
understanding of how well the system would behave in the real world, where sentences may not appear
as expected, or may appear as transformations of their base sentences.
In terms of what can be done in the developed code, all the conversions between a value of the matrix
and the XIP code should be described in a declarative way, such as a table or a dictionary. Coordination
could also be accepted for a constituent, as well as the POS PREDSUBJ and a prefix representing a
container (medidas, lit: 'measures').
References
[1] BAPTISTA, JORGE. 2005. Construções simétricas: argumentos e complementos. Pages 353–367 of:
FIGUEIREDO, O; RIO-TORTO, GRAÇA & SILVA, F. (eds), Volume de homenagem ao Prof. Mário Vilela.
Fac.Letras-U.Porto.
[2] BAPTISTA, JORGE & MAMEDE, NUNO. 2016. Nomenclature of chunks and dependencies in Portuguese
XIP Grammar 4.6. Technical Report. L2F-Spoken Language Laboratory, INESC-ID Lisboa, Lisboa.
[3] BAPTISTA, JORGE; CORREIA, ANABELA & FERNANDES, GRAÇA. Frozen Sentences of Portuguese:
Formal Descriptions for NLP. Pages 72–79 of: Workshop on Multiword Expressions: Integrating Process-
ing. Barcelona, Spain: ACL, for International Conference of the European Chapter of the Association
for Computational Linguistics.
[4] BAPTISTA, JORGE; FERNANDES, GRAÇA; TALHADAS, RUI; DIAS, FRANCISCO & MAMEDE, NUNO.
Implementing European Portuguese Verbal Idioms in a Natural Language Processing System. Pages
102 – 115 of: CORPAS PASTOR, G. (ED.) (ed), Computerised and Corpus-based Approaches to Phraseology:
Monolingual and Multilingual Perspectives/Fraseología computacional y basada en corpus: perspectivas mono-
lingües y multilingües, Proceedings of Conference of the European Society of Phraseology (EuroPhras 2015).
Málaga, Spain: Editions Tradulex, Geneva.
[5] BAPTISTA, JORGE; MAMEDE, NUNO & MARKOV., ILIA. 2014. Integrating verbal idioms into an
NLP system. Pages 251–256 of: BAPTISTA, JORGE; MAMEDE, NUNO; CANDEIAS, SARA; PARABONI,
IVANDRÉ; PARDO, THIAGO & DAS GRAÇAS VOLPE NUNES, MARIA (eds), Computational Processing of
the Portuguese Language. Lecture Notes in Computer Science / Lecture Notes in Artificial Intelligence,
vol. 8775. Berlin: Springer, for 11th International Conference PROPOR’2014, São Carlos – SP, Brazil,
October 8-10, 2014.
[6] CONSTANT, MATHIEU; ERYIGIT, GÜLSEN; MONTI, JOHANNA; VAN DER PLAS, LONNEKE;
RAMISCH, CARLOS; ROSNER, MICHAEL & TODIRASCU, AMALIA. 2017. Multiword Expression Pro-
cessing: A Survey. Computational Linguistics, 43(4), 837–892.
[7] DINIZ, CLÁUDIO. 2010. RuDriCo2 : Um Conversor Baseado em Regras de Transformação Declarativas.
Master thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa.
[8] DINIZ, CLÁUDIO; MAMEDE, NUNO & PEREIRA, JOÃO. 2010. RuDriCo2: A faster disambiguator
and segmentation modifier. Pages 573–584 of: Simpósio de Informática - INForum.
[9] GROSS, MAURICE. 1982. Une classification des phrases "figées" du français. Revue Québécoise de
Linguistique, 11(2), 151–185.
[10] GROSS, MAURICE. 1996. Lexicon-Grammar. Pages 244–259 of: BROWN, KEITH & MILLER, J. (eds),
Concise Encyclopedia of Syntactic Theories. Cambridge: Pergamon.
[11] MAMEDE, NUNO; BAPTISTA, JORGE; CABARRÃO, VERA & DINIZ, CLÁUDIO. 2012. STRING: An
Hybrid Statistical and Rule-based Natural Language Processing Chain for Portuguese. In: Interna-
tional Conference on Computational Processing of Portuguese (PROPOR 2012), vol. Demo Session.
[12] MARTINS, R. T.; HASEGAWA, R.; NUNES, M. G. V.; MONTILHA, G. & OLIVEIRA, O. N. 1998.
Linguistic issues in the development of REGRA: a grammar checker for Brazilian Portuguese. Natural
Language Engineering, 4(4), 287–307.
[13] AIT-MOKHTAR, SALAH; CHANOD, JEAN-PIERRE & ROUX, CLAUDE. 2002. Robustness beyond
shallowness: incremental deep parsing. Natural Language Engineering, 8(2/3), 121–144.
[14] THE DOCUMENT COMPANY XEROX & XEROX RESEARCH CENTRE EUROPE. 2007a. Xerox Incremen-
tal Parser Reference Guide.
[15] THE DOCUMENT COMPANY XEROX & XEROX RESEARCH CENTRE EUROPE. 2007b. Xerox Incremen-
tal Parser User’s Guide.
[16] VICENTE, ALEXANDRE. 2013. LexMan: um Segmentador e Analisador Morfológico com Transdutores.
Master thesis, Instituto Superior Técnico, Universidade de Lisboa.
Appendix A
Conversion to XIP rules
In this annex, Tables A.1 to A.14 are presented, showing the restrictions imposed
by each class, their instantiation, and the corresponding XIP rule.
Table A.1: XIP Rule restrictions and instantiation for the class C1 and the example O João abanou o capacete
C1 - O João abanou o capacete lit: ‘João shook the helmet’, ‘to
dance’
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(abanou, João)
N1=N-hum CDIR(#2,#3) CDIR(abanou, capacete)
Det1 DETD(#3,?) DETD(capacete, o)
The XIP Rule for the example of Table A.1 is:
if (VDOMAIN(#?,#2[lemma:abanar]) &
CDIR[post](#2,#3[surface:capacete]) &
DETD(#3,?[surface:o])
)
FIXED(#2,#3)
Table A.2: XIP Rule restrictions and instantiation for the class CDN and the example O Rui sondou a opinião da Inês
CDN - O Rui sondou a opinião da Inês lit: ‘Rui sounded Inês’
opinion’, ‘to try to find out one’s opinion’.
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(sondou,Rui)
N1=N-hum CDIR(#2,#3) CDIR(sondou,opinião)
N2=Nhum MOD[post](#3,#4) MOD[post](opinião,Inês)
Det1 DETD(#3,?) DETD(opinião,a)
The XIP Rule for the example of Table A.2 is:
if ( VDOMAIN(#?,#2[lemma:sondar]) &
CDIR[post](#2,#3[surface:opinião]) &
MOD[post](#3,#4[UMB-Human])&
PREPD(#4,?[surface:de])
)
FIXED(#2,#3)
Given that this sentence allows for the [PronPos] transformation, the rule becomes the following:
if ( VDOMAIN(#?,#2[lemma:sondar]) &
CDIR[post](#2,#3[surface:opinião]) &
DETD(#3,?[surface:a]) &
( ( MOD[post](#3,#4[UMB-Human]) &
PREPD(#4,?[surface:de]) )
|| POSS(#3,?) )
)
FIXED(#2, #3)
Table A.3: XIP Rule restrictions and instantiation for the class CAN and the example O João matou a fome do Pedro.
CAN - O João matou a fome do Pedro lit: ‘João killed Pedro’s
hunger’
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(matou,João)
N1=Nhum CDIR(#2,#3) CDIR(matou,fome)
N2=Nhum MOD[post](#3,#4) MOD[post](fome,Pedro)
Det1 DETD(#3,?) DETD(fome,a)
Prep2 PREPD(#4,?) PREPD(fome,de)
The XIP Rule for the example of Table A.3 is:
if ( VDOMAIN(#?,#2[lemma:matar]) &
CDIR[post](#2,#3[surface:fome]) &
DETD(#3,?[surface:a]) & MOD[post](#3,#4[UMB-Human]) &
PREPD(#4,?[surface:de])
)
FIXED(#2,#3)
Given that this sentence allows for the [PronPos] and [Rdat] transformation, the rule becomes the
following:
if ( VDOMAIN(#?,#2[lemma:matar]) &
CDIR[post](#2,#3[surface:fome]) &
DETD(#3,?[surface:a]) &
( ( ( MOD[post](#3,#4[UMB-Human]) &
PREPD(#4,?[surface:de]) )
|| CLITIC(#2,?[dat]) )
|| POSS(#3,?) )
)
FIXED(#2, #3)
Table A.4: XIP Rule restrictions and instantiation for the class CNP2 and the example O Rui cortou o problema pela
base.
CNP2 - O Rui cortou o problema pela base lit: ‘Rui cut the problem
at its root’
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(cortou,Rui)
N1=N-hum CDIR(#2,#3) CDIR(cortou,problema)
Det2 DETD(#4,?) DETD(base,a)
C2 MOD[post](#3,#4) MOD[post](problema,base)
Prep2 PREPD(#4,?) PREPD(base,por)
The XIP Rule for the example of Table A.4 is:
if ( VDOMAIN(#?,#2[lemma:cortar]) &
CDIR[post](#2,#3[UMB-Human]) &
MOD[post](#2,#4[surface:base]) &
PREPD(#4,?[surface:por]) &
DETD(#4,?[surface:a])
)
FIXED(#2,#4)
Given that this sentence allows for the [PronA] transformation, the rule becomes the following:
if ( VDOMAIN(#?,#2[lemma:cortar]) &
( CDIR[post](#2,#3[UMB-Human]) || CLITIC(#2,#3[acc]) ) &
MOD[post](#2,#4[surface:base]) &
PREPD(#4,?[surface:por]) &
DETD(#4,?[surface:a])
)
FIXED(#2,#4)
Table A.5: XIP Rule restrictions and instantiation for the class C1PN and the example A Rita afiou os dentes ao
dinheiro.
C1PN - A Rita afiou os dentes ao dinheiro lit: ‘Rita sharpened her
teeth to the money’ ‘to be greedy’
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(afiou,Rita)
N1=N-hum CDIR(#2,#3) CDIR(afiou,dentes)
N2=N-hum MOD[post](#3,#4) MOD[post](dentes,dinheiro)
Det1 DETD(#3,?) DETD(dentes,os)
The XIP Rule for the example of Table A.5 is:
if ( VDOMAIN(#?,#2[lemma:afiar]) &
CDIR[post](#2,#3[surface:dentes]) &
DETD(#3,?[surface:os]) &
MOD[post](#2,#4[UMB-Human]) &
PREPD(#4,?[surface:a])
)
FIXED(#2,#3)
Table A.6: XIP Rule restrictions and instantiation for the class C1P2 and the example O casaco custou os olhos da
cara do Rui.
C1P2 - O casaco custou os olhos da cara do Rui lit: ‘The coat
cost the eyes of Rui’s face’ ‘to be very expensive’
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(custou,casaco)
N1=N-hum CDIR(#2,#3) CDIR(custou,olhos)
Det1 DETD(#3,?) DETD(olhos,os)
Det2 DETD(#4,?) DETD(cara,a)
C2 MOD[post](#3,#4) MOD[post](olhos,cara)
Prep2 PREPD(#4,?) PREPD(cara,de)
The XIP Rule for the example of Table A.6 is:
if ( VDOMAIN(#?,#2[lemma:custar]) &
CDIR[post](#2,#3[surface:olhos]) &
DETD(#3,?[surface:os]) &
MOD[post](#3,#4[surface:cara]) &
PREPD(#4,?[surface:de]) &
DETD(#4,?[surface:a])
)
FIXED(#2,#3,#4)
Table A.7: XIP Rule restrictions and instantiation for the class CPPN and the example O João comprou gato por
lebre ao Pedro.
CPPN - O João comprou gato por lebre ao Pedro lit: ‘João bought cat
for hare from Pedro’, ‘to be duped’.
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(comprou,João)
N1=N-hum CDIR(#2,#3) CDIR(comprou,gato)
Prep2 PREPD(#4,?) PREPD(lebre,por)
C2 MOD[post](#3,#4) MOD[post](gato,lebre)
Prep3 PREPD(#5,?[surface:a]) PREPD(Pedro,a)
N3 = Nhum MOD[post](#2,#5[UMB-Human]) MOD[post](comprou,Pedro)
The XIP Rule for the example of Table A.7 is:
if ( VDOMAIN(#?,#2[lemma:comprar]) &
CDIR[post](#2,#3[surface:gato]) &
MOD[post](#2,#4[surface:lebre]) &
PREPD(#4,?[surface:por]) &
MOD[post](#2,#5[UMB-Human]) &
PREPD(#5,?[surface:a])
)
FIXED(#2,#3,#4)
Table A.8: XIP Rule restrictions and instantiation for the class CPP and the example O Zé bate com o nariz na porta
lit: ‘Zé hit with his nose on the door’.
CPP - O Zé bate com o nariz na porta lit: ‘Zé hit with his nose on
the door’, ‘finding a place to be closed or not achieving something’
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(bate,Zé)
N1=N-hum CDIR(#2,#3) CDIR(bate,nariz)
Prep1 PREPD(#3,?) PREPD(nariz,com)
Det1 DETD(#3,?) DETD(nariz, o)
Prep2 PREPD(#4,?) PREPD(porta,em)
Det2 DETD(#4,?) DETD(porta,a)
C2 MOD[post](#3,#4) MOD[post](nariz, porta)
The XIP Rule for the example of Table A.8 is:
if ( VDOMAIN(#?,#2[lemma:bater]) &
MOD[post](#2,#3[surface:nariz]) &
PREPD(#3,?[surface:com]) &
DETD(#3,?[surface:o]) &
MOD[post](#2,#4[surface:porta]) &
PREPD(#4,?[surface:em]) &
DETD(#4,?[surface:a])
)
FIXED(#2,#3,#4)
Table A.9: XIP Rule restrictions and instantiation for the class CP1 and the example O Zé bateu em retirada.
CP1 - O Zé bateu em retirada lit: ‘Zé has withdrawn’ ‘to run away’
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(bateu,Zé)
Prep1 PREPD(#3,?) PREPD(retirada,em)
C1 MOD[post](#2,#3) MOD[post](bateu,retirada)
The XIP Rule for the example of Table A.9 is:
if ( VDOMAIN(#?,#2[lemma:bater]) &
MOD[post](#2,#3[surface:retirada]) &
PREPD(#3,?[surface:em])
)
FIXED(#2,#3)
Table A.10: XIP Rule restrictions and instantiation for the class CPN and the example O Zé desceu na consideração
da Ana.
CPN - O Zé desceu na consideração da Ana lit: ‘Zé went down on
Ana’s consideration’.
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(desceu,Zé)
N2=Nhum MOD[post](#3,#4) MOD[post](consideração,Ana)
Prep1 PREPD(#3,?) PREPD(consideração,em)
Det1 DETD(#3,?) DETD(consideração,a)
C1 MOD[post](#2,#3) MOD[post](desceu,consideração)
The XIP Rule for the example of Table A.10 is:
if ( VDOMAIN(#?,#2[lemma:descer]) &
MOD[post](#2,#3[surface:consideração]) &
PREPD(#3,?[surface:em]) &
DETD(#3,?[surface:a]) &
MOD[post](#3,#4[UMB-Human]) &
PREPD(#4,?[surface:de])
)
FIXED(#2,#3)
Given that this sentence allows the [PronD] transformation, the rule becomes:
if ( VDOMAIN(#?,#2[lemma:descer]) &
MOD[post](#2,#3[surface:consideração]) &
PREPD(#3,?[surface:em]) &
DETD(#3,?[surface:a]) &
( ( MOD[post](#3,#4[UMB-Human]) &
PREPD(#4,?[surface:de]) )
|| POSS(#3,?) )
)
FIXED(#2,#3)
Table A.11: XIP Rule restrictions and instantiation for the class C0 and the example A sorte bateu à porta do Pedro.
C0 - A sorte bateu à porta do Pedro lit: ‘Luck knocked on Pedro’s
door’.
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=N-hum SUBJ(#2,#1) SUBJ(bateu,sorte)
N2=Nhum MOD[post](#3,#4) MOD[post](porta,Pedro)
Prep1 PREPD(#3,?) PREPD(porta,a)
Det1 DETD(#3,?) DETD(porta,a)
C1 MOD[post](#2,#3) MOD[post](bateu,porta)
The XIP Rule for the example of Table A.11 is:
if ( VDOMAIN(#?,#2[lemma:bater]) &
SUBJ(#2,#1[surface:sorte]) &
MOD[post](#2,#3[surface:porta]) &
PREPD(#3,?[surface:a]) &
DETD(#3,?[surface:a]) &
MOD[post](#3,#4[UMB-Human]) &
PREPD(#4,?[surface:de])
)
FIXED(#2,#1,#3)
This sentence allows for two transformations to be applied to it, [PronPos] and [PronD]. Considering
these two transformations, the rule becomes:
if ( VDOMAIN(#?,#2[lemma:bater]) &
SUBJ(#2,#1[surface:sorte]) &
MOD[post](#2,#3[surface:porta]) &
PREPD(#3,?[surface:a]) &
DETD(#3,?[surface:a]) &
( ( ( MOD[post](#3,#4[UMB-Human]) &
PREPD(#4,?[surface:de]) )
|| CLITIC(#2,?[dat]) )
|| POSS(#3,?) )
)
FIXED(#2,#1,#3)
Table A.12: XIP Rule restrictions and instantiation for the class C0E and the example Vai pentear macacos!.
C0E - Vai pentear macacos! lit: ‘Go comb monkeys!’, ‘do not
bother me/anyone anymore’.
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N1=N-hum CDIR(#2,#3) CDIR(pentear,macacos)
Vc VLINK(#2,#3) VLINK(vai, pentear)
The XIP Rule for the example of Table A.12 is:
if ( VLINK(#2[lemma:ir],#3[lemma:pentear]) &
CDIR[post](#3,#4[surface:macacos])
)
FIXED(#2,#3,#4)
Table A.13: XIP Rule restrictions and instantiation for the class CADV and the example O Pedro não nasceu ontem.
CADV - O Pedro não nasceu ontem lit: ‘Pedro was not born yes-
terday’, ‘is not dumb’.
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(nasceu,Pedro)
NegObrig MOD[neg,pre](#2,#3) MOD[neg](nasceu,não)
Adv1 MOD[post](#2,#4[adv,surface:ontem]) MOD(nasceu,ontem)
The XIP Rule for the example of Table A.13 is:
if ( VDOMAIN(#?,#2[lemma:nascer]) &
MOD[neg,pre](#2,#3) &
MOD[post](#2,#4[adv,surface:ontem])
)
FIXED(#3, #2, #4)
Table A.14: XIP Rule restrictions and instantiation for the class CV and the example A resposta não se fez esperar.
CV - A resposta não se fez esperar lit: ‘The answer did not take
long to arrive’.
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=N-hum SUBJ(#2,#1) SUBJ(fez,resposta)
NegObrig MOD[neg,pre](#2,?) MOD[neg](fez,não)
Vc VLINK(#2,#3) VLINK(fez,esperar)
Vse CLITIC(#2,?) CLITIC(esperar,se)
The XIP Rule for the example of Table A.14 is:
if ( VLINK(#2[lemma:fazer],#3[lemma:esperar]) &
MOD[neg,pre](#3,#4) &
CLITIC(#3,#5[ref])
)
FIXED(#4, #2, #3, #5)
Appendix B
Readme of the program
+-------------+
| XIPIFICATOR |
+-------------+
2014
2019
WHAT IS IT?
================
This is a Python application that automatically generates rules for detecting
frozen expressions. It also generates examples containing the transformations
applicable to certain sentences. Finally, there is a validator that runs the
example sentences for each rule through STRING, extracts the FIXED dependency,
and checks whether the result meets one of three criteria:
• The FIXED dependency was extracted correctly;
• The FIXED dependency was extracted correctly, with the expected number of
arguments;
• The FIXED dependency was extracted correctly, with exactly the expected
arguments.
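The three criteria above form a graded check. A minimal sketch (the function and argument names are illustrative, not the actual validator code):

```python
def validation_level(extracted, expected):
    """Return the strictest criterion met by an extracted FIXED dependency.

    extracted/expected are lists of argument strings, or None when the
    dependency was not extracted at all.
    """
    if extracted is None:
        return 0                      # FIXED not extracted
    if extracted == expected:
        return 3                      # exact arguments match
    if len(extracted) == len(expected):
        return 2                      # right number of arguments
    return 1                          # dependency extracted at all

print(validation_level(["matou", "fome"], ["matou", "fome"]))  # 3
```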
HOW TO USE?
================
Rule generator:
python3 bin/xipificator.py -file=FixedExpressions-v9.13.xlsx -sheet=FINAL >
dependencyFPhrase.xip
Copy the generated rules into the XIP dependencies, replacing the previously
existing ones:
cp dependencyFPhrase.xip ../xip/ptGram/DEPENDENCIES/
Process the manually produced sentences with STRING:
cat "examples/validate.txt"| ../xip/./string.sh -f -tr -indent -xml > normal.xml
Process the automatically generated (transformed) sentences with STRING:
cat "examples/examplesPronA.txt"| ../xip/./string.sh -f -tr -indent -xml >
generatedPronA.xml
cat "examples/examplesPronD.txt"| ../xip/./string.sh -f -tr -indent -xml >
generatedPronD.xml
cat "examples/examplesPronP.txt"| ../xip/./string.sh -f -tr -indent -xml >
generatedPronP.xml
cat "examples/examplesPronR.txt"| ../xip/./string.sh -f -tr -indent -xml >
generatedPronR.xml
cat "examples/examplesPassSer.txt"| ../xip/./string.sh -f -tr -indent -xml
> generatedPassSer.xml
cat "examples/examplesPassEstar.txt"| ../xip/./string.sh -f -tr -indent -xml
> generatedPassEstar.xml
cat "examples/examplesRdat.txt"| ../xip/./string.sh -f -tr -indent -xml > generatedRDat.xml
Validate the results obtained from STRING:
python3 bin/xipificator_validate.py
HOW TO RUN EVERYTHING SEQUENTIALLY?
=============================================
./executeXipificator.sh
STRUCTURE OF THE INPUT FILE, XLSX OR CSV
=============================================
Rule files consist of a header (first line) and a set of frozen-expression
rules, one expression per line.
Each column contains one element of the rule (a flag, a lemma or word, a rule,
an example, ...).
The header contains the column names. The columns may appear in any order; in
that case the pattern must be known and identified in the code (see the
patterns matrix).
Using predefined names for each column, the application can determine the
column pattern automatically (using the -pattern=AUTO parameter).
The predefined names for each column are shown below.
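The automatic pattern detection described above can be sketched as follows (a hypothetical helper, not the actual xipificator code; the set of known names is abbreviated):

```python
# Hypothetical sketch of -pattern=AUTO column detection: map each predefined
# header name found in the first row of the sheet to its column index.
def detect_pattern(header_row, known_names=("V", "N0", "C1", "Prep1", "Det1")):
    pattern = {}
    for idx, name in enumerate(header_row):
        if name in known_names:
            pattern[name] = idx
    return pattern

print(detect_pattern(["N0", "V", "Det1", "C1", "Exemplo"]))
```

Columns with unknown names are simply skipped, which is why the predefined names must be used for auto-detection to work.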
Subject and Verb:
-----------
N0 = Nhum : The subject is a human noun (marks a flag);
N0 = N-hum : The (free) subject is a non-human noun (marks a flag);
VSe : The verb must be accompanied by a clitic;
NegObrig : The verb must be accompanied by a negative adverb or a negative expression;
V : Verb (surface form or lemma);
PrepLink : Preposition linking the first verb of the construction to a second one.
Modifiers:
---------
$ is the number of the modifier dependency, starting at 1
C$ : Head of the modifier chunk;
Prep$ : Preposition (Prep1 is the preposition of C1);
Det$ : Determiner;
Modif$E : Pre-modifier of C$;
Modif$D : Post-modifier of C$;
Adj$ : Adjective modifying C$;
N$ = Nhum : The word at the head of chunk $ is a human noun;
N$ = N-hum : The word at the head of chunk $ is a non-human noun;
AttachV$ : By default, a modifier N+1 has a dependency on the previous modifier
N. Marking this cell (+) creates a dependency on the verb instead of on the
previous modifier;
C$Manual : Manual XIP rule for the whole modifier $. It overrides the automatically
generated rule. Useful for rules with exceptional representations;
C$ModManual : XIP rule for the modifier of C$.
Columns indicating the transformations that can be applied to the construction:
-------------------------------------------------------------
[PronR$] : If marked with '+', this column indicates that the free phrase N1
can be reduced to a reflexive pronoun (e.g. "O Pedro entregou tudo nas mãos
de Deus" becomes "O Pedro entrega-se nas mãos de Deus");
[PronD$] : If marked with '+', this column indicates that the complement N$ is
distributionally free and can be reduced to a dative pronoun (e.g. "O Pedro
tirou o chapéu ao João" becomes "O Pedro tirou-lhe o chapéu");
[PronA$] : The free phrase N$ can be reduced to an accusative pronoun (e.g.
"O João viu a Inês pelo canto do olho" becomes "O João viu-a pelo canto do
olho");
[PronPos$] : If marked with '+', this column indicates that the prepositional
phrase "de N$" can be reduced to a possessive pronoun (e.g. "O Zé fala nas
costas da Ana" becomes "O Zé fala nas suas costas");
[Pass-ser] : If marked with '+', this column indicates that the sentence can
be turned into the passive form, the copulative verb accepted in this form
being ser (e.g. "A imprensa abafou um escândalo" becomes "Um escândalo foi
abafado pela imprensa");
[Pass-estar] : If marked with '+', this column indicates that the sentence can
be turned into the passive form, the copulative verb accepted in this form
being estar (e.g. "A imprensa abafou um escândalo" becomes "Um escândalo está
abafado pela imprensa");
[Rdat$] : If marked with '+', the Rdat operation applies to determinative noun
complements de_Nhum, restructuring the larger constituent of which de_Nhum is
part into two complements, namely turning de_Nhum into a_Nhum and attaching it
directly to the verb. The latter can then be pronominalized (a_Nhum => -lhe)
(e.g. "O Pedro come as papas na cabeça da Ana" becomes "O Pedro come-lhe as
papas na cabeça");
Sim$ : If marked with '+', this column indicates that two constituents of this
construction can be coordinated in a given syntactic position (symmetric
subjects or symmetric complements) and can swap places without changing the
overall meaning of the sentence (e.g. "A Isabel juntou os trapinhos com o
Luís" is equivalent to "O Luís juntou os trapinhos com a Isabel").
Other columns:
----------
AllManual : Optional. Manual code. If marked (+), the content of the ‘Manual‘
cell contains the code for this expression;
Manual : Optional. XIP code for this expression (if AllManual is marked
with (+));
Expected : Optional. Result expected in the XIP dependency list for this
expression;
Exemplo : Example of the use of this expression. This example is used as the
test sentence by the validator;
Falha : Used to mark the cause of the expression's validation error. If empty,
it is assumed that there is no error. An error can be marked with the pattern
<CODE>:<WORD OR EXPRESSION WHERE IT OCCURS>. Example: ' P:casa (between noun
and verb) '. Rules that test only expressions with an error code can be
generated using the -f (falha) parameter. If marked with '?', the error is
assumed to be still unknown. Rules for expressions marked with this code can
be generated using the -d (dúvida) parameter;
Normalized : A set of predicates is paired with a generic verb (e.g. "bater
as botas" pairs with "morrer").
CELL SYNTAX
=====================
Lemmas and Surface Forms
-------------
By default, a word in a cell indicates the surface form of a word. To indicate
a lemma, the word must be surrounded by < >. Example: palavra indicates a
surface form, <palavra> indicates a lemma.
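This cell convention can be sketched with a small parser (a hypothetical helper, not the actual xipificator code):

```python
def parse_cell(cell):
    """Classify a matrix cell as a lemma (<word>) or a surface form (word)."""
    cell = cell.strip()
    if cell.startswith("<") and cell.endswith(">"):
        return ("lemma", cell[1:-1])
    return ("surface", cell)

print(parse_cell("<abanar>"))  # ('lemma', 'abanar')
print(parse_cell("capacete"))  # ('surface', 'capacete')
```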
Pos:
---
By default, each rule column has a predefined dependency type or POS.
(Example: Det1 generates determiners, POS-MOD generates modifiers.) The POS of
the word linked by that dependency can be changed using a prefix on the word.
The POS can be defined in two ways:
• <POS>, when it applies to any word in that entry
• POS:abc, when it applies to the word 'abc'
Example 1: To indicate that the entry has a determiner, write <DET>.
Example 2: To indicate that the word 'abc' is an adjective, add the prefix A,
so the entry takes the form A:abc. In the case of a lemma, the entry takes the
form A:<abc>.
They are defined in the code, in the pos matrix, which can be customized with
more entries.
By default, the xipificator has the following POS:
• DET+POS: Determiner and Possessive Pronoun
• PRON+POS: Possessive Pronoun
• PRON+PES: Personal Pronoun
• DET+DEM: Determiner and Demonstrative Pronoun
• ADV: Adverb
• A: Adjective
• DET: Determiner
• POSDET: Post-determiner
• Q: Ordinal, Cardinal or Quantity
• PREP: Preposition
• CONJ: Conjunction
Flags:
----
Flags allow adding specific features to each entry of the rule. They are
defined after a Pos, in the form <POS:FLAGS>.
Example: a feminine singular determiner and possessive pronoun is written in
the form <DET+POS:fs>.
They are defined in the code, in the flags matrix, which can be customized
with more entries.
By default, the xipificator has the following flags:
• s: singular
• p: plural
• m: masculine
• f: feminine
• O: oblique (pronoun)
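The <POS:FLAGS> cell format can be sketched as follows (a hypothetical helper with the default flag table; not the actual xipificator code):

```python
# Default flag codes, as listed above.
FLAGS = {"s": "singular", "p": "plural", "m": "masculine",
         "f": "feminine", "O": "oblique"}

def parse_pos_cell(cell):
    """Split a <POS:FLAGS> cell into the POS tag and its feature list."""
    inner = cell.strip("<>")
    pos, _, flags = inner.partition(":")
    return pos, [FLAGS[c] for c in flags]

print(parse_pos_cell("<DET+POS:fs>"))  # ('DET+POS', ['feminine', 'singular'])
```

A cell without flags, such as <DET>, yields an empty feature list.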
Prefixes:
------
Prefixes allow changing the type of the dependency link and/or adding special
features to words.
Example: 'retomar' should be indicated as the word (verb) 'tomar' with a
prefix (re-). To force the existence of that prefix, the entry must take the
form PFX:<tomar>.
They are defined in the code, in the prefix matrix, which can be customized
with more entries.
By default, the xipificator has the following prefixes:
• MOD: Modifier;
• CDIR: Direct Complement;
• CIND: Indirect Complement;
• FOC: Modifier with Focus;
• PREDSUBJ: Subject Predicative;
• PFX: Word with a prefix.
STRUCTURE OF THE GENERATED RULES
============================
The subject is marked as dependency #1 inside the rule. The verb is marked as
dependency #2. The following modifiers are marked as #3, #4, etc.
The representation of the subject is determined by:
• Any subject with the HUM and/or N-HUM flags, as indicated in the rules;
• A personal pronoun, using the dependency SUBJ(?,?[pers]);
• A relative pronoun, using the dependency QBOUNDARY.
The verb is defined by its VDOMAIN dependency. Modifiers are marked with a
CDIR dependency if they have no preposition, or with MOD if they have one.
Determiners, adjectives and pre/post-modifiers are likewise linked to the head
of the dependency.
If one of the modifiers contains an empty cell, the existence of a dependency
is accepted but optional.
If one of the modifiers contains the entry <E> (empty), it is considered that
no dependency of this type exists.
In this last case, the rule negates the existence of modifiers with MOD(?,?).
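The numbering scheme above can be sketched as follows (illustrative only; the real generator derives the count from the matrix columns):

```python
def assign_indices(n_modifiers):
    """Map the constituents of a rule to their XIP dependency indices:
    subject -> #1, verb -> #2, modifiers -> #3, #4, ..."""
    indices = {"SUBJ": "#1", "V": "#2"}
    for i in range(1, n_modifiers + 1):
        indices[f"C{i}"] = f"#{i + 2}"
    return indices

print(assign_indices(2))  # {'SUBJ': '#1', 'V': '#2', 'C1': '#3', 'C2': '#4'}
```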
HOW-TO
======
ADD A NEW POS OR PREFIX?
----------------------
1. Find the declaration of the pos or prefix matrix in xipificator.py;
2. Add a new row to the matrix;
3. The first column is the POS code, the second column holds the features the
entry will have (leave it blank when in doubt), and the third column is a
dependency link (by default it will be MOD[post] or CDIR[post]), in case it
has to be changed;
4. It is now possible to create entries of the form POS:palavra, <POS:flags>
or PREFIX:palavra.
ADD A NEW DEPENDENCY?
---------------------
1. Find the declaration of the flags or pos matrix;
2. Add a row relating a flag or POS to a dependency link (DEPTAG).
CREATE MY OWN COLUMN PATTERN IN THE XLSX?
-------------------------------
1. Find the declaration of the patterns matrix;
2. Copy the row for the AUTO pattern and rename it to NAME;
3. For each entry (represented in the top comment) add the index of the column
inside the XLSX file; if V is in column H, then add col('H') to the entry for
the VERB;
4. Use it by passing the argument -pattern=NAME.
ADD A COLUMN TO THE AUTOMATIC PATTERN?
----------------------------
1. Add a new constant to the list indicated in #dependency structure, with the
format _NOVA => an id different from all the others already defined;
2. Increment the value of the DEPENDENCYSIZE constant by 1;
3. In the writeDependency routine, add a call for the new dependency
NOVA_DEPENDENCIA. Example:
(id, fixed, expected) = printDepLink(lineno, 'NOVA_DEPENDENCIA', pattern,
arr, base, _NOVA, prvid, id, fixed, expected, 0);
4. In the guessPattern routine, add one more entry to the else-if chain
indicating the name of the column 'NOVACOLUNA' to be added, where i is the
number of the modifier. Example:
elif (str == ("NOVACOLUNA" + i)) pattern[DEPENDENCY1 + (i-1)*DEPENDENCYSIZE
+ _NOVA] = position;
5. Create the column in the input file;
6. The syntax is the same as for any other modifier (the use of prefixes may
be necessary).
ATTACH A MODIFIER TO THE VERB INSTEAD OF THE PREVIOUS MODIFIER?
-----------------------------------------
1. Create the AttachV$ column ($ is the modifier number) if it does not exist;
2. Mark the AttachV$ column with a +.
INDICATE THAT A MODIFIER IS AN INDIRECT COMPLEMENT?
-----------------------------------------
1. Put CINDIR:palavra in the modifier's C$ cell.
DEFINE A PREPOSITION FOR THE NEXT MODIFIER WHEN IT (MOD) IS NOT KNOWN?
---------------------------------------------------------
1. Create one more set of columns for the next modifier;
2. Fill in the preposition column;
3. Leave C$ blank or indicate its <POS>.
CHANGE THE HEADER OF THE RULES FILE?
---------------------------
1. Change the writeHeader routine at the end of xipificator_aux_functions.pl.