Processing Frozen Sentences in Portuguese: Automatic Rule and Example Generation from a Lexicon-Grammar
Ana Isabel Silva Galvão
Dissertation for obtaining the Master Degree in
Information Systems and Computer Engineering
Supervisor(s): Prof. Nuno João Neves Mamede
Prof. Jorge Manuel Baptista
Examination Committee
Chairperson: Prof. Paolo Romano
Supervisor: Prof. Nuno João Neves Mamede
Member of the Committee: Profa. Helena Moniz
May 2019
Acknowledgements
I want to start by thanking Professors Nuno Mamede and Jorge Baptista for their tireless help, advice
and useful critiques of this research work. I especially thank Professor Nuno for his crucial (and
very frequent) advice to "simplify things", an ability that I frequently lose. Reconciling my working
schedule with the development of this work was not always an easy task, but both Professors always
gave their best to ease the situation.
I deeply thank my mother Isabel and my father Luís for teaching me the value of hard work, and for al-
ways supporting me unconditionally, for being my safety net and for boosting my confidence whenever
I had a hard time finding it. I thank my boyfriend Henrique for being so incredibly patient even when
I was tired and mean, and for being there for me in any situation - even when it required being enclosed
inside the house on a lovely sunny day. Finally, a heartfelt thank you to my brother João, who endured
all the thesis journey with me, day and night, facing the worst days with me, all my tantrums and stress
during this period, always making sure that I never felt alone.
Without them I would not have been able to do this.
Lisbon, 10th of May, 2019
Ana Isabel Silva Galvão
Resumo
Expressões fixas são expressões multi-palavra que constituem um grande conjunto da léxico-gramática
de muitas línguas, embora a sua frequência em textos seja, muitas vezes, baixa. Analisar expressões fixas
é uma tarefa desafiante porque estas são conjuntos de palavras sintaticamente analisáveis, mas cujo
significado é não-composicional. Dado um sistema de Processamento de Língua Natural para Português
Europeu, o principal objetivo deste projeto é usar a matriz que contém a mais recente descrição linguís-
tica de forma a conseguir traduzi-la para regras Xerox Incremental Parser (XIP), permitindo ao sistema
não só identificar as frases manualmente produzidas que podem ser encontradas na matriz, mas tam-
bém as automaticamente geradas a partir destas, através da aplicação das transformações permitidas
por cada construção.
De forma a atingir esse objetivo, o gerador de regras foi reconstruído de tal forma que as regras geradas
incluam não apenas a estrutura básica do idioma mas também as várias transformações ou redução de
certos elementos a pronomes que podem ser aplicados a cada frase. Um módulo que gera automatica-
mente este tipo de frases a partir das frases base foi também desenvolvido.
Também foi implementada validação automática de forma a verificar o desempenho do sistema, que foi
globalmente melhorado quando comparado com o sistema anterior, permitindo uma identificação mais
correta e abrangente de expressões fixas.
Palavras-Chave
Processamento de Língua Natural
Expressões Fixas
Idiomas verbais
Expressões multipalavra
Categoria gramatical
Abstract
Frozen sentences are multi-word expressions that constitute a large set of the Lexicon-Grammar
of many languages, though their frequency in texts is often very low. Parsing frozen sentences is a
challenging task because they are syntactically analyzable strings whose meaning is non-compositional.
Given an existing Natural Language Processing (NLP) system for European Portuguese, the main goal
of this project is to use the matrix containing the most recent linguistic description in order to be able to
correctly translate it to XIP rules, allowing for it to identify not only manually produced sentences, but
also automatically generated ones from the base sentences by applying the transformations authorised
by each construction.
In order to achieve that goal, the rule generator was rebuilt so that the generated rules include not only
the basic structure of the idiom, but also the several transformations or reduction of certain elements
to pronouns that may be applied to each sentence. A module that automatically generates this type of
sentences from the base sentences was also developed.
Automatic validation was also implemented in order to verify the performance of the system, which
was overall improved when compared to the previously existent system, allowing for a more correct
and inclusive identification of frozen expressions.
Keywords
Natural Language Processing
Frozen Sentences
Verbal Idioms
Parsing Multiword Expressions
Part of Speech
Table of Contents
Acknowledgements i
Abstract v
List of Figures ix
List of Tables xii
List of Acronyms xiii
1 Introduction 1
1.1 Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Frozen Sentences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Statistical and Rule-Based Natural Language Processing Chain (STRING) . . . . . . . . . 8
1.5 XIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Related work 17
2.1 Representing Frozen Expressions on an XLSX file . . . . . . . . . . . . . . . . . . . . . . 17
2.1.1 Converting XLSX to CSV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.2 Validating the CSV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1.3 Xipificator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Previous Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.1 Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 Solution 25
3.1 Lexicon-Syntactic Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2 Xipificator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.1 Converter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.2 Rule Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.3 Example Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.4 Example Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3 Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Evaluation 43
4.1 Analysing the corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Evaluation method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3.1 Base sentences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3.2 Artificially generated sentences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.4 Previous solution vs. Developed solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5 Conclusions 57
5.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
References 59
A Conversion to XIP rules 61
B Readme of the program 71
List of Figures
1.1 STRING architecture [11] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 Output tree following pre-processing, disambiguation, and chunking [2]. . . . . . . . . . 13
2.1 General aspect of the matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Modules of the validator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Scheme representing the XIP rules generation; the input is the XLSX file, converted to a
CSV file, which is validated and, in parallel, used for generating XIP rules. . . . . . . . . . 22
3.1 Comparing the two systems: orange represents what was re-written, green what was added. 26
3.2 Structure of the xipificator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 A schematic representation of the process of generating rules. . . . . . . . . . . . . . . . . 29
3.4 A frozen sentence and the heads of its constituents. . . . . . . . . . . . . . . . . . . . . . . 31
3.5 General mechanism for generating example sentences . . . . . . . . . . . . . . . . . . . . . 36
3.6 Example validation criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1 Comparing the performance (recall) of the developed system against the performance of
the previous one for the manually produced sentences. . . . . . . . . . . . . . . . . . . . . 52
4.2 Comparing the performance of the developed system against the performance of the pre-
vious one for the artificially generated sentences. . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 Comparing the performance of the developed system against the performance of the pre-
vious one for the artificially and manually generated sentences. . . . . . . . . . . . . . . . 54
List of Tables
1.1 Summarized Class Structure, where N represents a free noun phrase, while C is a frozen
constituent; the indices 0, 1, 2 and 3 correspond to the subject and to the first, second, and third
complements. Prep is a preposition; w is any sequence of complements (possibly none). . . . 7
1.2 Operators and their functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1 XIP syntax for POS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 XIP translation for each column . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Restrictions to be added to the rule of the base sentence . . . . . . . . . . . . . . . . . . . . 34
4.1 Sentence distribution per class. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Manually produced sentences correctly identified as frozen. . . . . . . . . . . . . . . . . . 46
4.3 Artificially generated sentences for [PronA] correctly identified as frozen . . . . . . . . . 48
4.4 Artificially generated sentences for [PronR] correctly identified as frozen . . . . . . . . . 48
4.5 Artificially generated sentences for [PronPos] correctly identified as frozen . . . . . . . . 49
4.6 Artificially generated sentences for [PronD] correctly identified as frozen . . . . . . . . . 49
4.7 Artificially generated sentences for [RDat] correctly identified as frozen . . . . . . . . . . 50
4.8 Artificially generated sentences for [PassSer] correctly identified as frozen . . . . . . . . 50
4.9 Artificially generated sentences for [PassEstar] correctly identified as frozen . . . . . . 50
4.10 Number of manually produced sentences identified as frozen according to the defined
criteria, for both systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.11 Number of artificially generated sentences identified as frozen according to the defined
criteria, for both systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.12 Number of sentences (manually and artificially generated) identified as frozen according
to the defined criteria, for both systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
A.1 XIP Rule restrictions and instantiation for the class C1 and the example O João abanou o
capacete . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
A.2 XIP Rule restrictions and instantiation for the class CDN and the example O Rui sondou a
opinião da Inês . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
A.3 XIP Rule restrictions and instantiation for the class CAN and the example O João matou a
fome do Pedro. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
A.4 XIP Rule restrictions and instantiation for the class CNP2 and the example O Rui cortou o
problema pela base. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
A.5 XIP Rule restrictions and instantiation for the class C1PN and the example A Rita afiou os
dentes ao dinheiro. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
A.6 XIP Rule restrictions and instantiation for the class C1P2 and the example O casaco custou
os olhos da cara do Rui. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
A.7 XIP Rule restrictions and instantiation for the class CPPN and the example O João comprou
gato por lebre ao Pedro. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
A.8 XIP Rule restrictions and instantiation for the class CPP and the example O Zé bate com
o nariz na porta lit: ‘Zé hit with his nose on the door’. . . . . . . . . . . . . . . . . . . . . . 66
A.9 XIP Rule restrictions and instantiation for the class CP1 and the example O Zé bateu em
retirada. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
A.10 XIP Rule restrictions and instantiation for the class CPN and the example O Zé desceu na
consideração da Ana. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
A.11 XIP Rule restrictions and instantiation for the class C0 and the example A sorte bateu à
porta do Pedro. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
A.12 XIP Rule restrictions and instantiation for the class C0E and the example Vai pentear
macacos!. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A.13 XIP Rule restrictions and instantiation for the class CADV and the example O Pedro não
nasceu ontem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
A.14 XIP Rule restrictions and instantiation for the class CV and the example A resposta não se
fez esperar. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
List of Acronyms
L2F Spoken Language Systems Laboratory.
MWE Multiword Expressions.
NLP Natural Language Processing.
PARSEME PARSing and Multi-word Expressions.
POS Part-of-Speech.
STRING Statistical and Rule-Based Natural Language Processing Chain.
TRE Tree Regular Expression.
XIP Xerox Incremental Parser.
Chapter 1
Introduction
Verbal idioms are idiomatic (semantically non-compositional) Multiword Expressions (MWE)
consisting of a verb and at least one constrained argument slot [5]. Therefore, they are consid-
ered frozen sentences because the verb and at least one of its arguments are frozen together,
that is, they present idiosyncratic and semantically unpredictable distributional constraints. This means
that, unlike free sentences, their meaning cannot be calculated from each individual component, but
rather from the sentence as a whole [5]. Removing any element and replacing it with something else
would turn the sentence to its literal meaning or result in an unacceptable utterance. However, usually,
one or more of the argument noun phrases are distributionally free, which means that they can vary
(within generic distributional constraints) without affecting the global meaning of the sentence. On the
other hand, this type of sentences also differs from free sentences because they block transformations
that should otherwise be possible, given the syntactic properties of the verb and its arguments [3]. One
example of this type of sentences is: O João abriu os cordões à bolsa, lit: ‘João opened the laces to the bag’, ‘to pay for something’.
In order for the sentence to maintain its meaning:
• None of the complements may have distributional variations (except for the subject);
• The combination abrir-cordões is frozen;
• cordões cannot be replaced by any other expression, nor be modified by adjectives;
• Replacing à bolsa by any other expression would turn the sentence to its literal sense.
Finally, frozen sentences represent a problem for many NLP systems because they cannot simply be
treated as a single block [4]; on the contrary, they have a syntactic structure that is amenable to analysis, unlike
compound lexical items (nouns, adverbs, conjunctions, etc.). Besides, their elements can appear discon-
tinuously and they may also present some formal variations, often being ambiguous - the same sequence
may have a literal and a figurative meaning - and in that case only an extended context can disambiguate
them [3].
Given these facts, it is possible to conclude that the integration of this specific type of expressions in
NLP systems, in order to obtain an accurate semantic parsing, is a challenging task. A great amount of
work has been done in this area, such as a European Portuguese annotated corpus built in the scope of
the project PARSing and Multi-word Expressions (PARSEME)1, an interdisciplinary scientific network
devoted to the role of MWE in parsing2. For the purpose of this project, a previously built lexicon-
syntactic matrix was used, which encodes the linguistic information, using the framework of Gross
[9]. Its information will then be integrated into a fully-fledged NLP system built for Portuguese, the
STRING [11]. The STRING system uses the XIP [5] parser to segment sentences into chunks and extract
dependency relations among chunks’ heads [12]. Considering that most idioms have a “normal” syn-
tactic structure, which follows the ordinary word combinatory rules of the general grammar, STRING’s
strategy consists in parsing them first as ordinary sentences and only then identifying specific word
combinations, whose meaning should not be calculated in a compositional way. The idiomatic word
combinations are identified by a dependency, FIXED, which takes as arguments the verb and the frozen
elements of the idiomatic expression (the number of arguments depends on the type of verbal idiom
involved) [5].
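STRING's two-stage strategy described above (parse the idiom first as an ordinary sentence, then identify the specific word combination and mark it with a FIXED dependency) can be illustrated with a minimal sketch. The lexicon layout, the `match_fixed` helper and the tuple representation below are hypothetical simplifications for illustration only, not XIP's actual data structures or API.

```python
# Minimal sketch of the second stage of the strategy: the sentence is assumed
# to be already parsed into a verb and its dependent lemmas; we then check
# whether the verb together with its frozen arguments matches a lexicon entry
# and, if so, emit a FIXED dependency. All names here are illustrative.

# Hypothetical lexicon entries: (verb lemma, frozen argument lemmas)
FROZEN_LEXICON = [
    ("abrir", ("cordão", "bolsa")),  # O João abriu os cordões à bolsa
]

def match_fixed(verb, dependents):
    """Return a FIXED(...) tuple if the verb plus frozen dependents are listed."""
    for lex_verb, lex_args in FROZEN_LEXICON:
        if verb == lex_verb and all(arg in dependents for arg in lex_args):
            return ("FIXED", verb) + lex_args
    return None

# Dependents as the parser might have produced them (lemmatized):
print(match_fixed("abrir", ["João", "cordão", "bolsa"]))
```

Note that the free subject (João) plays no role in the match: only the verb and the frozen elements are checked, mirroring the idea that free argument slots may vary without affecting the idiomatic reading.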
1.1 Goal
The main goal of this dissertation project is to use the matrix containing the most recent linguistic
description in order to be able to correctly translate it to XIP rules, allowing for it to identify not only
manually produced sentences, but also automatically generated ones from the base sentences by apply-
ing the transformations authorised by each construction. In order to do so, three essential tasks were
considered:
• To rebuild the rule generator so that the generated rules include not only the basic structure of the
idiom, but also the several transformations or reduction of certain elements to pronouns that may
be applied to each sentence;
• To create a module that automatically generates sentences resulting from applying the foremen-
tioned transformations to the base sentences;
• To create an automatic validator that compares the expected results to the obtained ones, after run-
ning both manually produced and automatically generated sentences in STRING. This validates
not only the correctness of the generated rules, but also of the generated examples.
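The third task, the automatic validator, amounts to comparing expected analyses against the ones actually obtained over both kinds of sentences. A rough sketch, with an invented data layout (triples of sentence, expected result, obtained result), might look like:

```python
# Hedged sketch of the automatic validator: count how many test sentences were
# analysed as expected. The triple layout and the FIXED(...) strings are
# illustrative, not the actual format used by STRING.
def validate(results):
    """Return (correct, total) over (sentence, expected, obtained) triples."""
    correct = sum(1 for _, expected, obtained in results if expected == obtained)
    return correct, len(results)

results = [
    ("O João abriu os cordões à bolsa", "FIXED(abrir,cordões,bolsa)",
     "FIXED(abrir,cordões,bolsa)"),
    ("A Maria amarrou o burro", "FIXED(amarrar,burro)", None),  # missed
]
correct, total = validate(results)
print(f"{correct}/{total} correctly identified")
```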
1 https://typo.uni-konstanz.de/parseme/
2 For more information on this type of research please refer to [6].
1.2 Thesis Structure
The remainder of this document is structured as follows:
• Chapter 2 briefly describes the work developed so far, namely the representation of the frozen
expressions on a lexicon-grammar matrix, saved in an XLSX file, and the previous implementation
for automatic rule generation;
• Chapter 3 presents the changes made to the lexicon-grammar description, as well as the solution
developed during this project;
• Chapter 4 presents the evaluation methods of the developed solution, the results and their
analysis. A comparison between the previous implementation and the developed one is also per-
formed;
• Chapter 5 presents the conclusions drawn from this project, as well as the perspectives of future
work.
1.3 Frozen Sentences
Frozen sentences are elementary sentences where the main verb and at least one of its argument
noun-phrases are distributionally constrained, and usually the global meaning of the expression cannot
be calculated from the individual meaning of its constituents when used independently. Therefore, the
expression should be taken as a complex, multiword, lexical unit [3].
To date, a set of 2,561 European Portuguese verbal idioms has been classified into 15 formal
classes according to their structure and distributional constraints, as well as their syntactic properties.
Table 1.1 shows the breakdown of frozen sentences per class. The theoretical and methodological frame-
work of M. Gross [9, 10] was used to classify this type of expressions. This framework bases its classi-
fication on the structure of the sentence, as well as the number and type of arguments of the main verb
[3]. Ten classes were already considerably developed during a previous development, but the remaining
four are still at an early stage. These are the classes C0, C0E, CADV and CV, which are not very numerous.
In a distributionally free sentence, the overall meaning is determined from the individual meaning
of the elements in the construction, but the meaning of a frozen sentence cannot be directly calculated
from the meaning that the component elements may present when used separately [4]. In Chapter 3,
a step-by-step description of a sentence will be presented. Take as an example the following sentence:
O João abriu os cordões à bolsa, lit: ‘João opened the laces to the bag’, ‘to pay for something’. Here,
no element can be substituted while keeping the overall meaning of the sentence, where the verb-object
combination abrir-cordões is frozen, as well as the combination with the instrumental à bolsa, lit: ‘to the
bag’. One can neither replace cordões, lit: ‘laces’ with another word, nor modify it using a free adjective.
Also, removing the instrumental complement à bolsa, lit: ‘to the bag’ and replacing it with something
else would turn the sentence to its literal meaning.
A step-by-step generation of a rule for the sentence O João virou o bico ao prego, as represented in
Figure 3.4, may be found in Chapter 3.
However, frozen sentences usually present some, often highly constrained, formal variation. For ex-
ample, in the sentence O João entregou a alma a Deus, lit: ‘João delivered the soul to God’, ‘to die’,
the noun Deus, lit: ‘God’, could be replaced by Senhor, lit: ‘Lord’. This would not change the
meaning of the sentence, though the variation paradigm is rather short and often unpredictable. The
frozen verb-noun combination is responsible for this distributional constraint, which can be consider-
ably different from the constraints imposed by the verb when functioning as an independent lexical
unit. For example, the verb vender, ‘to sell’, admits both human and non-human (animal and abstract)
nouns for its subject when its object is alma, lit: ‘soul’, but, in the frozen sentence, only human nouns are
allowed [3].
Another example on how this type of sentences differs from free sentences is the blockage of transfor-
mations that should otherwise be possible, given the syntactic properties of the verb and its arguments.
As a free sentence, the passive transformation with the auxiliary verb ser, lit: ‘to be’, is applicable to the
example O João abriu o programa com chave de ouro lit: ‘João opened the program with a golden key’.
It becomes O programa foi aberto com chave de ouro pelo João lit: ‘The program was opened with a
golden key by João’.
Direct transitive constructions (without prepositional complements)
C1 This class represents sentences with a fixed direct complement (without any free determina-
tive complements, see class CDN below): A Maria amarrou o burro, lit: ‘Maria tied the donkey’,
‘to pout’. Sentences belonging to this class may suffer the transformations [Pass-ser] and
[Pass-estar]: O burro foi amarrado pela Maria, O burro está amarrado pela Maria, lit: ‘The
donkey was tied by Maria’.
CDN The sentences belonging to this class also feature a frozen direct complement, but its head
contains a free determinative complement (de N, of N); this determinative complement can-
not undergo a dative restructuring [RDat]: O João salvou a pele do Presidente, lit: ‘João
saved the skin of the President’, meaning ‘to save someone’. Sentences belonging to this class
may suffer the transformation [PronPos].
CAN This class is similar to CDN, but its free determinative complement might undergo a dative
restructuring, [RDat]. This is a syntactic transformation that splits a complex noun phrase,
where a metonymic (part-whole) relation is observable between N1 and N2 (N1 de N2, N1 of
N2). This originates two constituents, and the second phrase assumes the syntactic function
of indirect (dative) complement: O Manuel quebrou o coração da Maria, lit: ‘Manuel broke
Maria’s heart’, which becomes O Manuel quebrou o coração à Maria, lit: ‘Manuel broke the
heart to Maria’; the new dative complement can then undergo the dative pronouning, i.e.
a reduction to a dative pronoun, e.g. O Manuel quebrou-lhe o coração, lit: ‘Manuel broke to
her the heart’, meaning ‘Manuel broke her heart’. Sentences belonging to this class may also
suffer the transformation [PronPos].
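As a very rough illustration of the dative restructuring [RDat] just described, the contracted preposition de + article introducing the determinative complement can be rewritten as its dative counterpart a + article. The sketch below operates on raw strings for simplicity; the actual system works on the syntactic analysis, and the helper and contraction table are assumptions made for this example.

```python
# Toy illustration of [RDat]: rewrite the last de+article contraction as the
# corresponding dative contraction ("da Maria" -> "à Maria",
# "do Pedro" -> "ao Pedro"). A real implementation would operate on parses,
# not on raw strings.
CONTRACTIONS = [(" da ", " à "), (" do ", " ao "), (" das ", " às "), (" dos ", " aos ")]

def rdat(sentence):
    """Apply the dative restructuring to the last determinative complement."""
    for de_form, a_form in CONTRACTIONS:
        idx = sentence.rfind(de_form)
        if idx != -1:
            return sentence[:idx] + a_form + sentence[idx + len(de_form):]
    return sentence

print(rdat("O Manuel quebrou o coração da Maria"))
# -> O Manuel quebrou o coração à Maria
```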
Direct transitive constructions (containing one prepositional complement)
CNP2 This class has a free direct complement and a fixed prepositional complement: O Eduardo
levantou o Pedro da lama, lit: ‘Eduardo took Pedro out of the mud’ ‘to help someone get out
of a complicated situation’. Sentences belonging to this class may suffer the transformations
[PronR], [PronA], [Pass-ser] and [Pass-estar].
C1PN This class has a fixed direct complement and a free prepositional complement. O Pedro ac-
ertou as agulhas com a Rita, lit: ‘Pedro got the needles straight with Rita’ ‘to get things right
with someone’. Sentences belonging to this class may suffer the transformations [PronD],
[RDat], [PronPos], [Pass-ser] and [Pass-estar].
C1P2 The sentences belonging to this class have a fixed direct complement and a fixed prepositional
complement: O Pedro cortou o problema pela raiz, lit: ‘Pedro cut the problem at its
root’, ‘solve a problem by addressing its causes’. No transformations can be applied to the
sentences from this class.
Prepositional constructions
CP1 This class contains sentences with only one prepositional complement: O Pedro meteu-se
num trinta e um, lit: ‘Pedro got himself into a thirty one’, ‘to get himself into a complicated
situation’. No transformations can be applied to the sentences from this class.
CPN This class is defined by having a prepositional phrase where the head-noun C is frozen with
the verb, while its determinative complement is free [3]: O Manuel foi ao pelo do Pedro, lit:
‘Manuel went to Pedro’s fur’ ‘Manuel hit Pedro’. Sentences belonging to this class may suffer
the transformations [RDat] and [PronPos].
CPP This class contains sentences with two prepositional complements: O Zé bateu com o nariz
na porta, lit: ‘Zé hit with his nose on the door’ ‘finding a place to be closed or not achieving
something’. Sentences belonging to this class may suffer the transformations [PronD] and
[PronPos].
CPPN This class is defined by containing three essential complements where at least one is frozen
with the verb3 O Pedro apanhou o Filipe com a boca na botija, lit: ‘Pedro caught Filipe with
3 Because the number of sentences is small, no further sub-classification was established, as was done for other structures.
Notice that this class may admit direct complements as well.
his mouth on the cannister’, which means ‘to find someone red-handed’. Sentences belonging
to this class may suffer the transformations [PronR], [PronA], [PronD], [RDat], [PronPos],
[Pass-ser] and [Pass-estar].
Other constructions
C0 In this type of constructions, the subject is frozen together with the verb (which might also
accept other complements, either free or fixed). An example of this is: A sorte sorriu ao Pedro,
lit: ‘Luck smiled at Pedro’, ‘Peter was lucky’. Sentences belonging to this class may suffer the
transformations [PronA], [PronD], [RDat] and [PronPos].
C0E This class is constituted by frozen sentences mandatorily in the imperative or exclamative
mood; the subject is often a second person, i.e. the addressee, which is zeroed: Vai pentear
macacos!, lit: ‘Go comb monkeys!’ ‘do not bother me/anyone anymore’. No transformations
can be applied to the sentences from this class.
CADV In these constructions, the verb is frozen together with an adverb (and usually there are no
other complements): O Pedro não nasceu ontem, lit: ‘Pedro was not born yesterday’ ‘is not
dumb’. No transformations can be applied to the sentences from this class.
CV This class includes constructions involving two verbs, usually with a preposition connecting
the first verb V to the second verb Vc. The first verb should not be analyzed as an auxiliary
for the second verb: Ainda está para nascer alguém assim, lit: ‘It is yet to be born someone
like this’, meaning ‘there is no one like this person’. No transformations can be applied to the
sentences from this class.
Table 1.1: Summarized Class Structure, where N represents a free noun phrase, while C is a frozen constituent; the
indices 0, 1, 2 and 3 correspond to the subject and to the first, second, and third complements. Prep is a preposition;
w is any sequence of complements (possibly none).

Class  Structure                   Example                                                                  Count
C1     N0 V C1                     O João não abriu a boca, ‘be silent’                                       500
CDN    N0 V (C of N)1              O João atraiu os olhares da Ana, ‘draw someone’s eye’                       44
CAN    N0 V (C of N)1 = C1 to N2   O João calou a boca da Ana, ‘shut up someone’                              182
CNP2   N0 V N1 Prep2 C2            O Rui chamou a Inês à razão, ‘call to reason’                              172
C1PN   N0 V C1 Prep2 N2            A Maria desligou os aparelhos ao moribundo, ‘switch off the machines’      255
C1P2   N0 V C1 Prep C2             O João retomou o fio à meada, ‘resume the thread’                          291
CPPN   N0 V C1 Prep C2 Prep C3     O João vendeu gato por lebre à Maria, ‘sell cat for hare’                   46
CPP    N0 V Prep C1 Prep C2        O Zé não morre de amores pela Ana, ‘is not fond of’                        181
CP1    N0 V Prep C1                O Zé voltou à carga, ‘charge again onto something’                         662
CPN    N0 V Prep (C of N)1         O Zé caiu nas garras da Ana, ‘fall in the claws of’                        103
C0     C0 V w                      A sorte bafejou o Pedro, ‘luck blew over someone’                           21
C0E    V w                         Vai pentear macacos!, ‘go comb monkeys’, ‘get lost’                          1
CADV   N0 V Adv                    O Pedro não nasceu ontem, ‘was not born yesterday’                          70
CV     N0 V (Prep) Vc w            A resposta não se fez esperar, ‘did not have to wait much for something’    13
Total                                                                                                       2,542
1.4 STRING
STRING [11] is a hybrid statistical and rule-based NLP chain for Portuguese, which has been developed
by the Spoken Language Systems Laboratory (L2F) at INESC-ID Lisboa. STRING has a modular structure
and performs all the basic NLP tasks. The system’s architecture is shown in Figure 1.1.
LexMan [16] is a morphological tagger. It receives as input the text to be processed and starts by
tokenizing it, splitting the text into segments; it then associates all possible Part-of-Speech (POS)
tags to each segment. Besides this, the module is also responsible for the identification at the earliest
possible stage of certain special types of tokens, namely:
email addresses, ordinal numbers (e.g. 3o, 42a), numbers with thousands and fractional separators (in
Portuguese these are the dot . and the comma , respectively, e.g. 12.345,67), IP and HTTP addresses,
integers (e.g. 12345), several abbreviations with a dot . (e.g. a.C., ‘before Christ’; V.Exa., ‘Your Excellency’), numbers written
in full (e.g. duzentos e trinta e cinco, ‘two hundred and thirty-five’), sequences of interrogation and
exclamation marks, as well as ellipsis (e.g. ???, !!!, ?!?!, ...), punctuation marks (e.g. !, ?, ., ,, :, ;, (, ), [, ]),
symbols (e.g. <, >, #, $, %, &, +, -, *, <, >, =, @), and Roman numerals (e.g. LI, MMM, XIV). Naturally,
besides these special textual elements, the tokenizer identifies ordinary simple words, such as alface,
‘lettuce’. It also tokenizes as a single element sequences of words connected by hyphen, most of them
compound words, like fim-de-semana, ‘weekend’ [11]. Next, this module splits the text into sentences.
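These special token types can be approximated with ordinary regular expressions. The sketch below illustrates the idea for a handful of them; the patterns are simplified stand-ins written for this illustration, not LexMan's actual ones:

```python
import re

# Ordered patterns, first match wins. These are simplified stand-ins for
# a few of the special token types listed above, not LexMan's patterns.
PATTERNS = [
    ("email",   re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")),
    ("ordinal", re.compile(r"\d+[oa]")),                 # e.g. 3o, 42a
    ("decimal", re.compile(r"\d{1,3}(\.\d{3})*,\d+")),   # e.g. 12.345,67
    ("integer", re.compile(r"\d+")),                     # e.g. 12345
    ("roman",   re.compile(r"[IVXLCDM]+")),              # e.g. XIV, MMM
    ("punct",   re.compile(r"[!?.,:;()\[\]]+")),         # e.g. ???, ?!?!
]

def classify(token):
    """Return the special-token type of `token`, or 'word' by default."""
    for name, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return name
    return "word"
```

Ordering matters: the decimal pattern must be tried before the integer one, and uppercase-only tokens fall through to "roman" only when nothing more specific matched.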
Then, RuDriCo2 [7, 8] is applied. This module performs rule-based morphological disambiguation
and it also makes segmentation changes to the input, like joining segments (compound words) or
splitting them (contractions). MARv4 is a stochastic morphological disambiguator. It receives the result
of RuDriCo2 and selects the best POS tag for each segment, given its context. Finally, the
last module to be applied is XIP [13], a finite-state incremental parser developed by the Xerox Research
Centre Europe (XRCE), which uses a Portuguese rule-based grammar and is responsible for the syntactic
analysis4. This module is also responsible for parsing verbal idioms, so a more detailed description is
provided in Section 1.5.
Figure 1.1: STRING architecture [11]
4The Portuguese grammar for XIP was initially developed, starting in 2004, under a collaboration between L2F and the
Xerox Research Centre Europe [11]. Since then, the effort has been invested mainly by the L2F team.
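The sequence just described, LexMan, RuDriCo2, MARv4 and then XIP, amounts to a pipeline in which each module consumes the previous one's output. The sketch below shows only that data flow; every body is a placeholder standing in for the real component, not STRING's actual code:

```python
def lexman(text):
    """Tokenize and attach the set of all possible POS tags (placeholder)."""
    return [(token, {"noun", "verb"}) for token in text.split()]

def rudrico2(segments):
    """Rule-based disambiguation and resegmentation (placeholder)."""
    return segments

def marv4(segments):
    """Statistically pick one POS tag per segment (placeholder: just the
    alphabetically first candidate, for illustration)."""
    return [(token, sorted(tags)[0]) for token, tags in segments]

def xip(tagged):
    """Syntactic analysis: chunks and dependencies (placeholder)."""
    return {"tokens": tagged, "dependencies": []}

def string_pipeline(text):
    # Each module consumes the previous module's output, as in Figure 1.1.
    return xip(marv4(rudrico2(lexman(text))))
```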
1.5 XIP
This module is briefly described based on the documents [14, 15]. The parser allows for the
introduction of lexical, syntactic, and semantic information to the output of the previous modules, as
well as performing the syntactic analysis of the text through the following processes:
• Lexicons: allow for the information to be added to the different tokens. In XIP there is a pre-
existing lexicon, which can be enriched by adding lexical entries or changing the existing ones;
• Chunking Rules: perform a shallow parsing or basic syntactic analysis of the text. For each phrase
type (e.g. NP, PP, VP, etc.) a sequence of categories is grouped into elementary syntactic structures,
called chunks. The chunk types depend on the POS of their head element, usually the last element
of the chunk;
• Dependency Rules: dependencies are syntactic dependency relations between different chunks,
chunk heads, or elements inside chunks and they allow a deeper and richer knowledge about the
text’s information and content. Major dependencies correspond to the so-called deep parsing syn-
tactic functions, such as SUBJECT, DIRECT COMPLEMENT, etc. Other dependencies are just auxiliary
relations, mostly used to calculate the deeper syntactic dependencies. For example, the CLINK de-
pendency links each argument of a coordination to the coordinative conjunction it depends on.
A given dependency can be percolated from one argument to the next when the sentence contains
coordinated phrases.
Verbal idioms are identified by STRING using a dependency FIXED linking the key elements of the
structure (the main verb and frozen head nouns). The lexicon-grammar of verbal idioms was integrated
in the rule-based parsing module of the NLP chain in the form of parsing rules. Since frozen sentences are
syntactically well-formed structures, complying with the general word-combination rules of grammar,
the following strategy was adopted to parse them. First, general parsing rules can be applied, as to any
other structure. Then, another set of rules extracts the FIXED dependency based on the previous parse,
and groups together the frozen elements of the idiom, while keeping intact the syntactic structure of
the dependency. Finally, the FIXED dependency is the one used to further calculate the semantics of the
sentence [4].
The fundamental data representation unit in XIP is the node. It has a category, feature-value pairs and
brother nodes. Taking as an example the following node:
Pedro: noun[human, individual, proper, first name, people, sg, masc, maj]
This node represents the noun Pedro and it has several features, used to express its properties: Pedro
is a noun that represents a human, a masculine individual (feature masc); the node also has features
to describe its number (singular, sg) and the fact that it is spelled with an upper-case initial letter (feature
maj). Moreover, features can be instantiated (operator =), tested (operator :), or deleted (operator =~)
within all types of rules. While instantiation and deletion are all about setting/removing values to/from
features, testing consists of checking whether a specific value is set for a specific feature, as shown in
Table 1.2:
Lexicons
XIP allows the definition of custom lexicons (lexicon files), which add new features that are not stored
in the standard lexicon. Having a rich vocabulary in the system can be very beneficial for improving its
recall. In XIP, a lexicon file begins by simply stating Vocabulary:, which tells the XIP engine that the
file contains a custom lexicon. Only afterwards come the actual additions to the vocabulary. The lexical
rules attempt to provide a more precise interpretation of the tokens associated with a node. They have
the following syntax (the parts of the rule contained in parentheses are optional):
lemma(: POS([features])) (+)= (POS)[features].
Examples of lexical rules:
$US = noun[meas=+, curr=+].
eleitor: noun += [human=+].
acenar += verb[vdat=+].
The first two examples show how to add new features to existing words. In the first case, the features
meas (measure) and curr (currency) are added to $US, which is POS-tagged as a noun; in the second case,
the human feature is added to the noun eleitor ('elector'). In the third case, the word acenar ('to wave'),
irrespective of its former POS, is given the additional reading of verb.
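The rule syntax above is regular enough that a single pattern can split a rule into its parts. A sketch of such a reader (illustrative only; this is not XIP's own rule parser):

```python
import re

# Sketch of a parser for the lexical-rule syntax
#   lemma(: POS([features])) (+)= (POS)[features].
# Illustrative only -- this is not XIP's own rule reader.
RULE = re.compile(
    r"(?P<lemma>\S+?)"                # lemma, e.g. eleitor or $US
    r"(?:\s*:\s*(?P<lpos>\w+))?"      # optional ': POS' on the left side
    r"\s*(?P<op>\+?=)\s*"             # '=' (replace) or '+=' (add)
    r"(?P<rpos>\w+)?"                 # optional POS on the right side
    r"\s*(?:\[(?P<feats>[^\]]*)\])?"  # optional [feature list]
    r"\s*\.\s*$"
)

def parse_lexical_rule(rule: str) -> dict:
    m = RULE.match(rule)
    if m is None:
        raise ValueError(f"not a lexical rule: {rule!r}")
    return m.groupdict()
```

Applied to the three examples above, the parser separates the lemma, the operator (= vs. +=), the POS on either side, and the feature list.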
Table 1.2: Operators and their functions.
Type Example Explanation
Instantiated [gender=fem] The value fem is set to the feature gender
Deleted [acc=∼] The feature acc is cleared of all values on the node
Tested [gender:fem] Does the feature gender have the value fem?
[gender:∼] The feature gender should not be instantiated on the node
[gender:∼fem] The feature gender should not have the value fem
Chunking Rules
Chunking is the process by which sequences of categories are grouped into structures; this is done
using chunking rules. There are two types of chunking rules:
• Immediate dependency and linear precedence rules (ID/LP rules);
• Sequence rules.
In order to illustrate the syntax of the chunking rules, a few examples will be used. The first impor-
tant aspect to be taken into account is that each rule must be defined in a specific layer. This layer is
represented by an integer number, ranging from 1 to 300. Below is an example of how to define two
rules in two different layers:
1 > NP = (art;?[dem]), ?[indef1]. // layer 1
2 > NP = (art;?[dem]), ?[poss]. // layer 2
Layers are processed sequentially from the first one to the last. Each layer can contain only one
type of chunking rule. ID/LP rules are significantly different from sequence rules. ID rules describe
unordered sets of nodes and their syntax is the following:
layer> node-name -> list-of-lexical-nodes.
An example of an ID rule is:
1 > NP -> det, noun, adj.
Assuming that det, noun and adj are categories that have already been declared, this rule can be
interpreted as follows: whenever there is a sequence of a determiner, noun and adjective, regardless of
the order in which they appear, create a Noun Phrase (NP) node. Obviously, this rule applies to more
expressions than those desirable, e.g. o carro preto, lit: ‘the car black’, o preto carro, lit: ‘the black car’,
preto carro o, lit: ‘black car the’ and carro preto o lit: ‘car black the’. This is where LP rules come into
play: these rules work with ID rules to establish some order between the categories, while sequence
rules describe an ordered sequence of nodes. By being associated with ID rules, LP rules can apply to
a particular layer or be treated as a general constraint throughout the XIP grammar. LP rules have the
following syntax:
layer> [set-of-features] < [set-of-features].
Considering the following example:
1> [det:+] < [noun:+].
1> [noun:+] < [adj:+].
This illustration of chunking rules states that a determiner must precede a noun on layer one, and
that a noun must precede an adjective on the same layer (the actual grammatical rules governing the
relative position of adjectives and nouns are much more complex). This means that expressions such as
o preto carro ('the black car') will no longer be allowed, while o carro preto, lit: 'the car black', still will. It
is also possible to use parentheses to express optional categories, and a Kleene star to indicate that zero
or more instances of a category are accepted. The following rule states that the determiner is optional
and that zero or more adjectives are accepted, to form a NP chunk:
1> NP -> (det), adj*, noun.
Considering both LP rules established above, the following expressions are accepted: carro, lit: ‘car’,
carro preto, lit: ‘car black’, o carro preto, lit: ‘the car black’, o carro preto bonito, lit: ‘the car black
beautiful’.
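Under the two LP constraints, the ID rule NP -> (det), adj*, noun admits exactly one order: an optional determiner, the noun, then any number of adjectives. That admissible order can be checked with an ordinary regular expression over POS-tag sequences; the sketch below illustrates the combined effect of the rules, not XIP's matching engine:

```python
import re

def np_chunk(pos_tags):
    """Return True if the POS sequence forms an NP under the ID rule
    NP -> (det), adj*, noun constrained by the LP rules det < noun and
    noun < adj, i.e. an optional determiner, the noun, then adjectives."""
    sequence = " ".join(pos_tags)
    return re.fullmatch(r"(det )?noun( adj)*", sequence) is not None
```

The accepted sequences mirror the examples in the text: carro, carro preto, o carro preto, o carro preto bonito; while o preto carro (determiner, adjective, noun) is rejected.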
Finally, it is worth mentioning that these rules can be further constrained with right and/or left con-
texts. For example:
1> NP -> |conj| adj, noun |verb|.
This rule states that a conjunction appears at the left of the sequence of categories, and that a verb must
appear at the right side of that sequence. By applying this rule to a sentence such as E carros pretos há
muitos na estrada, lit: 'and cars black there are many on the road', the following chunk will be obtained:
NP[carros pretos].
Despite helping to constrain a rule even further, contexts are not saved inside a node.
The other kind of chunking rules, sequence rules, though conceptually different because they describe
an ordered sequence of nodes, are almost identical to the ID/LP rules as far as their syntax is concerned.
There are, however, some differences and additions:
• Sequence rules do not use the -> operator. Instead, they use the = operator, which matches the
shortest possible sequence. In order to match the longest possible sequence, the @= operator is
used instead;
• There is an operator for applying negation (˜) and another for applying disjunction (;);
• Unlike ID/LP rules, the question mark (?) can be used to represent any category on the right side
of a rule;
• Sequence rules can use variables.
The following sequence rule matches expressions like alguns rapazes/uns rapazes, lit: ‘some boys’,
nenhum rapaz, lit: 'no boy', muitos rapazes, lit: 'many boys' or cinco rapazes, lit: 'five boys'; [indef2]
and [q3] are features of lexical items:
1> NP @= ?[indef2];?[q3];num, (AP;adj;pastpart), noun.
Finally, consider the example O Zé bateu em retirada, lit: 'Zé beat in retreat', 'to run away'. At this
stage, after the pre-processing and disambiguation, and also after applying the chunking rules, the sys-
tem presents the chunking output tree illustrated on Figure 1.2.
Dependency Rules
This step is crucial for a richer understanding of texts. Dependency rules take the sequences of con-
stituent nodes, identified by the chunking rules, and identify syntactic dependency relations between
them. A dependency rule presents the following syntax:
|pattern| if <condition> <dependency_terms>.
In order to understand the pattern, it is first essential to understand what a Tree Regular
Expression (TRE) is. A TRE is a special type of regular expression that is used in XIP in order to establish
connections between distant nodes. In particular, TREs explore the inner structure of subnodes through
the use of braces ({}). The following example states that a NP node’s inner structure must be examined
in order to see if it is made of a determiner and a noun:
NP{det,noun}.
TREs support the use of several operators, namely:
• The semicolon (;) operator is used to indicate disjunction;
• The Kleene star (*) operator is used to indicate ’zero or more’;
Figure 1.2: Output tree following pre-processing, disambiguation, and chunking [2].
• The question mark (?) operator is used to indicate ’any’;
• The circumflex (ˆ) operator is used to explore subnodes for a category.
Hence, and returning to the dependency rules, the pattern contains a TRE that describes the structural
properties of parts of the input tree. The condition is any Boolean expression supported by XIP (with
the appropriate syntax), and the dependency_terms are the consequent of the rule.
The first dependency rules to be executed are the ones that establish the dependencies between the
nodes, as seen in the next example:
|NP#1?*, #2[last] |
HEAD(#2, #1)
This rule identifies HEAD relations (see below) in noun phrases. For example, in the NP a bela rapariga
(‘the beautiful girl’), the rule extracts a HEAD dependency between the head noun rapariga (‘girl’) and the
whole noun phrase — HEAD(rapariga, a bela rapariga).
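The effect of this rule can be mimicked in a few lines: for each NP chunk, link its last element to the whole chunk as a HEAD dependency, mirroring the #2[last] / #1 pairing. The chunk representation below is hypothetical, chosen only for this illustration:

```python
def extract_heads(chunks):
    """chunks: list of (label, [tokens]) pairs (hypothetical representation).
    For each NP, emit a HEAD dependency between its last token and the
    whole chunk, mirroring the rule's #2[last] / #1 pairing."""
    deps = []
    for label, tokens in chunks:
        if label == "NP" and tokens:
            deps.append(("HEAD", tokens[-1], " ".join(tokens)))
    return deps
```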
As already stated, the main goal of the dependency rules is to establish dependencies between the
nodes. The following output is the current result of applying these rules to the sentence O Zé bateu em
retirada, lit: ‘Zé beat in retreat’ ‘to run away’:
MAIN(bateu)
DETD(Zé,O)
VDOMAIN(bateu,bateu)
MOD_POST(bateu,retirada)
SUBJ_PRE(bateu,Zé)
FIXED(bateu,retirada)
NE_PEOPLE_INDIVIDUAL(Zé)
0>TOP{NP{O Zé} VF{bateu} PP{em retirada}}
The last two lines indicate that one named entity (NE) has been captured and classified in this sentence:
Zé has been identified as a HUMAN INDIVIDUAL, that is, a PERSON. The tag NE_PEOPLE_INDIVIDUAL signals
that the NE has been classified. The other dependencies listed above cover a wide range of binary
dependencies such as:
• The relation between a nominal head and a definite determiner (DETD);
• The verb (MAIN);
• The relation between a modifier and, in this case, the verb it modifies (MOD_POST);
• The subject of the verb (SUBJ_PRE);
• The fixed dependency identified between the verb and the noun (FIXED).
To see a complete list and a detailed description of all syntactic dependency relations as of May
2016, please refer to [2]. XIP’s syntax for these conditional statements also allows the operators & for
conjunction and | for disjunction. Parentheses are also used to group statements and establish a clearer
precedence.
Chapter 2
Related work
This chapter aims to describe both the architecture and previous behaviour of the system, as well as
the work done so far in the linguistic description.
2.1 Representing Frozen Expressions in an XLSX file
The lexicon-syntactic description of frozen expressions is represented in a matrix, as shown in Figure
2.1, contained in an XLSX file. This matrix is composed of a header and a set of properties for frozen
sentences; the description of each sentence occupies one line of the matrix. The first
column refers to the class, represented by a conventional code, defined based on M. Gross' criteria
[9] for describing frozen sentences. The possible values for this column are the ones defined in
Subchapter 1.3.
Figure 2.1: General aspect of the matrix.
The first few columns refer to how the rule should be generated:
Exotic No rule is generated because the structure of the sentence is atypical, or its use is deemed too
rare;
Fail Used to mark the cause for the validation error. If it is empty, it is assumed that there is no
error;
Ignore Determines what should be ignored when generating a rule;
AllManual If this column is checked, the content of the cell Manual will contain the XIP rule for this ex-
pression;
Manual Reserved cell, where the manual rule is inserted. This type of rule describes patterns that
cannot be automatically generated by the system;
Example A sentence to be used for testing with the validator;
Observations Remarks regarding the rule;
Other example Second example to be tested with the generated XIP rule;
Expected This cell contains the expected result to be produced by XIP's dependency list for the
expression. It is used only when there are problems, making it possible to compare
what should be produced with what was, in fact, obtained.
Distributional and verb-related properties
N0 = Nhum The head of the subject NP is a human noun, e.g. Maria;
N0 = N-hum Head of the subject NP of the sentence is not a human noun, e.g. casaco, ‘coat’;
Vse The verb in this expression presents an intrinsically pronominal (reflexive) construction, e.g. fazer-
se de Lucas, 'pretend to ignore something';
NegObrig The expression presents a construction containing an obligatory negation modifier, e.g. não
dar para as encomendas, 'someone who is unable to correspond to the requests';
V Main verb of the frozen construction;
PrepLink Preposition that links the first verb of the construction to a second verb, both frozen together
(class CV, see Chapter 1.3), e.g. Ainda está para nascer quem me há de ganhar nisto, lit: ‘It is
yet to be born the one who will beat me on this’;
Vc The second verb of a construction with two fixed verbs (class CV) e.g. Este caminho vai dar à
praia, lit: ‘This path leads to the beach’.
Constituent’s common components
C0 The lexical element that is the head of the constituent 0;
Det0 The (fixed) determiner of the constituent;
Modif0-E The (fixed) modifier, to the left of the constituent;
Modif0-D The (fixed) modifier, to the right of the constituent;
C0Manual XIP's manual rule for all the modifiers. It overrides the automatically generated rule.
This is useful when some exceptional rule representation is required.
On the other hand, constituents 1 to 4 contain, besides the aforementioned components, the following
ones1:
Prep1 Preposition that introduces C1;
AttachV1 By default, the N1 noun depends on the verb, unless it is introduced by the preposition
de (in which case it depends on the previous chunk). By checking this cell with a "+", a dependency
on the verb is created instead of the default dependency on the previous chunk2;
[PronR1] The (free) noun phrase N1 can be reduced to a reflexive pronoun; e.g. besides O Pedro entregou
tudo nas mãos de Deus, 'Pedro put everything in the hands of God', one could also find
O Pedro entregou-se nas mãos de Deus, 'Pedro put himself in the hands of God';
[PronD1] The complement N$ is distributionally free, and it can be reduced to a dative pronoun3; e.g.
O Pedro tirou o chapéu ao João, lit: 'Pedro took off the hat to João'; after a dative restructuring
(see [Rdat] below), it would become O Pedro tirou-lhe o chapéu, lit: 'Pedro took off
his hat';
[PronPos1] The (free) prepositional phrase "de N$" can be reduced to a possessive pronoun; e.g. O Zé
fala nas costas da Ana, ‘Zé speaks behind Ana’s back’ becomes O Zé fala nas suas costas,
‘Zé speaks behind her back’;
1The description is made for index 1, but it is the same for all constituents.
2The rules are generated considering STRING's operating behaviour.
3In this pronominalization process, the preposition a, 'to' (rarely para, 'to'), is also reduced.
[Pass-ser] The auxiliary copulative verb accepted for the passive of this construction is ser, ‘to be’; e.g.
A imprensa abafou um escândalo, ‘The press smothered a scandal’ becomes Um escândalo
foi abafado pela imprensa, ‘A scandal was smothered by the press’;
[Pass-estar] The copulative verb accepted for the passive can be any copulative verb except ser, 'to be';
the agentive subject is zeroed in the passive form; e.g. A imprensa abafou um escândalo,
lit: 'The press smothered a scandal', becomes Um escândalo está abafado pela imprensa, 'A
scandal is smothered by the press';
[Pass-se] This construction admits the pronominal passive form; it is currently not used because it does
not occur very often in verbal idioms;
[Neutra] This construction admits the neutral passive form; it is currently not used;
Normalized A particular set of predicates is paired with a generic verb; e.g. bater as botas, lit: 'kick the
boots', or ir para o maneta, lit: 'go to the one-handed man', are labeled as morrer, 'to die'.
Constituent 1 has two extra components:
ADV1 Adverbial complement (fixed), usually an adverb (for class CADV only), e.g. O Pedro foi
embora, 'Pedro went away';
[PronA1] The (free) noun phrase N$ can be reduced to an accusative pronoun, e.g. O João viu a Inês
pelo canto do olho, lit: 'João saw Inês from the corner of his eye', becomes O João viu-a
pelo canto do olho, lit: 'João saw her from the corner of his eye' (for classes CNP2,
with a free CDIR);
And components 2 to 4 contain two other exclusive components4:
[Rdat2] If selected, the sentence will allow for a dative restructuring operation, where a determinative
complement de N, 'of N', becomes a dative complement a N, 'to N', more closely attached to the
verb. This new dative complement is then often reduced to a dative pronoun5, e.g. O João
come as papas na cabeça do Pedro, lit: 'João eats the mash on Pedro's head', 'to make a
fool out of someone', becomes O João come-lhe as papas na cabeça, lit: 'João eats to him
the mash on head';
4The description is made for index 3, but it is the same for index 4.
5[Rdat$] always implies that the constituent can be reduced to a dative pronoun. Hence, whenever [PronD$] is marked
as +, [Rdat$] is -, and vice-versa.
[Sim2] This property is known as symmetry: two constituents of this construction can be coordinated
in a given syntactic position (either symmetric subjects or symmetric complements) and can
trade places without changing the global meaning of the sentence [1]; e.g. A Isabel juntou
os trapinhos com o Luís, lit: 'Isabel gathered her rags with Luís', has the same meaning as
O Luís juntou os trapinhos com a Isabel, lit: 'Luís gathered his rags with Isabel', which
is 'to get together/married'; hence, the two constituents can be coordinated. A pronominal
copy, also known as echo complement [1], such as um com o outro, can be added to the
sentence with coordinated constituents, but this copy is optional. The sentence then becomes
O Luís e a Isabel juntaram os trapinhos (um com o outro), lit: 'Luís and Isabel gathered
the rags (with one another)'. Symmetric constructions, i.e., the coordinated forms of these
frozen sentences, have been described and formalized in a separate work [1] and will not be
considered in this project.
2.1.1 Converting XLSX to CSV
A converter from XLSX to CSV has been previously developed. This converter is a simple script that
receives as its argument the XLSX file and transforms each of its cells into a value, separated by a comma.
This CSV file is then used to generate the XIP rules, so it is very important that this conversion does not
fail. For this purpose, the CSV file needs to be validated. This is the subject of Subchapter 2.1.2.
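The conversion itself is straightforward once the cell values are available; what matters for the downstream rule generator is that commas inside cells are escaped correctly. A sketch using Python's csv module (the real converter is a separate script; reading the XLSX, e.g. with a library such as openpyxl, is assumed to have happened already):

```python
import csv
import io

def rows_to_csv(rows):
    """rows: list of lists of cell values, assumed already read from the
    XLSX (e.g. with a library such as openpyxl). Returns CSV text.
    csv.writer quotes any cell that itself contains a comma, so a cell
    like 'a, b' survives the round trip instead of corrupting the matrix."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        # Empty cells become empty strings rather than the text 'None'.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buf.getvalue()
```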
2.1.2 Validating the CSV
The validation of the CSV is broken down into smaller, but fundamental, subtasks, shown in
Figure 2.2:
1. Validating whether the data conforms to what was expected (consistency of each element). Here
the vectors of classes and the possible fields to be ignored are defined. This step also
defines the validation matrix, with the following structure: [Column name, Validation Type,
Possible Values];
2. Asserting the consistency between each element, respecting the property of each sentence and
the values of each column, e.g. whether the values are consistent or inconsistent for nouns and
prepositions;
3. Validating class consistency by checking whether each class contains the expected arguments, in-
cluding the symmetry property;
4. Checking the consistency between column values; validation of whether all the restrictions are
being respected and there are no impossible combinations.
This validator was left as-is and was not subject to any alterations. It was used for validating the
matrix in this project.
Figure 2.2: Modules of the validator
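Subtask 1 revolves around a validation matrix of [Column name, Validation Type, Possible Values] triples. A minimal sketch of how such a table can drive per-cell checks; the column names, types, and value sets below are illustrative, not the validator's actual ones:

```python
# Each entry of the validation matrix: (column name, validation type,
# possible values). Concrete columns and value sets here are illustrative.
VALIDATION_MATRIX = [
    ("Class",    "enum",     {"CNP2", "C1PN", "C1P2", "CP1", "CPN",
                              "C0", "CADV", "CV"}),
    ("NegObrig", "enum",     {"+", "-"}),
    ("V",        "nonempty", None),
]

def validate_row(row):
    """row: dict mapping column name -> cell value; returns error messages."""
    errors = []
    for column, vtype, allowed in VALIDATION_MATRIX:
        value = row.get(column, "")
        if vtype == "enum" and value not in allowed:
            errors.append(f"{column}: unexpected value {value!r}")
        elif vtype == "nonempty" and not value:
            errors.append(f"{column}: must not be empty")
    return errors
```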
2.1.3 Xipificator
The process starts with converting the XLSX file to a CSV file, if necessary. The previous system is
a Perl application that generated, in an automatic way, XIP rules that allow for the extraction
of the FIXED dependency. The input was an XLSX file containing a matrix with the lexical, syntactic
and semantic description of 2,520 manually produced frozen expressions, as represented in Figure 2.1.
Its output is a file containing the set of generated XIP rules, which are then included in STRING. The
conventions used in the matrix are represented in Table 3.1. The notation used for representing the
number/index of the constituent is $. Given that the index ranges from 0 to 4, N$=Nhum can become
N0=Nhum, N1=Nhum and so on. However, the $ will be replaced by 0 in the constituents common to all
dependencies (such as determiners, prepositions, modifiers...). This way it is not necessary to enumerate
the same constituents for every dependency.
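The $ convention can be expanded mechanically into its concrete variants. A small sketch of the substitution (illustrative only):

```python
def expand_indices(prop, indices=range(5)):
    """Expand a $-indexed property name into its concrete variants,
    e.g. 'N$=Nhum' -> ['N0=Nhum', 'N1=Nhum', ..., 'N4=Nhum'].
    Properties without a $ are returned unchanged (sketch only)."""
    if "$" not in prop:
        return [prop]
    return [prop.replace("$", str(i)) for i in indices]
```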
The system already includes a validation script for the generated rules. This script instantiates the
XIP rules for each generated example and runs them on STRING, as represented in Figure 2.3, later checking
whether the dependencies for the frozen expression are correctly extracted. The pipeline of procedures
allows for a correct identification of a frozen sentence.
2.2 Previous Implementation
The previous implementation for the automatic generation of XIP rules is provided in three files. The
main one is named xipificator.pl. It also uses a file named xipificator_aux_functions.pl and a file
named xipificator_validate.pl. These contain both the auxiliary functions necessary for the intermediate
tasks as well as a validator of the generated rules.
Figure 2.3: Scheme representing the XIP rules generation; the input is the XLSX file, converted to a CSV file, which is
validated and, in parallel, used for generating XIP rules.
The process starts with fetching the necessary arguments, such as the input file, the pattern name
and the XLSX sheet name. Then it converts the XLSX file to a CSV file, if necessary. After this it proceeds
with searching for the corresponding patterns. A pattern defines a correspondence between a column
and an element. If none exists or it is marked as AUTO, the system will guess the pattern based on the
names of the filled columns. At last, it writes the rule and a comment containing an example. The
method for writing a rule is the following:
1. Prints the restriction for the verb;
2. Prints the restriction for the negative form of the verb;
3. Prints the restriction for the clitic;
4. Searches for the elements of a dependency and prints their dependency links, and then a function
makes recursive calls until it reaches the last dependency.
The final output for each line of the matrix is a XIP rule, which consists of a set of restrictions that must
be obeyed so that the system extracts the FIXED dependency, thus identifying a construction as frozen.
If the restrictions are not well-formed, that is, if some restrictions are missing or misplaced, the system
may incorrectly extract the dependency or extract it containing the wrong arguments. The validation
performed by this system verified only whether the FIXED dependency had been extracted, ignoring the
correctness of its arguments.
2.2.1 Issues
Given that the sentences are described in a matrix, the system was built around a static number
of columns and attributes. After it was built, the matrix suffered a number of changes, including the
addition of transformations to be applied to the rules. These changes caused the system to stop being
able to generate XIP rules. Besides, neither transformations nor pronominalization had been foreseen
in the previous system. This required adopting a new strategy for generating rules that are able to
recognize expressions whose elements have undergone these formal changes. However,
there are almost no manually produced sentences that contain transformations. Therefore, these need
to be automatically generated from the manually produced ones, found in the matrix.
Another problem is that the previous system ran one sentence at a time, initializing the system
each time. This resulted in a long delay: around 18 hours to process the sentences
in STRING and obtain results. Finally, in this implementation, the only validation method used
by the system was to verify whether the FIXED dependency had been extracted. No further verification
concerning the arguments of the dependency was done, and these may not be correct.
Chapter 3
Solution
This chapter aims at describing the architecture and implementation of the proposed solution.
The main goal of this project is to use the matrix containing the most recent linguistic description
and to correctly translate it into XIP rules, allowing the system to identify not
only the manually produced sentences but also their variants, derived automatically from the examples encoded
in the matrix by applying the transformations authorised by each construction. In order to do so, the
rule generator was rebuilt so that the generated rules capture not only the basic structure of the idiom,
but also the several transformations, or the reduction of certain elements to pronouns, that may be applied
to each sentence.
A module was created for generating examples in an automatic way, by applying several possible
transformations, such as pronominalization and the passive form, to the base sentences found in the
matrix. These examples were to be run on STRING, alongside the base sentences.
Finally, an automatic validator was developed. This validator receives as input the results of processing
all the sentences, manually produced and artificially generated, and compares them against what was
expected.
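At its core, that comparison reduces to set operations over the extracted dependency triples: the FIXED dependency must be present and its arguments must match. A sketch, using a hypothetical (name, arg1, arg2) encoding of STRING's output:

```python
def check_fixed(extracted, expected):
    """extracted, expected: sets of (dependency, arg1, arg2) triples, a
    hypothetical encoding of STRING's output. Checks that the FIXED
    dependencies match exactly, arguments included -- unlike the previous
    validator, which only checked that FIXED had been extracted at all."""
    fixed_out = {d for d in extracted if d[0] == "FIXED"}
    fixed_exp = {d for d in expected if d[0] == "FIXED"}
    missing = fixed_exp - fixed_out
    unexpected = fixed_out - fixed_exp
    return (not missing and not unexpected, missing, unexpected)
```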
The main differences between the previous and current systems are represented in Figure 3.1, where the
green modules are the ones that were created from scratch, and the orange modules are the ones that
were restructured or suffered some form of modification. The blue ones were left untouched, but they
were integrated in the system. The inputs and outputs represented in this image will be detailed as each
module is observed in detail.
This chapter will start by describing the structure of the lexicon-syntactic matrix. Following this, the
architecture and implementation of the new modules and the changes performed on the existing ones
will be detailed.
3.1 Lexicon-Syntactic Matrix
To develop this project, a manually produced set of 2,561 European Portuguese verbal idioms was
used. This set is grouped in 15 formal classes according to their structure and distributional constraints.
These are described in a lexicon-syntactic matrix, an XLSX file, which will be used for both rule and
example generation, as shown in Figure 2.1.
Figure 3.1: Comparing the two systems: orange represents what was re-written, green what was added.
The version of the lexicon-syntactic description used in this project is version 13 from April 2019. This
version was meanwhile updated with corrections for the problems found during the development of
this work.
The matrix file starts with a header - containing the names of the columns - and it is followed by a set
of properties identifying a specific frozen sentence, one sentence per line. Each column contains an
element of the rule, or a restriction on it. The meaning of each column may be consulted in Chapter 2,
since it has not been changed during this project. During the development of this work, several column
values were found to be incoherent or even incorrect. Some of these problems were detected while generating
the rules, while others were detected using the previously existing validator of the matrix. Besides
this, the values of the matrix have meanwhile evolved, and some columns
became more inclusive regarding the values they accept. An example of this is Modif-E, which only
accepted -, <E> or the explicit value of the modifier. Now, it also accepts +, which means that the complement can
have any optional modifier to its left. Whenever mandatory, that modifier needs to
be explicit in the matrix.
Each column represents a restriction on a rule and, by default, each column has a type of dependency
or a pre-defined POS. The POS of the word connected by that dependency may be altered using a prefix
on that word, as shown in Table 3.1.
The information provided by these POS may be enriched using flags, appended to the POS in the fol-
lowing way: <POS:FLAGS>. For example, a possessive pronoun determinant, feminine and plural,
is written as <DET+POS:fp>. The currently defined flags are m/f for masculine/feminine gender, s/p
for singular/plural number, and O for oblique (personal pronouns).
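The `<POS[:FLAGS]>` notation described above is regular enough to parse mechanically. The following is a minimal sketch of such a parser (a hypothetical helper, not part of STRING), assuming only the tag shape and the five flags described in the text:

```python
import re

# Flags defined in the matrix notation: m/f gender, s/p number, O oblique.
FLAG_NAMES = {"m": "masculine", "f": "feminine",
              "s": "singular", "p": "plural", "O": "oblique"}

def parse_pos_tag(tag: str):
    """Split a <POS[+SUBPOS][:FLAGS]> tag, e.g. <DET+POS:fp>,
    into its POS parts and its expanded flags."""
    m = re.fullmatch(r"<([A-Z+]+)(?::([mfspO]+))?>", tag)
    if m is None:
        raise ValueError(f"not a POS tag: {tag!r}")
    pos_parts = m.group(1).split("+")
    flags = [FLAG_NAMES[c] for c in (m.group(2) or "")]
    return pos_parts, flags

parts, flags = parse_pos_tag("<DET+POS:fp>")
# parts == ['DET', 'POS'], flags == ['feminine', 'plural']
```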
Next, prefixes allow changing the type of dependency connection, or adding special features.
One example of this happens with the word retomar. It should be indicated as the word (verb)
Table 3.1: XIP syntax for POS

General
  No lexical element: <E>
  Surface: prince
  Compound expression: “prince charming”
  Lemma: <prince>
  Two options of a lemma for the same entry: (<prince> + <princess>)
  Two options of a surface for the same entry: (prince + princess)

Parts-of-speech and inflection tags
  One determinant: <DET>
  A possessive pronoun determinant: <DET+POS>
  A possessive pronoun determinant, feminine and plural, followed by a word (e.g. próprias): <DET+POS:fp> “próprias”
  A demonstrative pronoun determinant: <DET+DEM>
  A possessive pronoun: <PRON+POS>
  A personal pronoun: <PRON+PES>

Conventions used for the POS recognized at the moment
  Determinant: DET
  Adjective: A
  Adverb: ADV
  Ordinal, cardinal or quantity: Q
  Preposition: PREP
tomar with a prefix (re-). In order to force the existence of that prefix, the entry should be written as
PFX:<tomar>.
By default, the system contains the following features:
• MOD: modifier
• CDIR: direct complement
• CIND: indirect complement
• PREDSUBJ: subject’s predicate
• PFX: prefixed word
Before serving as input to the entire system, this matrix is validated, as described in Section 2.2. One
important aspect of this implementation is that the names of the columns are pre-defined, so that the
developed program can automatically identify the matrix pattern, that is, associate a column to its value
using the column’s name, regardless of its position. This allows the columns to appear in any order.
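This order-independent lookup amounts to reading the header row once and indexing every cell by its column name. A minimal sketch of the idea, using hypothetical column names from the matrix:

```python
import csv
import io

def rows_by_column_name(csv_text: str):
    """Yield each matrix line as a dict keyed by column name, so a value
    is always found by its column's name, regardless of column order."""
    yield from csv.DictReader(io.StringIO(csv_text))

# The same data with shuffled columns resolves to the same values.
matrix = "V,Det1,C1\nvirar,o,bico\n"
shuffled = "C1,V,Det1\nbico,virar,o\n"
row_a = next(rows_by_column_name(matrix))
row_b = next(rows_by_column_name(shuffled))
assert row_a["V"] == row_b["V"] == "virar"
```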
3.2 Xipificator
The Xipificator is the generic designation for the container of a set of three internal modules, as
seen in Figure 3.2. It takes as input the lexicon-syntactic matrix, an XLSX file, and starts by converting
it to a CSV file. This file then serves as input to two modules developed in the scope of this solution:
Rule Generation and Example Generation.
The first module processes the CSV file and outputs a set of XIP rules to serve as input to the
STRING system, and a text file containing each sentence, either manually produced or automatically
generated, the class it belongs to, and its expected output. The second module uses the CSV file to
output a text file containing a set of sentences: the ones manually inserted in the matrix and some
that were artificially generated from those, containing the transformations each construction allows.
After STRING processes these examples, the output is written into a text file. A final module, the
example validator, compares the output against what was expected, showing the percentage of
correctly identified frozen sentences. Each module is described in the following subsections, starting
with the external converter module.
3.2.1 Converter
This module converts XLSX files to CSV files, readable by the Rule Generation and Example Gener-
ation modules inside the Xipificator. The converter was re-written from the existing one into a Python
module, for integration purposes. It takes as input the XLSX file containing the lexical description ma-
trix, opens it, and transforms each of its cells into a comma-separated value. The resulting CSV file is
then used by the remaining two internal modules in order to generate the rules and the examples,
which are written into two different files.
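The cell-to-value conversion can be sketched as follows, assuming the workbook rows have already been read into lists of cell values (a library such as openpyxl would provide them). This is a hypothetical simplification, not the actual converter; it shows the one detail a naive comma join would get wrong, namely quoting cells that themselves contain commas:

```python
import csv
import io

def cells_to_csv(rows):
    """Serialize rows (lists of cell values, as read from the workbook)
    into CSV text, quoting cells containing commas so that values such
    as 'virar, voltar' survive the conversion intact."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        # Empty cells come back as None from most XLSX readers.
        writer.writerow(["" if cell is None else str(cell) for cell in row])
    return buf.getvalue()

csv_text = cells_to_csv([["V", "C1"], ["virar", "bico"]])
```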
Figure 3.2: Structure of the xipificator
3.2.2 Rule Generation
This module receives as input the CSV file and outputs an XIP file containing the XIP rules generated
from that CSV, represented in Figure 3.2 as ’dependencyFPhrase.xip’, as well as a TXT file, represented in
Figure 3.2 as ’expected.txt’, containing each sentence (manually produced and automatically generated),
the class it belongs to, and its expected output. The latter will later be used by the example validator.
The CSV file is read into the module, which creates an internal structured representation of each of its
lines and their corresponding values. This encapsulates the information inside the program, so that it
is no longer necessary to access external files.
A simplified schema of how the rule generation is performed is shown in Figure 3.3.
Figure 3.3: A schematic representation of the process of generating rules.
The process of generating the rules is complex, given that each line of the matrix is associated with a
corresponding XIP rule, and each possible column value contributes a restriction to that rule. The
module starts by writing the example corresponding to the line just read. Then, it verifies whether the
rule is manually produced. If so, the rule is read from the column Manual. If not, the rule for that
sentence is generated according to the values of the lexicon-syntactic matrix. In case any transformation
can be applied to that sentence, the restrictions associated with that transformation are added to the
rule. If no transformation is applicable, the module writes the rule and the expected value to be
extracted by XIP for that sentence.
The translation of each property takes the form of the corresponding dependency, where each vari-
able corresponds to the name of the column. Regardless of the canonical number of the constituent (0,
1, 2, 3 or 4), all properties, binary or lexical, are translated the same way. Table 3.2 presents the
translation of each column to a XIP restriction. ?V represents the verb of the frozen construction, C1 is
the fixed noun of the first complement, and so on.
Table 3.2: XIP translation for each column
General
N0=Nhum SUBJ(?V,?[UMB-Human])
N0=N-hum SUBJ(?V,?[UMB-Human:∼])
N1=Nhum MOD[post](?V,?[UMB-Human]) or CDIR[post](?V,?[UMB-Human])
N1=N-hum MOD[post](?V,?[UMB-Human:∼]) or CDIR[post](?V,?[UMB-Human:∼])
Vse CLITIC(?V,[ref])
Vc VLINK(?V,?Vc)
NegObrig MOD[neg](?V,?)
V VDOMAIN(?V,?)
Adv1 MOD[post](?V,[adv])
Modif1-E MOD[pre](?V,?C1)
Modif1-D MOD[post](?V,?C1)
Prep1 PREPD(?C1,?Prep1)
Det1 DETD(?C1,?Det1)
C1 MOD[post](?V,?C1)
PronA1 CLITIC(?V,?[acc])
PronR1 CLITIC(?V,?[ref])
PronD1 CINDIR(?V,?) & CLITIC(?V,?[dat])
PronPos2 POSS(?C2,?)
Pass-ser VDOMAIN(lema[pass-ser],?V)
Pass-estar VDOMAIN(lema[pass-ser:∼],?V)
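Table 3.2 amounts to a mapping from column names to restriction templates, instantiated with the rule's variables and the column's value. A simplified sketch (templates abbreviated to a few columns; the template strings and helper are hypothetical, not the generator's actual data structures):

```python
# Each matrix column maps to a XIP restriction template; {v} stands for
# the verb variable and {c1} for the first fixed complement's variable.
TEMPLATES = {
    "N0=Nhum": "SUBJ({v},?[UMB-Human])",
    "Vse":     "CLITIC({v},[ref])",
    "Det1":    "DETD({c1},?[surface:{value}])",
    "Prep1":   "PREPD({c1},?[surface:{value}])",
}

def restriction(column: str, value: str = "") -> str:
    """Instantiate the XIP restriction for one column of the matrix."""
    return TEMPLATES[column].format(v="?V", c1="?C1", value=value)

r = restriction("Det1", "o")
# → 'DETD(?C1,?[surface:o])'
```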
The dependency representation in the form of variables is organized as follows:
• #? – Free variable
• #1 – Subject
• #2 – Verb
• #3 – First complement
• #4 – Second complement
• #5 – Third complement
• #6 – Fourth complement
The subject’s representation is SUBJ(?V,?[UMB-Human]), where the feature UMB-Human determines
whether the subject is human or not (SUBJ(?V,?[UMB-Human:∼])). The verb is defined by the de-
pendency VDOMAIN. This dependency captures the first and the last verb of a verb chain consisting of
one or several auxiliary verbs and a main verb (the last in the chain). Each complement is marked as
CDIR if it does not have a preposition, or as MOD if it does [1]. Determinants and pre- or post-modifiers
are both connected to the constituent’s head. In case one of the modifier columns contains the value -,
the corresponding dependency is accepted but optional. However, if it contains the <E> entry, it is
considered that there is no dependency of that type. Below, an example of the step-by-step generation
process of a rule is presented, for the frozen sentence O João virou o bico ao prego, lit: ‘João turned the
tip to the nail’, ‘to betray’ (class C1P2), which is depicted with its constituents in Figure 3.4.
Figure 3.4: A frozen sentence and the heads of its constituents.
After each dependency tag is associated to a column/column value, the system starts generating
the if() structure of the rule. First, it prints a structure of a dependency link for a verb, and searches
recursively for the elements of a dependency, generating their dependency links, until it reaches the last
dependency.
1. The V column is converted into the restriction VDOMAIN(#?,#2[lemma:virar]);
2. The first complement, column C1, is encoded as CDIR[post] (the post flag refers to the post-verbal
position), because there is no preposition associated with this complement:
CDIR[post](#2,#3[surface:bico]). So the XIP rule evolves into:
if ( VDOMAIN(#?,#2[lemma:virar]) &
CDIR[post](#2,#3[surface:bico]) &
...
)
[1] At this current stage of parsing no distinction is made yet between essential (argument) complements
and adjuncts, so the MOD dependency functions as an umbrella for both cases.
3. The next restriction to be encoded is Det1, which produces the restriction DETD(#3,?[surface:o]).
The rule evolves to:
if ( VDOMAIN(#?,#2[lemma:virar]) &
CDIR[post](#2,#3[surface:bico]) &
DETD(#3,?[surface:o]) &
...
)
4. The second complement, C2, is then translated as MOD[post](#2,#4[surface:prego]). It is im-
portant to notice that its head is connected to the verb, instead of to the previous complement, which
is explicitly marked in the matrix by the property AttachV.
if (VDOMAIN(#?,#2[lemma:virar]) &
CDIR[post](#2,#3[surface:bico]) &
DETD(#3,?[surface:o]) &
MOD[post](#2,#4[surface:prego]) &
...
)
5. Next, Prep2 is encoded as PREPD(#4,?[surface:a]).
The XIP rule now becomes:
if ( VDOMAIN(#?,#2[lemma:virar]) &
CDIR[post](#2,#3[surface:bico]) &
DETD(#3,?[surface:o]) &
MOD[post](#2,#4[surface:prego]) &
PREPD(#4,?[surface:a]) &
...
)
6. Finally, the last column to be encoded as a restriction is Det2, producing DETD(#4,?[surface:o]).
This results in the rule:
if (VDOMAIN(#?,#2[lemma:virar]) &
CDIR[post](#2,#3[surface:bico]) &
DETD(#3,?[surface:o]) &
MOD[post](#2,#4[surface:prego]) &
PREPD(#4,?[surface:a]) &
DETD(#4,?[surface:o])
)
Finally, to allow for easier reading and correction, the rule is represented in the rules file as:
//========================================================
// Example: O João virou o bico ao prego
//========================================================
if ( VDOMAIN(#?,#2[lemma:virar]) &
CDIR[post](#2,#3[surface:bico]) &
DETD(#3,?[surface:o]) &
MOD[post](#2,#4[surface:prego]) &
PREPD(#4,?[surface:a]) &
DETD(#4,?[surface:o])
)
FIXED(#2, #3, #4)
////ORIGINAL O João virou o bico ao prego
////EXPECTED FIXED(virar, bico, prego)
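The step-by-step assembly above reduces to joining the accumulated restrictions with `&` and wrapping them in the `if (…) FIXED(…)` skeleton. A minimal formatting sketch (hypothetical helper; it renders a rule in the layout shown, without the generator's column logic):

```python
def build_rule(example: str, restrictions: list[str], args: list[str]) -> str:
    """Render one XIP rule: a comment header with the example sentence,
    the restrictions joined by '&', and the FIXED dependency to extract."""
    bar = "//" + "=" * 56
    header = f"{bar}\n// Example: {example}\n{bar}"
    body = " &\n     ".join(restrictions)
    fixed = "FIXED(" + ", ".join(args) + ")"
    return f"{header}\nif ( {body}\n)\n{fixed}"

rule = build_rule(
    "O João virou o bico ao prego",
    ["VDOMAIN(#?,#2[lemma:virar])",
     "CDIR[post](#2,#3[surface:bico])",
     "DETD(#3,?[surface:o])"],
    ["#2", "#3", "#4"],
)
```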
The following step is to integrate the rules in the XIP dependencies file. When running the afore-
mentioned example on STRING, each variable of the rule is instantiated according to the performed
analysis of the elements of the sentence, obtaining the following dependencies:
VDOMAIN(virou,virou)
CDIR[post](virou,bico)
DETD(bico,o)
MOD[post](virou,prego)
PREPD(prego,a)
DETD(prego,o)
These dependency rules will then be compared against those found in the output provided by XIP:
MAIN(virou)
DETD(João,O)
DETD(bico,o)
DETD(prego,o)
VDOMAIN(virou,virou)
MOD_POST(virou,prego)
SUBJ_PRE(virou,João)
CDIR_POST(virou,bico)
Given that the elements of the generated rule are present in the output, and therefore the restrictions
are satisfied, the FIXED dependency is extracted as FIXED(virou,bico,prego).
A simplified rule generation example per class is presented in Appendix A.
Whenever a transformation may be applied to a sentence, two things happen:
• The Example Generation module automatically generates a sentence containing the transforma-
tion(s);
• The restrictions relative to that transformation are added to the rule or, in case the transformation
is either [Pass-ser] or [Pass-estar], no restrictions are added to the base sentence rule and,
instead, a new rule is generated for the sentence after the transformation has been performed.
The general restrictions for each transformation are described in Table 3.3; the passive form, however,
requires more work. First, the verb itself has to be encoded in a different way. Then, a conversion of
the constituents from the base sentence to the passive form is also performed: whenever a sentence
contains either a direct complement or a post-modifier, it becomes the subject in the new rule,
generated to represent the passive form of that sentence. In the example O Rui deixou a Inês em paz,
lit: ‘Rui left Inês in peace’, ‘to leave someone alone’, Inês plays the role of direct complement,
CDIR[post](#2,#3[UMB-Human,UMB-Human:∼]). However, when transforming the sentence to the
passive form, this constituent becomes the subject, SUBJ(#2,?). The conversion is performed using a
table that maps each element of the active form to its passive-form counterpart, containing, for now,
only the elements CDIR and MOD[post].
Table 3.3: Restrictions to be added to the rule of the base sentence
[PronA]      ( CDIR[post](?V,?C1[UMB-Human]) || CLITIC(?V,?[acc]) || CLITIC(?V,?[ref]) )
[PronR]      ( CDIR[post](?V,?C1[UMB-Human,UMB-Human:∼]) || CLITIC(?V,?[acc]) || CLITIC(?V,?[ref]) )
[PronD]      ( ( MOD[post](?V,?C1[UMB-Human]) & PREPD(?C1,?) ) || ( CINDIR(?V,?) & CLITIC(?V,?[dat]) ) )
[PronPos]    ( ( MOD[post](?C2,?C3[UMB-Human]) & PREPD(?C3,?) ) || POSS(?C1,?) )
[RDat]       ( ( MOD[post](?C2,?C3[UMB-Human]) & PREPD(?C3,?) ) || CLITIC(?V,?[dat]) )
[Pass-Ser]   VDOMAIN(#?,#2[pass-ser,?V])
[Pass-Estar] VDOMAIN(#?,#2[pass-ser:∼,?V])
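The active-to-passive constituent conversion described above can be sketched as a small mapping applied to the dependency name of each restriction. This is a hypothetical helper, covering only the two elements the text says are converted so far (CDIR and MOD[post] become the subject):

```python
# Active-form dependency -> its role in the passive-form rule.
ACTIVE_TO_PASSIVE = {
    "CDIR[post]": "SUBJ",
    "MOD[post]":  "SUBJ",
}

def passivize(restriction: str) -> str:
    """Rewrite one restriction for the passive-form rule, leaving
    dependencies without a passive counterpart untouched."""
    name = restriction.split("(", 1)[0]
    return ACTIVE_TO_PASSIVE.get(name, name) + restriction[len(name):]

p = passivize("CDIR[post](#2,#3[UMB-Human])")
# → 'SUBJ(#2,#3[UMB-Human])'
```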
A configuration file has been created to make it possible to determine which restrictions are to be
applied to the generated rule. The controllable restrictions are determinants, prepositions and
modifiers, both to the left and to the right of the frozen head noun, and any distributional constraints
on any of the free complements/subject. This file also contains the column numbers that correspond to
each element to be encoded in the rules. This gives the user total freedom to choose how restrictively
the rules are to be applied, which, in turn, makes the system more flexible.
The output of this module consists of two files: one containing the original sentence as well as the
rule that describes it, which will be used by STRING to extract the FIXED dependency; and another,
containing the sentence, the class it belongs to, and the expected output for that sentence.
3.2.3 Example Generation
Using the data structures created from the CSV file, this module outputs a file containing the arti-
ficially generated sentences, represented in Figure 3.2 as ’examples.txt’. These sentences are generated
from the information encoded for the corresponding manually produced sentence, which was produced
by linguists trying to capture the basic structure and distribution of the frozen sentences. The artificially
generated sentences correspond to the forms produced by applying the transformations accepted by a
given construction, as encoded in the matrix.
The generation starts by verifying, for each sentence, whether its description contains a positive value
for any of the columns [PronR1], [PronA1], [PronPos2], [Rdat1], [Rdat2], [PronD1], [PronD2],
[Pass-estar] or [Pass-ser]. If so, the system reads, from the description of that sentence, each of its
constituents. Using this, it generates a new sentence (one sentence per transformation), containing the
mandatory complements after the changes required by each transformation have been applied.
Although, in an initial phase, these new sentences were written alongside the corresponding base
sentence in a text file, it was later decided that they should be written into separate files, according to
the type of transformation applied to them. This allowed for an easier manual verification and
validation of the obtained sentences for each transformation. They are run on STRING and later
validated separately from the base sentences, which allows for a clearer distinction of the system’s
performance for each type of sentence, manually produced or automatically generated.
The common mechanism for generating each sentence is described in Figure 3.5. Next, the generation
process for each transformation is detailed.
The verbs in the active form are read from the column V, in the infinitive form. Their third person,
singular, present tense conjugation is read from a file, ’Verb3s.txt’, previously generated by ViPEr. The
verbs in the passive form are handled differently, because this form requires the auxiliary ser or estar
to be added before the main verb. So the main verb is read from the column V, in the infinitive form,
and its past participle form is read from a file, ’VerbVpp.txt’, previously generated by ViPEr.
In order to generate the subject, and any complement not explicit in a C column, the columns N$=Nhum
and N$=N-hum are verified, in order to determine whether that constituent is human or non-human. If
it is human, a name is chosen randomly from a list of names. If it is not human, a generic noun, Isso,
lit: ‘that’, is used in the generation of the sentence. A description of how the sentences were generated
for each transformation follows:
Generating reflexive pronoun sentences
After the subject is set and the verb is read, the latter is rewritten by adding the suffix ’-se’
to it. The following complements, determinants and prepositions are written after the verb. So,
using the description of the sentence O Pedro reduziu a Ana à sua insignificância, lit: ‘Pedro
reduced Ana to her insignificance’, the system generates the sentence: O Pedro reduziu-se à sua
insignificância, lit: ‘Pedro reduced himself to his insignificance’.
Figure 3.5: General mechanism for generating example sentences
Generating accusative pronoun sentences
After verifying the gender of the first complement, to be replaced by the accusative pronoun (and
therefore not written in the sentence), the suffix ’-a(s)’ or ’-o(s)’, ‘him/her’, is added to the verb, if
regular, re-writing it. However, irregular verbs, namely those ending in ’z’, had to be processed in a
specific way in order to obtain the correct third person, singular, present tense conjugation. There are
two situations:
• The verb trazer, lit: ‘to bring’, whose required conjugation is traz, lit: ‘brings’, has its last two
letters replaced by á-lo, so traz is transformed into trá-lo. If this had not been implemented, the
transformation would turn the verb into traz-o, which is incorrect.
• Any other verb ending in ’z’ has this letter replaced by -lo. So, for example, conduz
is transformed into condu-lo. If this had not been implemented, the transformation would
re-write the verb into conduz-o, which is incorrect.
The following complements, determinants and prepositions are written after the verb. So from the
description of the sentence O João tirou o Pedro da lama, lit: ‘João took Pedro out of the mud’, ‘to help
someone get out of a complicated situation’, the system generates O João tirou-o da lama, lit: ‘João took
him from the mud’.
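The clitic spelling rules above (regular verbs take '-o/-a'; traz becomes trá-lo; other z-final verbs drop the 'z' before '-lo') can be sketched as a small function. This is a hypothetical helper restricted to the cases discussed in the text, not a general Portuguese clitic attacher:

```python
def attach_accusative(verb: str, gender: str) -> str:
    """Attach the accusative clitic to a 3rd-person-singular verb form,
    following the spelling rules for z-final verbs described above."""
    if verb == "traz":            # special case: traz -> trá-lo / trá-la
        return "trá-lo" if gender == "m" else "trá-la"
    if verb.endswith("z"):        # conduz -> condu-lo, not *conduz-o
        return verb[:-1] + ("-lo" if gender == "m" else "-la")
    return verb + ("-o" if gender == "m" else "-a")

assert attach_accusative("tirou", "m") == "tirou-o"
assert attach_accusative("traz", "m") == "trá-lo"
assert attach_accusative("conduz", "m") == "condu-lo"
```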
Generating dative pronoun sentences
There are two types of dative pronoun transformations. The first one is relative to the first com-
plement, and it happens whenever [PronD1] is marked positive. After the subject is set and the
verb is read, the suffix ’-lhe’, ‘to him/her’, is added to the verb, re-writing it. Because this replaces
the first complement, the latter is not written in the generated sentence. This way, from the description
of the sentence A sorte bateu ao Pedro, lit: ‘Luck hit to Pedro’, ‘Pedro was lucky’, the system generates
A sorte bateu-lhe, ‘Luck hit him’.
The second type is relative to the second complement, and it occurs whenever the entry [PronD2]
is marked as positive. The suffix ’-lhe’ added to the verb here replaces the second complement.
So from the description of the sentence O João deve favores ao Pedro, lit: ‘João owes favours to
Pedro’, the system generates the sentence O João deve-lhe favores, lit: ‘João owes him favours’. The
following complements, determinants and prepositions are written after the verb.
Generating possessive sentences
In case [PronPos2] is marked as positive, the word seu(s) or sua(s), lit: ‘his’ or ‘her’, according
to the second complement’s gender, is added ahead of the first complement. Because this replaces
the second complement, the latter is not written in the generated sentence. For example, when reading
the description of the sentence A sorte bateu à porta do Pedro, lit: ‘Luck hit on the door of Pedro’,
‘Pedro was lucky’, the complement do Pedro, lit: ‘of Pedro’, is ignored and the generation process
replaces it with sua, lit: ‘his’. Therefore the generated sentence is A sorte bateu à sua porta, lit:
‘Luck hit on his door’.
Generating dative restructured sentences
There are two types of dative restructuring transformations, but both transform a de_Nhum into
a_Nhum. The first one is relative to the second complement, and it happens whenever [Rdat2] is
marked positive. The suffix ’-lhe’ here replaces the second complement, while keeping the first.
Therefore, using the description of the sentence A sorte bateu à porta do Pedro, lit: ‘Luck hit to the
door of Pedro’, ‘Pedro was lucky’, the system generates A sorte bateu-lhe à porta, lit: ‘Luck hit on
his door’. The second type is relative to the third complement, and it occurs whenever the entry
[Rdat3] is marked as positive. The suffix ’-lhe’ here replaces the third complement, generating,
from the description of the sentence O João entregou o livro em mãos ao Pedro, lit: ‘João delivered
the book in hands to Pedro’, the following: O João entregou-lhe o livro em mãos, lit: ‘João delivered
him the book in hands’. This transformation is mutually exclusive with [PronD], so whenever one
is marked as positive the other cannot be positive as well.
Generating passive sentences
Two types of passive have been considered, namely the one with the auxiliary verb ser and the
one with the verb estar (and its variants, especially ficar, ‘to stay’, and continuar, ‘to continue’; all
correspond to English ‘to be’, the difference being only aspectual). As for the passive transformation,
a positive value in the column [Pass-ser] causes the system, using the description of the example
O Rui arrastou o nome da Rita pela lama, lit: ‘Rui dragged the name of Rita through the mud’, to
generate Isso foi arrastado pela lama, lit: ‘This was dragged through the mud’. The reason for replacing
the constituent nome da Rita, lit: ‘name of Rita’, with isso, lit: ‘this’, is that the description only demands
a non-human noun in that position; therefore nome, lit: ‘name’, can be replaced by a generic
non-human noun, and da Rita, lit: ‘of Rita’, becomes unnecessary in the generated sentence.
As for the passive transformation using the verb estar, ‘to be’, [Pass-estar], the description of
the example O João controla o Pedro com rédea curta, lit: ‘João controls Pedro with short reins’,
‘to very rigorously control someone’, generates O Fernando está controlado com rédea curta, lit:
‘Fernando is controlled with short reins’.
The automatic generation of these sentences went through several iterations, as the results were
manually verified by a linguist. The criteria for evaluating them always considered the characteristics
of XIP and its restrictions.
The generation process proved very important for the manual validation, by a linguist, of the values
present in the matrix. It also allows for the detection of a set of restrictions that might not be
represented in the matrix, given that their properties may not have been studied yet (for example, the
tense and mood of the verbs).
3.2.4 Example Validation
The example validator receives as input the output generated by STRING, written in an XML file.
From it, the validator extracts what was effectively obtained and builds a text file with this informa-
tion: the sentence, what was expected, and the obtained result, represented in Figure 3.2 as ’output.txt’.
Given that the rule generator already provides the system with what is expected, the next step is to
compare the two: what was expected and what was obtained.
The validator considers three criteria in order to evaluate a success, and "how much" of a success the
detection was, as seen in Figure 3.6. These criteria range from the least specific to the most specific, in
the following order:
1. Checking whether the FIXED dependency was extracted;
2. Checking whether the number of arguments of that dependency matches the expected number of
arguments;
3. Asserting that the arguments of that dependency match the expected arguments.
The result of processing each sentence through STRING is an XML file with each sentence represented
as an LUNIT. Each of these LUNITs is parsed until the FIXED dependency is found. When it is found,
the arguments of this dependency are parsed, and their lemmas extracted. The element with index 0
is the verb, and all the other indexes correspond to the remaining constituents that are part of that
dependency. The obtained FIXED dependency is then rebuilt from the output, until something such as
FIXED(0, 1, 2, 3...) is obtained (with 0 being the verb, and 1, 2 and 3 the remaining arguments).
In case no FIXED dependency is extracted, the parser returns "FAILED".
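The three criteria can be checked in order of increasing specificity. A minimal sketch of the comparison (hypothetical helper; expected and obtained dependencies are represented as (verb, *arguments) tuples, with None meaning no FIXED was extracted):

```python
def validate(expected, obtained):
    """Compare an expected FIXED dependency against the obtained one,
    applying the three criteria from least to most specific."""
    if obtained is None:                    # 1. was FIXED extracted at all?
        return "FAILED"
    if len(obtained) != len(expected):      # 2. right number of arguments?
        return "WRONG_ARG_COUNT"
    if tuple(obtained) != tuple(expected):  # 3. exactly the expected arguments?
        return "WRONG_ARGS"
    return "OK"

assert validate(("abafar", "escândalo"), ("abafar", "escândalo")) == "OK"
assert validate(("dizer", "missa"), ("dizer",)) == "WRONG_ARG_COUNT"
assert validate(("combater", "moinhos de vento"), None) == "FAILED"
```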
After the example validation is concluded, an output report is generated by the system, containing
each sentence, its expected value and its result value. In case the FIXED dependency is not extracted at
all, an "X" is written at the beginning of the sentence. This allows for an easier regex isolation of the
failed sentences, and for more efficient problem solving:
Figure 3.6: Example validation criteria
X - Sentence: O João combate moinhos de vento.
Expected value: FIXED(combater, moinhos de vento)
Result value: FAILED
In case the dependency FIXED is extracted but the number of arguments is not the same, the follow-
ing is written to the output file:
Sentence: O Padre disse a missa.
Detected FIXED with wrong number of arguments.
Expected value: FIXED(dizer, missa)
Result value: FIXED(dizer)
As for the case where the dependency FIXED was extracted, the number of arguments is correct, and
the arguments are exactly the ones expected, this is the output:
Sentence: A imprensa abafou um escândalo.
The arguments are the same.
Expected value: FIXED(abafar, escândalo)
Result value: FIXED(abafar, escândalo)
This means that the expected extracted dependency is FIXED(abafar, escândalo), and that was
exactly what was extracted. Therefore, the validator considers this a successful detection under all
three criteria, and adds it to the number of correctly identified frozen sentences. Although initially
the extraction of the FIXED dependency was enough to consider a case successful, the system evolved
to counting the number of arguments of the FIXED dependency and checking whether it matched the
expected number, and finally to evaluating whether the arguments are exactly the same.
The output file is a report that shows the percentage of correctly identified sentences separated per
class, as well as the global percentage for all sentences.
The statistics for each class are presented as a header of all the sentences belonging to that class.
Below, the statistics for the manually produced sentences belonging to class C1 are presented:
--------------------------
IDENTIFIED 497 OUT OF 500 SENTENCES, 3 MISSING
STATS FOR CLASS C1: 0.994; 0.006000000000000005 MISSING
IDENTIFIED 495 OUT OF 500 SENTENCES, 5 MISSING
STATS FOR CLASS - NUMBER OF ARGUMENTS: 0.99; 0.010000000000000009 MISSING
IDENTIFIED 486 OUT OF 500 SENTENCES, 14 MISSING
STATS FOR CLASS - ARGUMENTS: 0.972; 0.028000000000000025 MISSING
--------------------------
The global percentage of identified sentences, for all three criteria, is presented at the bottom of each
file:
--------------------------
IDENTIFIED AS FIXED 2430 OUT OF 2542 SENTENCES, 112 MISSING
TOTAL STATS FOR FIXED: 0.955940204563336; 0.04405979543666405 MISSING
IDENTIFIED 2401 OUT OF 2542 SENTENCES, 141 MISSING
TOTAL STATS FOR NUMBER OF ARGUMENTS: 0.9445318646734855; 0.05546813532651451 MISSING
IDENTIFIED 2337 OUT OF 2542 SENTENCES, 205 MISSING
TOTAL STATS FOR ARGUMENTS: 0.9193548387096774; 0.08064516129032262 MISSING
--------------------------
The system was automated using a makefile that runs all the modules. It starts with the rule gener-
ator module, replacing the previous rule set on XIP with the generated ones. It also runs the Example
Generator, and all the examples are run through STRING. The results are then retrieved and put through
the validator, which then outputs the report. It takes six and a half minutes for the system to perform
all these tasks.
This script may be found in Appendix B.
3.3 Improvements
When comparing the developed solution with the previously existing one, mainly by observing Fig-
ure 3.1, several important aspects may be pointed out:
1. The implementation of automatic generation of examples for the passive form and pronominal-
ization is a very important feature, because it allows a variation of the same sentence to be
recognized;
2. The fact that the sentences are not run one at a time, but in a single file instead, allows for a
significant reduction of the time it takes to obtain results. In the previous system, processing
2,542 sentences would take around 18 hours, while the developed one takes six and a half minutes;
3. The new example validator allows for a more detailed detection of frozen sentences and errors. In
the previous system, the only factor taken into account was whether the FIXED dependency had
been extracted or not. Now, the three different criteria for validation, combined with the output
report, allow for a more precise detection of errors in the generation.
Chapter 4
Evaluation
This chapter describes the evaluation process and methods used in this work. It starts by de-
scribing the structure of the corpus to be evaluated, and the methods used to evaluate it. Then,
the results are presented, followed by an analysis of the results obtained after processing
this corpus. Finally, a comparison between the new solution and the previously existing one, in
terms of the number of frozen sentences detected, is performed.
Despite the multiple iterations the system went through in order to further improve its results, the
system had to be frozen at some point, so that the results could be registered. This was done on
version 23, May 2019.
The system is initialized by running a makefile, which takes around six and a half minutes to finish
its execution. This time includes generating the rules and examples, running all the examples through
STRING, and validating the obtained results.
4.1 Analysing the corpus
The corpus to be evaluated was divided into two parts. The first one contains all the manually produced
sentences, or base sentences, that is, 2,542 sentences extracted from the matrix; the second one is a set
of 1,173 sentences artificially generated from the description of the base sentences, considering the
transformations encoded in the lexicon-grammar matrix. The distribution of generated sentences per
class only considers the entries accepting these transformations. These sentences were manually veri-
fied by a linguist, and they played a big part in the correction and improvement of the lexicon-grammar,
because it is required that their corresponding base sentences’ lexical description is clear and correct.
Each transformation’s distribution per class, as well as the distribution of the manually produced sen-
tences per class can be observed in Table 4.1. The set of artificially generated sentences was broken down
by transformation, so that each could be evaluated separately. This allows a system’s performance eval-
uation per transformation, rather than evaluating the performance of all the generated sentences.
The joint set corresponds to a total of 3,715 frozen sentences, grouped into classes, as described in Chapter
2, and the evaluation was performed not only globally, but also per class.
Table 4.1: Sentence distribution per class.
Class # Manual # [PronR] # [PronA] # [PronD] # [RDat] # [PronPos] # [PassSer] # [PassEstar]
C1 500 0 0 0 0 0 3 3
C0-E 1 0 0 0 0 0 0 0
CDN 45 0 0 0 0 34 0 0
CAN 182 0 0 0 181 178 0 0
CNP2 172 18 172 0 0 0 169 73
C1PN 259 0 0 138 4 3 3 3
C1P2 291 0 0 0 0 0 0 0
CPPN 46 4 15 6 9 3 10 4
CPP 181 0 0 26 0 4 0 0
CP1 662 0 0 0 0 0 0 0
CPN 103 0 0 0 2 96 0 0
C0 21 0 2 5 2 2 0 0
CADV 70 0 0 0 0 0 0 0
CV 13 0 0 0 0 0 0 0
TOTAL 2,542 22 189 176 198 320 185 83
Observing Table 4.1, it should be noted that these classes do not have the same degree of lexical
coverage, as the collection for some of them is still ongoing or has only recently started1. Despite this,
the results for all the classes will be shown.
Class CP1 is the most significant when considering the set of base sentences, representing around 26% of
the total amount of sentences. It is followed by C1, which represents around 20% of this set. The least
representative class is C0-E, containing only one entry. Classes C0, CADV, CDN, CPPN and CV are not very
numerous, each of them containing less than 100 sentences.
The transformation with the broadest distribution within the lexicon-grammar matrix is the possessive
pronominalization ([PronPos]), corresponding to 27% of the total number of generated sentences.
At the other end, the reflexive pronominalization ([PronR]) corresponds to only 1,8% of the generated
sentences.
[PronR] and [PronA] occur more regularly in class CNP2, due to the pronominalization of its fixed
direct complement. [PronD] occurs mainly in class C1PN, by pronominalizing its free prepositional
complement. [RDat] and [PronPos] occur more frequently in class CAN, because its free determinative
complement may either undergo a dative restructuring or be reduced to a possessive pronoun.
[PassSer] and [PassEstar] occur more regularly in class CNP2, whose free direct complement becomes
the subject. Although the number of artificially generated sentences is half the number of manually
produced sentences, these examples are of significant importance because they are variations of the
base sentences, and they may appear in texts replacing the base sentences.
1 These are mainly classes C0, C0-E, CADV and CV.
4.2 Evaluation method
The evaluation was performed following three criteria:
1. Checking whether the FIXED dependency was extracted;
2. Checking whether the number of arguments of that dependency matches the expected number of
arguments;
3. Asserting that the arguments of that dependency match the expected arguments.
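For illustration only, the three criteria can be sketched as successive checks on an extracted dependency. The function name and data shapes below are hypothetical, not the actual validator's API; a dependency is represented here simply as the tuple of its arguments, with None meaning no FIXED dependency was extracted.

```python
# Hypothetical sketch of the three validation criteria. An extracted FIXED
# dependency is modelled as a tuple of its arguments, e.g. ("ir", "cara"),
# and None when no dependency was extracted at all.

def validate(extracted, expected):
    """Return a (criterion1, criterion2, criterion3) triple of booleans."""
    # Criterion 1: was the FIXED dependency extracted at all?
    dependency_extracted = extracted is not None
    # Criterion 2: does the number of arguments match the expected one?
    same_arity = dependency_extracted and len(extracted) == len(expected)
    # Criterion 3: are the arguments exactly the expected ones?
    same_args = same_arity and tuple(extracted) == tuple(expected)
    return dependency_extracted, same_arity, same_args

# Example discussed later in the text: the rule expects FIXED(ir, cara) but
# the validator builds FIXED(ter, cara) -> extracted, same arity, wrong args.
print(validate(("ter", "cara"), ("ir", "cara")))  # (True, True, False)
```

Each criterion subsumes the previous one, which is why the counts in the result tables can only decrease from left to right.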
These three criteria were defined so that there is a notion of how exact the extraction of the FIXED
dependency was, answering the question 'was the dependency extracted with the arguments that were
expected?'. It is crucial to interpret these results according to their lexical representativity, that is,
the number of expressions in the lexicon, as seen in Table 4.1. Bearing this in mind, and because the
lexical representativity is not the same for all classes, the total result is not calculated as an average
of the results of the classes: it is calculated for each class, and for the total amount of sentences. This
means that some classes may have a very low recall; however, this is not critical for the overall picture,
especially if they contain a low number of sentences. Therefore, an intrinsic evaluation is performed, by
measuring the recall:

Recall = TruePositives / (TruePositives + FalseNegatives)

which, in this situation, translates into the number of frozen sentences detected amongst the entire set of
frozen sentences, that is, the proportion of actual positives that were identified correctly. One important
highlight is that there are no false positives, because all the sentences in the corpus are assumed to be
actual frozen sentences, so any detection is a correct one. For the generated sentences, that implies
performing a thorough manual verification.
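As a minimal sketch (the function name is ours, not part of the system): since the corpus contains only frozen sentences, every undetected sentence counts as a false negative, and the denominator of the recall is simply the corpus size. The figures below reproduce the totals reported later in Table 4.2.

```python
# Recall = TruePositives / (TruePositives + FalseNegatives).
# Here the denominator is just the total number of frozen sentences,
# since all sentences in the corpus are actual positives.

def recall(true_positives, total_sentences):
    return true_positives / total_sentences

total = 2542              # base sentences in the matrix (Table 4.1)
extracted_fixed = 2430    # FIXED dependency extracted (Table 4.2)
exact_arguments = 2337    # exact arguments matched (Table 4.2)

print(f"{recall(extracted_fixed, total):.1%}")  # 95.6%
print(f"{recall(exact_arguments, total):.1%}")  # 91.9%
```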
4.3 Results
4.3.1 Base sentences
Table 4.2 presents the results obtained for the base sentences, by class and by the total number of
sentences, according to different criteria and different ways to interpret the results.
Table 4.2: Manually produced sentences correctly identified as frozen.
Class  # Total  # Extracted FIXED  % Extracted FIXED  # Correct number of arguments  % Correct number of arguments  # Correct arguments  % Correct arguments
C0 21 20 95,2% 18 85,7% 15 71,4%
C0-E 1 0 0,0% 0 0,0% 0 0,0%
C1 500 497 99,4% 495 99,0% 486 97,2%
C1P2 291 287 98,6% 280 96,2% 265 91,1%
C1PN 259 251 98,4% 245 96,1% 242 94,9%
CADV 70 66 94,3% 66 94,3% 63 90,0%
CAN 182 173 95,1% 173 95,1% 171 94,0%
CDN 45 44 97,8% 44 97,8% 43 95,6%
CNP2 172 170 98,8% 170 98,8% 167 97,1%
CP1 662 618 93,4% 614 92,7% 600 90,1%
CPN 103 82 79,2% 79 76,7% 72 70,0%
CPP 181 167 92,3% 165 91,2% 161 89,0%
CPPN 46 45 97,8% 45 97,8% 45 97,8%
CV 13 10 76,9% 7 53,8% 7 53,8%
TOTAL 2,542 2,430 95,6% 2,401 94,5% 2,337 91,9%
Each line of Table 4.2 refers to a class, indicated in the first column; the last line refers to the totals.
The second column, named # Total, contains the total number of sentences for that class. The third and
fourth columns contain, respectively, the number and the percentage of sentences from which the FIXED
dependency was extracted. The fifth and sixth contain, respectively, the number and the percentage
of sentences from which the FIXED dependency was extracted with the expected number of arguments.
The seventh and eighth columns contain, respectively, the number and the percentage of sentences from
which the FIXED dependency was extracted with the exact arguments expected. The percentages are
calculated by dividing the value of the cell by the total number of sentences of that class.
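As a concrete instance of this calculation, using the C1 row of Table 4.2:

```python
# Each percentage cell is the count divided by the class total.
# Values taken from the C1 row of Table 4.2.
class_total = 500         # C1 base sentences
extracted_fixed = 497     # C1 sentences with FIXED extracted

percentage = extracted_fixed / class_total * 100
print(f"{percentage:.1f}%")  # 99.4%
```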
In an overall observation, it is possible to see that the task of recognizing manually produced sentences
was rather successful, with the FIXED dependency being extracted for 95,6% of these sentences. Of the
total, 94,5% had the expected number of arguments and 91,9% had the exact arguments expected.
Therefore, there is only a difference of 3,7% between the number of sentences from which the FIXED
dependency was extracted and the number of sentences with the actual correct arguments, the most
specific criterion. This means that whenever the FIXED dependency is detected, it is very likely to contain
at least the correct number of arguments. Some errors are related to constructions using past participles,
such as O Zé tinha ido com a cara da Ana, lit: 'Zé had gone with Ana's face', which means to like
someone. The rule for this sentence expects the output FIXED(ir, cara), but the validator builds the
extracted output as FIXED(ter, cara). The error is in the validator itself, which is taking as the main
verb ter, lit: 'to have', instead of ir, lit: 'to go'. So, according to the validator, the number of arguments
is the same but the arguments do not match, even though STRING is extracting the dependency correctly.
For several other unusual constructions there are also mismatches between expected and obtained
outputs, probably because the rules generated by the system do not accommodate these constructions.
This might be a problem when extrinsically evaluating the system, since the occurrence of such
constructions in texts may be high. Other errors are mainly due to STRING's wrong POS tagging and
disambiguation.
The development of the rule generation was done through several iterations. The result of each iteration
required manual validation of the rules, and several problems were detected in STRING, in the
developed system and in the lexicon-grammar description. For instance, STRING did not interpret
compound adverbial expressions as a compound, but rather as individual components, therefore
failing to identify as FIXED many of the sentences belonging to class CADV. However, in a final phase of
this project, a detailed manual validation of each problem was performed, and corrections were applied
to both STRING and the developed system. This greatly improved not only the detection of FIXED
sentences, but also of FIXED sentences containing the correct arguments. Before this manual validation,
the values ranged from 79% for the most specific criterion to 86,6% for the least specific one.
4.3.2 Artificially generated sentences
Tables 4.3 to 4.9 present the results obtained for the artificially generated sentences, by class and in
total, split by transformation. Each line of these tables refers to a class, represented in the first column;
the last line refers to the totals. The second column, named # Total, contains the total number of
sentences belonging to that class. The third and fourth columns contain, respectively, the number and
the percentage of sentences from which the FIXED dependency was extracted. The fifth and sixth contain,
respectively, the number and the percentage of sentences from which the FIXED dependency was
extracted with the expected number of arguments. The seventh and eighth columns contain, respectively,
the number and the percentage of sentences from which the FIXED dependency was extracted with the
exact arguments expected. The percentages are calculated by dividing the value of the cell by the total
number of sentences of that class to which the transformation in question may be applied.
Following these tables, an overall analysis of the obtained results is performed.
Table 4.3: Artificially generated sentences for [PronA] correctly identified as frozen
Class  # Total  # Extracted FIXED  % Extracted FIXED  # Correct number of arguments  % Correct number of arguments  # Correct arguments  % Correct arguments
C0 2 2 100,0% 2 100,0% 2 100,0%
CNP2 172 156 90,1% 156 90,1% 156 90,1%
CPPN 15 12 80,0% 12 80,0% 12 80,0%
TOTAL 189 170 90,0% 170 90,0% 170 90,0%
For the [PronA] transformation, described in Table 4.3, the obtained results were very satisfactory.
Only 19 sentences were not identified as frozen, and every sentence detected as FIXED contained the
expected arguments. The failures are probably due to faulty functioning of STRING. One example is
the sentence O Filipe conhece-o de nome, lit: 'Filipe knows him by name', and several similar sentences
containing the preposition de, lit: 'by'. The chain is not able to extract the FIXED dependency for this
type of sentences.
Table 4.4: Artificially generated sentences for [PronR] correctly identified as frozen
Class  # Total  # Extracted FIXED  % Extracted FIXED  # Correct number of arguments  % Correct number of arguments  # Correct arguments  % Correct arguments
CNP2 18 17 94,4% 17 94,4% 17 94,4%
CPPN 4 4 100% 4 100% 4 100%
TOTAL 22 21 95,5% 21 95,5% 21 95,5%
The frozen sentence identification process was very successful for the [PronR] transformation, as shown
in Table 4.4, with 95,5% identification on all three criteria, and only one sentence failing. However, this
transformation is also the one with the smallest number of elements, 22, being, therefore, the least
representative of all transformations, so each failure takes a toll on the calculations.
The only sentence for which the FIXED dependency fails to be extracted is O Fernando vê-se ao perto.
The rule for this expression expects a MOD[post](se,"ao perto"), which is not extracted by STRING.
Table 4.5: Artificially generated sentences for [PronPos] correctly identified as frozen
Class  # Total  # Extracted FIXED  % Extracted FIXED  # Correct number of arguments  % Correct number of arguments  # Correct arguments  % Correct arguments
C0 2 2 100% 2 100% 2 100%
C1PN 3 2 66,7% 2 66,7% 2 66,7%
CAN 178 176 98,9% 176 98,9% 176 98,9%
CDN 34 33 97,1% 33 97,1% 33 97,1%
CPN 96 78 81,3% 77 80,2% 77 80,2%
CPP 4 4 100,0% 4 100,0% 4 100,0%
CPPN 3 3 100,0% 3 100% 3 100%
TOTAL 320 298 93,1% 297 92,8% 297 92,8%
The sentences generated for the [PronPos] transformation achieved, overall, very good results, as shown
in Table 4.5. One example of a problem in the rule generation is the sentence O Pedro quis mal à Maria,
lit: 'Pedro wanted harm to Maria', i.e. wishing bad things to happen to someone. STRING extracts a
CDIR_POST(quer,seu), while the rule expects a MOD_POST(quer,seu). Another example, this time due
to errors in the chain, is the sentence O Henrique acaba com a sua raça, lit: 'Henrique ends with
someone's race', i.e. to kill someone. Here, all the obtained restrictions are expected by the rule, but
the dependency is not extracted, probably due to disambiguation issues with the word raça.
Table 4.6: Artificially generated sentences for [PronD] correctly identified as frozen
Class  # Total  # Extracted FIXED  % Extracted FIXED  # Correct number of arguments  % Correct number of arguments  # Correct arguments  % Correct arguments
C0 5 5 100% 5 100% 5 100%
C1PN 138 131 94,9% 131 94,9% 131 94,9%
CPP 26 21 80,8% 21 80,8% 21 80,8%
CPPN 7 3 42,9% 3 42,9% 3 42,9%
TOTAL 176 160 90,9% 160 90,9% 160 90,9%
The sentences generated by the [PronD] transformation were, for the most part, adequately parsed,
achieving very good results, as shown in Table 4.6. Every time the FIXED dependency is extracted, it
contains the correct arguments.
Some failures are related to the fact that the generated rules are missing some components. One example
is the sentence generated from the description of O João entregou em mãos o livro ao Pedro, lit: 'João
handed in hands the book to Pedro'. What is being generated is the sentence O João entrega-lhe, lit:
'João delivers to him', while it should be O João entrega-lhe algo em mãos, lit: 'João delivers to him
something in hands'.
Table 4.7: Artificially generated sentences for [RDat] correctly identified as frozen
Class  # Total  # Extracted FIXED  % Extracted FIXED  # Correct number of arguments  % Correct number of arguments  # Correct arguments  % Correct arguments
C0 2 2 100,0% 2 100% 2 100%
C1PN 4 4 100% 4 100,0% 4 100,0%
CAN 181 178 98,3% 178 98,3% 178 98,3%
CPN 2 2 100,0% 2 100,0% 2 100,0%
CPPN 9 9 100,0% 9 100,0% 9 100,0%
TOTAL 198 195 98,0% 195 98,0% 195 98,0%
The [RDat] transformation obtained great results, as shown in Table 4.7. Every time the FIXED
dependency is extracted, it contains the expected arguments. One example of a failure due to STRING's
disambiguation issues is O João corta-lhe as vazas, which has no literal translation to English but means
to make someone's plans more difficult; here vazas is labeled as a verb, but in this context it is a noun.
Another STRING-related problem happens in the sentences O João não lhe largava a braguilha, lit: 'João
would not release his fly', and O João não lhe largava a porta, lit: 'João would not release his door'.
Their rules expect to find a CDIR[post](largava,braguilha), but instead a
MOD[post](largava,braguilha) is found.
Table 4.8: Artificially generated sentences for [PassSer] correctly identified as frozen
Class  # Total  # Extracted FIXED  % Extracted FIXED  # Correct number of arguments  % Correct number of arguments  # Correct arguments  % Correct arguments
C1 3 2 66,7% 2 66,7% 2 66,7%
C1PN 3 3 100,0% 3 100,0% 3 100,0%
CNP2 169 165 97,6% 164 97,0% 164 97,0%
CPPN 10 9 90,0% 8 80,0% 8 80,0%
TOTAL 185 179 96,8% 177 95,7% 177 95,7%
Table 4.9: Artificially generated sentences for [PassEstar] correctly identified as frozen
Class  # Total  # Extracted FIXED  % Extracted FIXED  # Correct number of arguments  % Correct number of arguments  # Correct arguments  % Correct arguments
C1 3 2 66,7% 2 66,7% 2 66,7%
C1PN 3 3 100,0% 3 100,0% 3 100,0%
CNP2 73 63 86,3% 62 84,9% 62 84,9%
CPPN 4 4 100,0% 3 75,0% 3 75,0%
TOTAL 83 72 86,7% 70 84,3% 70 84,3%
The passive transformation with both verbs ser and estar, 'to be', presented very good results, as
shown in Tables 4.8 and 4.9. Most issues are common to both types of passives.
Some errors are related to sentences containing the preposition por, 'by', such as Isso foi cortado pela
raiz, lit: 'That was cut off by the root', whose rule expects COMPL[post](#2,#3[surface:raiz]) and
instead finds MOD[post](#2,#3[surface:raiz]).
Other errors are related to the POS tagging performed by STRING. The rule for the sentence Isso foi
reduzido à expressão mais simples, lit: 'This was reduced to the simplest expression', expects a
compound adverbial expression as a modifier, MOD[post](reduzido,expressão mais simples).
However, STRING breaks the expression down into two different modifiers, MOD[post](reduzido,expressão)
and MOD[post](expressão,simples). This prevents the FIXED dependency extraction for this rule.
It is possible to observe that the system was very successful in detecting the sentences automatically
generated from the base sentences' description by applying the transformations authorised by each
construction, having achieved above 93% recall. The difference between criteria for these sentences
is much smaller than for the manually produced sentences. After the final round of manual verification,
the sentences knowingly left unrecognized often have unsolvable problems, related to word
disambiguation and POS tagging. One important remark is that this is the first time such a number of
artificially generated sentences has been evaluated, and the obtained results were very satisfactory:
there are no small recall values, and the average recall for this type of sentences is 93%.
As a final experiment, a set of non-fixed sentences was manually produced. This was done by randomly
selecting fixed sentences and deforming them, removing some of their fixed complements, in order to
check whether the system would identify them as non-fixed. From a set of 513 sentences, 434 were
detected as non-fixed. The remaining 79 failures are probably due to the deformed sentence still being
too similar to the fixed one, or to the fact that the rules do not contain enough restrictions for a
complement or determinant.
4.4 Previous solution vs. Developed solution
After obtaining all the results from this system, it was deemed interesting to compare them against
the results that would be obtained for the same corpus using the previously existing system. Notice
that the previous system also produced the XIP rules from the lexicon-grammar matrix, even if it had
been developed at an earlier stage of the linguistic description, namely for a slightly smaller (yet
similar) set of frozen sentences. That set contained 2,520 frozen sentences, against the 2,542 current
base sentences, and 3,715 sentences when the base sentences are joined with the automatically generated
ones. Although there was no significant increase in the number of sentences, the criteria for belonging
to a certain class became more and more specific, and the description of each class was perfected over
time.
Doing so yielded the results seen in Tables 4.10 and 4.11. Due to the low percentage of identified
transformed sentences in the previous system, the sentences were merely grouped into manually
produced sentences and artificially generated ones, instead of detailing each transformation separately.
The total number of sentences used to calculate the recall is 2,542 for the sentences present in the
matrix, and 1,173 for the artificially generated sentences.
Figures 4.1, 4.2 and 4.3 compare the two systems in several ways, described below. The blue line
corresponds to the developed system, and the green line to the previously existing one.
Starting with Figure 4.1, it shows a comparison between the percentage of manually produced sentences
identified by each system, according to the three criteria presented above.
Table 4.10: Number of manually produced sentences identified as frozen according to the defined criteria, for both
systems.
Criteria  # Previous system  % Previous system  # Developed system  % Developed system  # Difference  % Difference
Fixed dependency 2,149 84% 2,430 96% 281 12%
Same number of arguments 1,892 75% 2,401 94% 501 19%
Exact arguments 1,796 74% 2,337 92% 541 18%
Figure 4.1: Comparing the performance (recall) of the developed system against the performance of the previous
one for the manually produced sentences.
There are two details that can be observed immediately:
• As the criteria grow more specific, there is a small decrease in recall for the developed system,
and a larger decrease for the previous one;
• The developed system maintains higher values on all criteria.
Reading the data in Figure 4.1, it can be observed that the two systems start at very different thresholds,
with a 12% difference between them and the developed system in the lead. The developed system then
keeps a comfortable margin when verifying the number of arguments, as well as their correspondence
to what was expected. This shows a consistent system, with small variations between criteria: the
difference between the recall of sentences from which the FIXED dependency was extracted with the
expected number of arguments and the recall of sentences from which it was extracted with the exact
arguments is around 4%, whilst for the previous system it is 10%. This means that, for the developed
system, not only is there a higher probability for the FIXED dependency to be extracted, but it is also
very likely to contain the correct arguments, or at least the correct number of arguments.
One of the most important things to take into consideration here is that not only was the system
extended to identify a wider range of sentences and their transformations, but its performance also
improved greatly when compared to the previous one. The fact that the correct identification of
arguments is also evaluated makes the results more trustworthy.
As for Figure 4.2, it shows a comparison between the percentage of artificially generated sentences
identified by both systems.
Table 4.11: Number of artificially generated sentences identified as frozen according to the defined criteria, for
both systems.
Criteria  # Previous system  % Previous system  # Developed system  % Developed system  # Difference  % Difference
Fixed dependency 257 22% 1095 93% 838 71%
Same number of arguments 241 21% 1090 93% 849 72%
Exact arguments 240 20% 1090 93% 850 72%
Figure 4.2: Comparing the performance of the developed system against the performance of the previous one for
the artificially generated sentences.
Analyzing this figure, it is possible to observe two important aspects. The first is that the difference
between criteria is not as accentuated as for the manually produced sentences; still, the two systems
react in different ways to the tightening of the criteria. For the developed system, there is no variation
between the percentage of extracted FIXED dependencies and the percentage of FIXED dependencies with
correct arguments. As for the previous system, its percentage of detected frozen sentences decreases
as the criteria grow more specific, although not very significantly. The second, and most important,
aspect is that the previous system clearly had a handicap in detecting sentences resulting from
transformations, with a 71% difference between the two systems on the criterion the previous system
was able to evaluate when it was developed. This happens because, while the previous system detects
some transformations, others had not yet been treated. Therefore, the developed system not only
improved the detection of sentences containing transformations, but also expanded the types of
transformations treated by the system.
One final experiment consisted in joining the two sets for both systems, and the obtained results are
shown in Table 4.12:
Table 4.12: Number of sentences (manually and artificially generated) identified as frozen according to the defined
criteria, for both systems.
Criteria  # Previous system  % Previous system  # Developed system  % Developed system  # Difference  % Difference
Fixed dependency 2,389 64% 3,518 95% 1129 31%
Same number of arguments 2,147 58% 3,484 94% 1337 36%
Exact arguments 2,121 57% 3,420 92% 1299 35%
The differences between the systems can be visualized in Figure 4.3.
Figure 4.3: Comparing the performance of the developed system against the performance of the previous one for
the artificially and manually generated sentences.
The difference is clearly visible in this last graph. The gap between the systems is quite noticeable,
reaching a 35% difference on the criterion "Exact arguments", the most specific one, with the developed
system maintaining a clear advantage. Although both systems lose performance as the criteria become
more specific, that decrease is not very substantial in either of them. It is very important to underline
that the developed system presents new ways to evaluate its performance, not only by finding the FIXED
dependency but also by asserting the correct number of arguments and the correct arguments.
As a final note, it should be underlined that a very careful analysis of the failures found during the
development of this system was performed, as had been done for the previous one. This allowed for
the correction of multiple errors in both STRING and the system, and allowed its performance to
improve greatly, which in turn makes the system more reliable.
Chapter 5
Conclusions
This project aimed at improving the processing of frozen sentences, that is, multiword verbal idioms,
in the STRING system. The XIP module, responsible for detecting them, uses rules created by an
existing system, which presented some weaknesses, particularly when detecting sentences resulting
from transformations of the sentences' base form. This work contributed to improving this detection in
the following manner:
• A new module was built that automatically generates sentences by applying a set of transformations
to the base sentences;
• The rule generator was rewritten in order to accommodate the transformations that can be applied
to the sentences;
• A new module was built that automatically validates the output of the examples, comparing it
against what was expected.
Generally speaking, this work contributed to improving the overall performance of the STRING system.
It did so by greatly improving the detection of sentences with transformations, as well as by introducing
a more thorough way of evaluating every sentence. However, a more in-depth manual validation of
both the generated sentences and the generated rules is still to be performed.
Another factor that was improved was the system's speed. The main contributor to this is the pipeline
script that automated the process of generating the rules, integrating them into the STRING system and
validating the results of running both the manually and the artificially generated sentences. Finally,
some errors were detected in the matrix while developing the rule generation. Therefore, this work also
helped improve the consistency of the lexicon-grammar, clarifying the meaning of some properties
encoded there, as well as validating the values present there.
5.1 Future work
As for future work, the following items are suggested, in order to continue improving the system, as
well as further evaluating its performance:
• Generate other types of transformations from the matrix description, like [Pass-se] or the sym-
metric construction;
• Build a golden collection from the corpus LE-PAROLE;
• Calculate both precision and recall on that same corpus, containing frozen and non-frozen
expressions. Precision corresponds to the proportion of positive identifications that are actually correct,
that is, the proportion of sentences identified as frozen that are actually frozen. Recall is the proportion
of actual positives that were identified correctly. Recall is already being calculated on the set of frozen
sentences used in this work, where there can be no false positives; on a more diverse corpus there
would be false positives as well, precision would become meaningful, and the results would be
different;
• Indicate, in the final report containing the results of the validation, the reason for the failure, when
there is one;
• Write, in the matrix, the result of the evaluation and the cause for a failure, when there is one, in
order to automate the error detection, and avoiding manual validation. This will ease the error
correction process;
• During the next evaluation iteration, compare its results with the results from the current version,
and underline differences;
• Generate, in an automatic way, sentences in which one of the frozen elements is missing and
which are, therefore, not fixed.
Processing a corpus such as the European Portuguese annotated corpus built in the scope of the
PARSEME project, which contains both frozen and non-frozen expressions, would be challenging for
the system. Frozen sentences may present themselves in the most varied ways, mixed with other
expressions, and it would be interesting to extrinsically evaluate the system on such data, especially
given that, up until now, the system has only been tested with texts containing only frozen sentences,
and has therefore only been intrinsically evaluated. This would allow for a more comprehensive
understanding of how well the system would behave in the real world, where sentences may not appear
as expected, or may appear as transformations of their base sentences.
In terms of what can be done in the developed code, all the conversions between a value of the matrix
and the XIP code should be described in a declarative way, such as a table or a dictionary. Coordination
could also be accepted for a constituent, as well as the POS PREDSUBJ and a prefix representing a
container (medidas, lit: 'measures').
References
[1] BAPTISTA, JORGE. 2005. Construções simétricas: argumentos e complementos. Pages 353–367 of:
FIGUEIREDO, O; RIO-TORTO, GRAÇA & SILVA, F. (eds), Volume de homenagem ao Prof. Mário Vilela.
Fac.Letras-U.Porto.
[2] BAPTISTA, JORGE & MAMEDE, NUNO. 2016. Nomenclature of chunks and dependencies in Portuguese
XIP Grammar 4.6. Technical Report. L2F-Spoken Language Laboratory, INESC-ID Lisboa, Lisboa.
[3] BAPTISTA, JORGE; CORREIA, ANABELA & FERNANDES, GRAÇA. Frozen Sentences of Portuguese:
Formal Descriptions for NLP. Pages 72–79 of: Workshop on Multiword Expressions: Integrating Process-
ing. Barcelona, Spain: ACL, for International Conference of the European Chapter of the Association
for Computational Linguistics.
[4] BAPTISTA, JORGE; FERNANDES, GRAÇA; TALHADAS, RUI; DIAS, FRANCISCO & MAMEDE, NUNO.
Implementing European Portuguese Verbal Idioms in a Natural Language Processing System. Pages
102 – 115 of: CORPAS PASTOR, G. (ED.) (ed), Computerised and Corpus-based Approaches to Phraseology:
Monolingual and Multilingual Perspectives/Fraseología computacional y basada en corpus: perspectivas mono-
lingües y multilingües, Proceedings of Conference of the European Society of Phraseology (EuroPhras 2015).
Málaga, Spain: Editions Tradulex, Geneva.
[5] BAPTISTA, JORGE; MAMEDE, NUNO & MARKOV., ILIA. 2014. Integrating verbal idioms into an
NLP system. Pages 251–256 of: BAPTISTA, JORGE; MAMEDE, NUNO; CANDEIAS, SARA; PARABONI,
IVANDRÉ; PARDO, THIAGO & DAS GRAÇAS VOLPE NUNES, MARIA (eds), Computational Processing of
the Portuguese Language. Lecture Notes in Computer Science / Lecture Notes in Artificial Intelligence,
vol. 8775. Berlin: Springer, for 11th International Conference PROPOR’2014, São Carlos – SP, Brazil,
October 8-10, 2014.
[6] CONSTANT, MATHIEU; ERYIGIT, GÜLSEN; MONTI, JOHANNA; VAN DER PLAS, LONNEKE;
RAMISCH, CARLOS; ROSNER, MICHAEL & TODIRASCU, AMALIA. 2017. Multiword Expression Pro-
cessing: A Survey. Computational Linguistics, 43(4), 837–892.
[7] DINIZ, CLÁUDIO. 2010. RuDriCo2 : Um Conversor Baseado em Regras de Transformação Declarativas.
Master thesis, Instituto Superior Técnico, Universidade Técnica de Lisboa.
[8] DINIZ, CLÁUDIO; MAMEDE, NUNO & PEREIRA, JOÃO. 2010. RuDriCo2: A faster disambiguator
and segmentation modifier. Pages 573–584 of: Simpósio de Informática - INForum.
[9] GROSS, MAURICE. 1982. Une classification des phrases "figées" du français. Revue Québécoise de
Linguistique, 11(2), 151–185.
[10] GROSS, MAURICE. 1996. Lexicon-Grammar. Pages 244–259 of: BROWN, KEITH & MILLER, J. (eds),
Concise Encyclopedia of Syntactic Theories. Cambridge: Pergamon.
[11] MAMEDE, NUNO; BAPTISTA, JORGE; CABARRÃO, VERA & DINIZ, CLÁUDIO. 2012. STRING: An
Hybrid Statistical and Rule-based Natural Language Processing Chain for Portuguese. In: Interna-
tional Conference on Computational Processing of Portuguese (PROPOR 2012), vol. Demo Session.
[12] MARTINS, R. T.; HASEGAWA, R.; NUNES, M. G. V.; MONTILHA, G. & OLIVEIRA, O. N. 1998.
Linguistic issues in the development of REGRA: a grammar checker for Brazilian Portuguese. Natural
Language Engineering, 4(4), 287–307.
[13] AIT-MOKHTAR, SALAH; CHANOD, JEAN-PIERRE & ROUX, CLAUDE. 2002. Robustness beyond
shallowness: incremental deep parsing. Natural Language Engineering, 8(2/3), 121–144.
[14] THE DOCUMENT COMPANY XEROX & XEROX RESEARCH CENTRE EUROPE. 2007a. Xerox Incremen-
tal Parser Reference Guide.
[15] THE DOCUMENT COMPANY XEROX & XEROX RESEARCH CENTRE EUROPE. 2007b. Xerox Incremen-
tal Parser User’s Guide.
[16] VICENTE, ALEXANDRE. 2013. LexMan: um Segmentador e Analisador Morfológico com Transdutores.
Master thesis, Instituto Superior Técnico, Universidade de Lisboa.
Appendix A
Conversion to XIP rules
In this annex, Tables A.1 to A.14 are presented, showing the restrictions imposed
by each class, their instantiation, and the corresponding XIP rule.
Table A.1: XIP Rule restrictions and instantiation for the class C1 and the example O João abanou o capacete
C1 - O João abanou o capacete lit: ‘João shook the helmet’, ‘to
dance’
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(abanou, João)
N1=N-hum CDIR(#2,#3) CDIR(abanou, capacete)
Det1 DETD(#3,?) DETD(capacete, o)
The XIP Rule for the example of Table A.1 is:
if (VDOMAIN(#?,#2[lemma:abanar]) &
CDIR[post](#2,#3[surface:capacete]) &
DETD(#3,?[surface:o])
)
FIXED(#2,#3)
Table A.2: XIP Rule restrictions and instantiation for the class CDN and the example O Rui sondou a opinião da Inês
CDN - O Rui sondou a opinião da Inês lit: ‘Rui sounded Inês’
opinion’, ‘to try to find out one’s opinion’.
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(sondou,Rui)
N1=N-hum CDIR(#2,#3) CDIR(sondou,opinião)
N2=Nhum MOD[post](#3,#4) MOD[post](opinião,Inês)
Det1 DETD(#3,?) DETD(opinião,a)
The XIP Rule for the example of Table A.2 is:
if ( VDOMAIN(#?,#2[lemma:sondar]) &
CDIR[post](#2,#3[surface:opinião]) &
MOD[post](#3,#4[UMB-Human])&
PREPD(#4,?[surface:de])
)
FIXED(#2,#3)
Given that this sentence allows for the [PronPos] transformation, the rule becomes the following:
if ( VDOMAIN(#?,#2[lemma:sondar]) &
CDIR[post](#2,#3[surface:opinião]) &
DETD(#3,?[surface:a]) &
( ( MOD[post](#3,#4[UMB-Human]) &
PREPD(#4,?[surface:de]) )
|| POSS(#3,?) )
)
FIXED(#2, #3)
Table A.3: XIP Rule restrictions and instantiation for the class CAN and the example O João matou a fome do Pedro.
CAN - O João matou a fome do Pedro lit: ‘João killed Pedro’s
hunger’
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(matou,João)
N1=Nhum CDIR(#2,#3) CDIR(matou,fome)
N2=Nhum MOD[post](#3,#4) MOD[post](fome,Pedro)
Det1 DETD(#3,?) DETD(fome,a)
Prep2 PREPD(#4,?) PREPD(fome,de)
The XIP Rule for the example of Table A.3 is:
if ( VDOMAIN(#?,#2[lemma:matar]) &
CDIR[post](#2,#3[surface:fome]) &
DETD(#3,?[surface:a]) & MOD[post](#3,#4[UMB-Human]) &
PREPD(#4,?[surface:de])
)
FIXED(#2,#3)
Given that this sentence allows for the [PronPos] and [Rdat] transformation, the rule becomes the
following:
if ( VDOMAIN(#?,#2[lemma:matar]) &
CDIR[post](#2,#3[surface:fome]) &
DETD(#3,?[surface:a]) &
( ( ( MOD[post](#3,#4[UMB-Human]) &
PREPD(#4,?[surface:de]) )
|| CLITIC(#2,?[dat]) )
|| POSS(#3,?) )
)
FIXED(#2, #3)
Table A.4: XIP Rule restrictions and instantiation for the class CNP2 and the example O Rui cortou o problema pela
base.
CNP2 - O Rui cortou o problema pela base lit: ‘Rui cut the problem
at its root’
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(cortou,Rui)
N1=N-hum CDIR(#2,#3) CDIR(cortou,problema)
Det2 DETD(#4,?) DETD(base,a)
C2 MOD[post](#3,#4) MOD[post](problema,base)
Prep2 PREPD(#4,?) PREPD(base,por)
The XIP Rule for the example of Table A.4 is:
if ( VDOMAIN(#?,#2[lemma:cortar]) &
CDIR[post](#2,#3[UMB-Human]) &
MOD[post](#2,#4[surface:base]) &
PREPD(#4,?[surface:por]) &
DETD(#4,?[surface:a])
)
FIXED(#2,#4)
Given that this sentence allows for the [PronA] transformation, the rule becomes the following:
if ( VDOMAIN(#?,#2[lemma:cortar]) &
( CDIR[post](#2,#3[UMB-Human]) || CLITIC(#2,#3[acc]) ) &
MOD[post](#2,#4[surface:base]) &
PREPD(#4,?[surface:por]) &
DETD(#4,?[surface:a])
)
FIXED(#2,#4)
Table A.5: XIP Rule restrictions and instantiation for the class C1PN and the example A Rita afiou os dentes ao
dinheiro.
C1PN - A Rita afiou os dentes ao dinheiro lit: ‘Rita sharpened her
teeth to the money’ ‘to be greedy’
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(afiou,Rita)
N1=N-hum CDIR(#2,#3) CDIR(afiou,dentes)
N2=N-hum MOD[post](#3,#4) MOD[post](dentes,dinheiro)
Det1 DETD(#3,?) DETD(dentes,os)
The XIP Rule for the example of Table A.5 is:
if ( VDOMAIN(#?,#2[lemma:afiar]) &
CDIR[post](#2,#3[surface:dentes]) &
DETD(#3,?[surface:os]) &
MOD[post](#2,#4[UMB-Human]) &
PREPD(#4,?[surface:a])
)
FIXED(#2,#3)
Table A.6: XIP Rule restrictions and instantiation for the class C1P2 and the example O casaco custou os olhos da
cara do Rui.
C1P2 - O casaco custou os olhos da cara do Rui lit: ‘The coat
cost the eyes of Rui’s face’ ‘to be very expensive’
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(custou,casaco)
N1=N-hum CDIR(#2,#3) CDIR(custou,olhos)
Det1 DETD(#3,?) DETD(olhos,os)
Det2 DETD(#4,?) DETD(cara,a)
C2 MOD[post](#3,#4) MOD[post](olhos,cara)
Prep2 PREPD(#4,?) PREPD(cara,de)
The XIP Rule for the example of Table A.6 is:
if ( VDOMAIN(#?,#2[lemma:custar]) &
CDIR[post](#2,#3[surface:olhos]) &
DETD(#3,?[surface:os]) &
MOD[post](#3,#4[surface:cara]) &
PREPD(#4,?[surface:de]) &
DETD(#4,?[surface:a])
)
FIXED(#2,#3,#4)
Table A.7: XIP Rule restrictions and instantiation for the class CPPN and the example O João comprou gato por
lebre ao Pedro.
CPPN - O João comprou gato por lebre ao Pedro lit: ‘João bought cat
for hare from Pedro’, ‘to be duped’.
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(comprou,João)
N1=N-hum CDIR(#2,#3) CDIR(comprou,gato)
Prep2 PREPD(#4,?) PREPD(lebre,por)
C2 MOD[post](#3,#4) MOD[post](gato,lebre)
Prep3 PREPD(#5,?[surface:a]) PREPD(Pedro,a)
N3 = Nhum MOD[post](#2,#5[UMB-Human]) MOD[post](comprou,Pedro)
The XIP Rule for the example of Table A.7 is:
if ( VDOMAIN(#?,#2[lemma:comprar]) &
CDIR[post](#2,#3[surface:gato]) &
MOD[post](#2,#4[surface:lebre]) &
PREPD(#4,?[surface:por]) &
MOD[post](#2,#5[UMB-Human]) &
PREPD(#5,?[surface:a])
)
FIXED(#2,#3,#4)
Table A.8: XIP Rule restrictions and instantiation for the class CPP and the example O Zé bate com o nariz na porta
lit: ‘Zé hit with his nose on the door’.
CPP - O Zé bate com o nariz na porta lit: ‘Zé hit with his nose on
the door’, ‘finding a place to be closed or not achieving something’
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(bate,Zé)
N1=N-hum CDIR(#2,#3) CDIR(bate,nariz)
Prep1 PREPD(#3,?) PREPD(nariz,com)
Det1 DETD(#3,?) DETD(nariz, o)
Prep2 PREPD(#4,?) PREPD(porta,em)
Det2 DETD(#4,?) DETD(porta,a)
C2 MOD[post](#3,#4) MOD[post](nariz, porta)
The XIP Rule for the example of Table A.8 is:
if ( VDOMAIN(#?,#2[lemma:bater]) &
MOD[post](#2,#3[surface:nariz]) &
PREPD(#3,?[surface:com]) &
DETD(#3,?[surface:o]) &
MOD[post](#2,#4[surface:porta]) &
PREPD(#4,?[surface:em]) &
DETD(#4,?[surface:a])
)
FIXED(#2,#3,#4)
Table A.9: XIP Rule restrictions and instantiation for the class CP1 and the example O Zé bateu em retirada.
CP1 - O Zé bateu em retirada lit: ‘Zé has withdrawn’ ‘to run away’
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(bateu,Zé)
Prep1 PREPD(#3,?) PREPD(retirada,em)
C1 MOD[post](#2,#3) MOD[post](bateu,retirada)
The XIP Rule for the example of Table A.9 is:
if ( VDOMAIN(#?,#2[lemma:bater]) &
MOD[post](#2,#3[surface:retirada]) &
PREPD(#3,?[surface:em])
)
FIXED(#2,#3)
Table A.10: XIP Rule restrictions and instantiation for the class CPN and the example O Zé desceu na consideração
da Ana.
CPN - O Zé desceu na consideração da Ana lit: ‘Zé went down on
Ana’s consideration’.
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(desceu,Zé)
N2=Nhum MOD[post](#3,#4) MOD[post](consideração,Ana)
Prep1 PREPD(#3,?) PREPD(consideração,em)
Det1 DETD(#3,?) DETD(consideração,a)
C1 MOD[post](#2,#3) MOD[post](desceu,consideração)
The XIP Rule for the example of Table A.10 is:
if ( VDOMAIN(#?,#2[lemma:descer]) &
MOD[post](#2,#3[surface:consideração]) &
PREPD(#3,?[surface:em]) &
DETD(#3,?[surface:a]) &
MOD[post](#3,#4[UMB-Human]) &
PREPD(#4,?[surface:de])
)
FIXED(#2,#3)
Given that this sentence allows the [PronD] transformation, the rule becomes:
if ( VDOMAIN(#?,#2[lemma:descer]) &
MOD[post](#2,#3[surface:consideração]) &
PREPD(#3,?[surface:em]) &
DETD(#3,?[surface:a]) &
( ( MOD[post](#3,#4[UMB-Human]) &
PREPD(#4,?[surface:de]) )
|| POSS(#3,?) )
)
FIXED(#2,#3)
Table A.11: XIP Rule restrictions and instantiation for the class C0 and the example A sorte bateu à porta do Pedro.
C0 - A sorte bateu à porta do Pedro lit: ‘Luck knocked on Pedro’s
door’.
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=N-hum SUBJ(#2,#1) SUBJ(bateu,sorte)
N2=Nhum MOD[post](#3,#4) MOD[post](porta,Pedro)
Prep1 PREPD(#3,?) PREPD(porta,a)
Det1 DETD(#3,?) DETD(porta,a)
C1 MOD[post](#2,#3) MOD[post](bateu,porta)
The XIP Rule for the example of Table A.11 is:
if ( VDOMAIN(#?,#2[lemma:bater]) &
SUBJ(#2,#1[surface:sorte]) &
MOD[post](#2,#3[surface:porta]) &
PREPD(#3,?[surface:a]) &
DETD(#3,?[surface:a]) &
MOD[post](#3,#4[UMB-Human]) &
PREPD(#4,?[surface:de])
)
FIXED(#2,#1,#3)
This sentence allows for two transformations to be applied to it, [PronPos] and [PronD]. Considering
these two transformations, the rule becomes:
if ( VDOMAIN(#?,#2[lemma:bater]) &
SUBJ(#2,#1[surface:sorte]) &
MOD[post](#2,#3[surface:porta]) &
PREPD(#3,?[surface:a]) &
DETD(#3,?[surface:a]) &
( ( ( MOD[post](#3,#4[UMB-Human]) &
PREPD(#4,?[surface:de]) )
|| CLITIC(#2,?[dat]) )
|| POSS(#3,?) )
)
FIXED(#2,#1,#3)
Table A.12: XIP Rule restrictions and instantiation for the class C0E and the example Vai pentear macacos!.
C0E - Vai pentear macacos! lit: ‘Go comb monkeys!’, ‘do not
bother me/anyone anymore’.
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N1=N-hum CDIR(#2,#3) CDIR(pentear,macacos)
Vc VLINK(#2,#3) VLINK(vai, pentear)
The XIP Rule for the example of Table A.12 is:
if ( VLINK(#2[lemma:ir],#3[lemma:pentear]) &
CDIR[post](#3,#4[surface:macacos])
)
FIXED(#2,#3,#4)
Table A.13: XIP Rule restrictions and instantiation for the class CADV and the example O Pedro não nasceu ontem.
CADV - O Pedro não nasceu ontem lit: ‘Pedro was not born yes-
terday’, ‘is not dumb’.
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=Nhum SUBJ(#2,#1) SUBJ(nasceu,Pedro)
NegObrig MOD[neg,pre](#2,#3) MOD[neg](nasceu,não)
Adv1 MOD[post](#2,#4[adv,surface:ontem]) MOD(nasceu,ontem)
The XIP Rule for the example of Table A.13 is:
if ( VDOMAIN(#?,#2[lemma:nascer]) &
MOD[neg,pre](#2,#3) &
MOD[post](#2,#4[adv,surface:ontem])
)
FIXED(#3, #2, #4)
Table A.14: XIP Rule restrictions and instantiation for the class CV and the example A resposta não se fez esperar.
CV - A resposta não se fez esperar lit: ‘The answer did not take
long to arrive’.
Matrix Column XIP Rule Restriction XIP Rule Restriction Instantiation
N0=N-hum SUBJ(#2,#1) SUBJ(fez,resposta)
NegObrig MOD[neg,pre](#2,?) MOD[neg](fez,não)
Vc VLINK(#2,#3) VLINK(fez,esperar)
Vse CLITIC(#2,?) CLITIC(esperar,se)
The XIP Rule for the example of Table A.14 is:
if ( VLINK(#2[lemma:fazer],#3[lemma:esperar]) &
MOD[neg,pre](#3,#4) &
CLITIC(#3,#5[ref])
)
FIXED(#4, #2, #3, #5)
Appendix B
Readme of the program
+-------------+
| XIPIFICATOR |
+-------------+
2014
2019
WHAT IS IT?
================
This is a Python application that automatically generates rules for detecting
frozen expressions. It also generates examples containing the transformations
applicable to certain sentences. Finally, there is a validator that runs the
example sentences for each rule through STRING, extracts the FIXED dependency,
and checks whether the result meets one of three criteria:
• The FIXED dependency was extracted correctly;
• The FIXED dependency was extracted correctly, with the expected number of
arguments;
• The FIXED dependency was extracted correctly, with exactly the expected
arguments.
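The three criteria above form a graded check. A minimal sketch (the function and argument names are illustrative, not the actual validator code):

```python
def validation_level(extracted, expected):
    """Return the strictest criterion met by an extracted FIXED dependency.

    extracted/expected are lists of argument strings, or None when the
    dependency was not extracted at all.
    """
    if extracted is None:
        return 0                      # FIXED not extracted
    if extracted == expected:
        return 3                      # exact arguments match
    if len(extracted) == len(expected):
        return 2                      # right number of arguments
    return 1                          # dependency extracted at all

print(validation_level(["matou", "fome"], ["matou", "fome"]))  # 3
```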
HOW TO USE?
================
Rule generator:
python3 bin/xipificator.py -file=FixedExpressions-v9.13.xlsx -sheet=FINAL >
dependencyFPhrase.xip
Copy the generated rules into the XIP dependencies, replacing the previously
existing ones:
cp dependencyFPhrase.xip ../xip/ptGram/DEPENDENCIES/
Process the manually produced sentences with STRING:
cat "examples/validate.txt"| ../xip/./string.sh -f -tr -indent -xml > normal.xml
Process the automatically generated (transformed) sentences with STRING:
cat "examples/examplesPronA.txt"| ../xip/./string.sh -f -tr -indent -xml >
generatedPronA.xml
cat "examples/examplesPronD.txt"| ../xip/./string.sh -f -tr -indent -xml >
generatedPronD.xml
cat "examples/examplesPronP.txt"| ../xip/./string.sh -f -tr -indent -xml >
generatedPronP.xml
cat "examples/examplesPronR.txt"| ../xip/./string.sh -f -tr -indent -xml >
generatedPronR.xml
cat "examples/examplesPassSer.txt"| ../xip/./string.sh -f -tr -indent -xml
> generatedPassSer.xml
cat "examples/examplesPassEstar.txt"| ../xip/./string.sh -f -tr -indent -xml
> generatedPassEstar.xml
cat "examples/examplesRdat.txt"| ../xip/./string.sh -f -tr -indent -xml > generatedRDat.xml
Validate the results obtained from STRING:
python3 bin/xipificator_validate.py
HOW TO RUN EVERYTHING SEQUENTIALLY?
=============================================
./executeXipificator.sh
STRUCTURE OF THE INPUT FILE, XLSX OR CSV
=============================================
Rule files consist of a header (first line) and a set of frozen-expression
rules, one expression per line.
Each column contains one element of the rule (a flag, a lemma or word, a rule,
an example, ...).
The header contains the column names. The columns may appear in any order; in
that case the pattern must be known and identified in the code (see the
patterns matrix).
Using predefined names for each column, the application can determine the
column pattern automatically (using the -pattern=AUTO parameter).
The predefined names for each column are shown below.
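The automatic pattern detection described above can be sketched as follows (a hypothetical helper, not the actual xipificator code; the set of known names is abbreviated):

```python
# Hypothetical sketch of -pattern=AUTO column detection: map each predefined
# header name found in the first row of the sheet to its column index.
def detect_pattern(header_row, known_names=("V", "N0", "C1", "Prep1", "Det1")):
    pattern = {}
    for idx, name in enumerate(header_row):
        if name in known_names:
            pattern[name] = idx
    return pattern

print(detect_pattern(["N0", "V", "Det1", "C1", "Exemplo"]))
```

Columns with unknown names are simply skipped, which is why the predefined names must be used for auto-detection to work.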
Subject and Verb:
-----------
N0 = Nhum : The subject is a human noun (marks a flag);
N0 = N-hum : The (free) subject is a non-human noun (marks a flag);
VSe : The verb must be accompanied by a clitic;
NegObrig : The verb must be accompanied by a negative adverb or a negative expression;
V : Verb (surface form or lemma);
PrepLink : Preposition linking the first verb of the construction to a second one.
Modifiers:
---------
$ is the number of the modifier dependency, starting at 1
C$ : Head of the modifier chunk;
Prep$ : Preposition (Prep1 is the preposition of C1);
Det$ : Determiner;
Modif$E : Pre-modifier of C$;
Modif$D : Post-modifier of C$;
Adj$ : Adjective modifying C$;
N$ = Nhum : The word at the head of chunk $ is a human noun;
N$ = N-hum : The word at the head of chunk $ is a non-human noun;
AttachV$ : By default, a modifier N+1 has a dependency on the previous modifier
N. Marking this cell (+) creates a dependency on the verb instead of on the
previous modifier;
C$Manual : Manual XIP rule for the whole modifier $. It overrides the automatically
generated rule. Useful for rules with exceptional representations;
C$ModManual : XIP rule for the modifier of C$.
Columns indicating the transformations that can be applied to the construction:
-------------------------------------------------------------
[PronR$] : If marked with '+', this column indicates that the free phrase N1
can be reduced to a reflexive pronoun (e.g. "O Pedro entregou tudo nas mãos
de Deus" becomes "O Pedro entrega-se nas mãos de Deus");
[PronD$] : If marked with '+', this column indicates that the complement N$ is
distributionally free and can be reduced to a dative pronoun (e.g. "O Pedro
tirou o chapéu ao João" becomes "O Pedro tirou-lhe o chapéu");
[PronA$] : The free phrase N$ can be reduced to an accusative pronoun (e.g.
"O João viu a Inês pelo canto do olho" becomes "O João viu-a pelo canto do
olho");
[PronPos$] : If marked with '+', this column indicates that the prepositional
phrase "de N$" can be reduced to a possessive pronoun (e.g. "O Zé fala nas
costas da Ana" becomes "O Zé fala nas suas costas");
[Pass-ser] : If marked with '+', this column indicates that the sentence can
be turned into the passive form, the copulative verb accepted in this form
being ser (e.g. "A imprensa abafou um escândalo" becomes "Um escândalo foi
abafado pela imprensa");
[Pass-estar] : If marked with '+', this column indicates that the sentence can
be turned into the passive form, the copulative verb accepted in this form
being estar (e.g. "A imprensa abafou um escândalo" becomes "Um escândalo está
abafado pela imprensa");
[Rdat$] : If marked with '+', the Rdat operation applies to determinative noun
complements de_Nhum, restructuring the larger constituent of which de_Nhum is
part into two complements, namely turning de_Nhum into a_Nhum and attaching it
directly to the verb. The latter can then be pronominalized (a_Nhum => -lhe)
(e.g. "O Pedro come as papas na cabeça da Ana" becomes "O Pedro come-lhe as
papas na cabeça");
Sim$ : If marked with '+', this column indicates that two constituents of this
construction can be coordinated in a given syntactic position (symmetric
subjects or symmetric complements) and can swap places without changing the
overall meaning of the sentence (e.g. "A Isabel juntou os trapinhos com o
Luís" is equivalent to "O Luís juntou os trapinhos com a Isabel").
Other columns:
----------
AllManual : Optional. Manual code. If marked (+), the content of the ‘Manual‘
cell contains the code for this expression;
Manual : Optional. XIP code for this expression (if AllManual is marked
with (+));
Expected : Optional. Result expected in the XIP dependency list for this
expression;
Exemplo : Example of the use of this expression. This example is used as the
test sentence by the validator;
Falha : Used to mark the cause of the expression's validation error. If empty,
it is assumed that there is no error. An error can be marked with the pattern
<CODE>:<WORD OR EXPRESSION WHERE IT OCCURS>. Example: ' P:casa (between noun
and verb) '. Rules that test only expressions with an error code can be
generated using the -f (falha) parameter. If marked with '?', the error is
assumed to be still unknown. Rules for expressions marked with this code can
be generated using the -d (dúvida) parameter;
Normalized : A set of predicates is paired with a generic verb (e.g. "bater
as botas" pairs with "morrer").
CELL SYNTAX
=====================
Lemmas and Surface Forms
-------------
By default, a word in a cell indicates the surface form of a word. To indicate
a lemma, the word must be surrounded by < >. Example: palavra indicates a
surface form, <palavra> indicates a lemma.
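This cell convention can be sketched with a small parser (a hypothetical helper, not the actual xipificator code):

```python
def parse_cell(cell):
    """Classify a matrix cell as a lemma (<word>) or a surface form (word)."""
    cell = cell.strip()
    if cell.startswith("<") and cell.endswith(">"):
        return ("lemma", cell[1:-1])
    return ("surface", cell)

print(parse_cell("<abanar>"))  # ('lemma', 'abanar')
print(parse_cell("capacete"))  # ('surface', 'capacete')
```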
Pos:
---
By default, each rule column has a predefined dependency type or POS.
(Example: Det1 generates determiners, POS-MOD generates modifiers.) The POS of
the word linked by that dependency can be changed using a prefix on the word.
The POS can be defined in two ways:
• <POS>, when it applies to any word in that entry
• POS:abc, when it applies to the word 'abc'
Example 1: To indicate that the entry has a determiner, write <DET>.
Example 2: To indicate that the word 'abc' is an adjective, add the prefix A,
so the entry takes the form A:abc. In the case of a lemma, the entry takes the
form A:<abc>.
They are defined in the code, in the pos matrix, which can be customized with
more entries.
By default, the xipificator has the following POS:
• DET+POS: Determiner and Possessive Pronoun
• PRON+POS: Possessive Pronoun
• PRON+PES: Personal Pronoun
• DET+DEM: Determiner and Demonstrative Pronoun
• ADV: Adverb
• A: Adjective
• DET: Determiner
• POSDET: Post-determiner
• Q: Ordinal, Cardinal or Quantity
• PREP: Preposition
• CONJ: Conjunction
Flags:
----
Flags allow adding specific features to each entry of the rule. They are
defined after a Pos, in the form <POS:FLAGS>.
Example: a feminine singular determiner and possessive pronoun is written in
the form <DET+POS:fs>.
They are defined in the code, in the flags matrix, which can be customized
with more entries.
By default, the xipificator has the following flags:
• s: singular
• p: plural
• m: masculine
• f: feminine
• O: oblique (pronoun)
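The <POS:FLAGS> cell format can be sketched as follows (a hypothetical helper with the default flag table; not the actual xipificator code):

```python
# Default flag codes, as listed above.
FLAGS = {"s": "singular", "p": "plural", "m": "masculine",
         "f": "feminine", "O": "oblique"}

def parse_pos_cell(cell):
    """Split a <POS:FLAGS> cell into the POS tag and its feature list."""
    inner = cell.strip("<>")
    pos, _, flags = inner.partition(":")
    return pos, [FLAGS[c] for c in flags]

print(parse_pos_cell("<DET+POS:fs>"))  # ('DET+POS', ['feminine', 'singular'])
```

A cell without flags, such as <DET>, yields an empty feature list.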
Prefixes:
------
Prefixes allow changing the type of the dependency link and/or adding special
features to words.
Example: 'retomar' should be indicated as the word (verb) 'tomar' with a
prefix (re-). To force the existence of that prefix, the entry must take the
form PFX:<tomar>.
They are defined in the code, in the prefix matrix, which can be customized
with more entries.
By default, the xipificator has the following prefixes:
• MOD: Modifier;
• CDIR: Direct Complement;
• CIND: Indirect Complement;
• FOC: Modifier with Focus;
• PREDSUBJ: Subject Predicative;
• PFX: Word with a prefix.
STRUCTURE OF THE GENERATED RULES
============================
The subject is marked as dependency #1 inside the rule. The verb is marked as
dependency #2. The following modifiers are marked as #3, #4, etc.
The representation of the subject is determined by:
• Any subject with the HUM and/or N-HUM flags, as indicated in the rules;
• A personal pronoun, using the dependency SUBJ(?,?[pers]);
• A relative pronoun, using the dependency QBOUNDARY.
The verb is defined by its VDOMAIN dependency. Modifiers are marked with a
CDIR dependency if they have no preposition, or with MOD if they have one.
Determiners, adjectives and pre/post-modifiers are likewise linked to the head
of the dependency.
If one of the modifiers contains an empty cell, the existence of a dependency
is accepted but optional.
If one of the modifiers contains the entry <E> (empty), it is considered that
no dependency of this type exists.
In this last case, the rule negates the existence of modifiers with MOD(?,?).
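The numbering scheme above can be sketched as follows (illustrative only; the real generator derives the count from the matrix columns):

```python
def assign_indices(n_modifiers):
    """Map the constituents of a rule to their XIP dependency indices:
    subject -> #1, verb -> #2, modifiers -> #3, #4, ..."""
    indices = {"SUBJ": "#1", "V": "#2"}
    for i in range(1, n_modifiers + 1):
        indices[f"C{i}"] = f"#{i + 2}"
    return indices

print(assign_indices(2))  # {'SUBJ': '#1', 'V': '#2', 'C1': '#3', 'C2': '#4'}
```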
HOW-TO
======
ADD A NEW POS OR PREFIX?
----------------------
1. Find the declaration of the pos or prefix matrix in xipificator.py;
2. Add a new row to the matrix;
3. The first column is the POS code, the second column holds the features the
entry will have (leave it blank when in doubt), and the third column is a
dependency link (by default it will be MOD[post] or CDIR[post]), in case it
has to be changed;
4. It is now possible to create entries of the form POS:palavra, <POS:flags>
or PREFIX:palavra.
ADD A NEW DEPENDENCY?
---------------------
1. Find the declaration of the flags or pos matrix;
2. Add a row relating a flag or POS to a dependency link (DEPTAG).
CREATE MY OWN COLUMN PATTERN IN THE XLSX?
-------------------------------
1. Find the declaration of the patterns matrix;
2. Copy the row for the AUTO pattern and rename it to NAME;
3. For each entry (represented in the top comment) add the index of the column
inside the XLSX file; if V is in column H, then add col('H') to the entry for
the VERB;
4. Use it by passing the argument -pattern=NAME.
ADD A COLUMN TO THE AUTOMATIC PATTERN?
----------------------------
1. Add a new constant to the list indicated in #dependency structure, with the
format _NOVA => an id different from all the others already defined;
2. Increment the value of the DEPENDENCYSIZE constant by 1;
3. In the writeDependency routine, add a call for the new dependency
NOVA_DEPENDENCIA. Example:
(id, fixed, expected) = printDepLink(lineno, 'NOVA_DEPENDENCIA', pattern,
arr, base, _NOVA, prvid, id, fixed, expected, 0);
4. In the guessPattern routine, add one more entry to the else-if chain
indicating the name of the column 'NOVACOLUNA' to be added, where i is the
number of the modifier. Example:
elif (str == ("NOVACOLUNA" + i)) pattern[DEPENDENCY1 + (i-1)*DEPENDENCYSIZE
+ _NOVA] = position;
5. Create the column in the input file;
6. The syntax is the same as for any other modifier (the use of prefixes may
be necessary).
ATTACH A MODIFIER TO THE VERB INSTEAD OF THE PREVIOUS MODIFIER?
-----------------------------------------
1. Create the AttachV$ column ($ is the modifier number) if it does not exist;
2. Mark the AttachV$ column with a +.
INDICATE THAT A MODIFIER IS AN INDIRECT COMPLEMENT?
-----------------------------------------
1. Put CINDIR:palavra in the modifier's C$ cell.
DEFINE A PREPOSITION FOR THE NEXT MODIFIER WHEN IT (MOD) IS NOT KNOWN?
---------------------------------------------------------
1. Create one more set of columns for the next modifier;
2. Fill in the preposition column;
3. Leave C$ blank or indicate its <POS>.
CHANGE THE HEADER OF THE RULES FILE?
---------------------------
1. Change the writeHeader routine at the end of xipificator_aux_functions.pl.