building a corpus for learning how to produce atonal pronouns in the romanian clitic sequence
TRANSCRIPT
-
7/28/2019 Building a corpus for learning how to produce atonal pronouns in the Romanian clitic sequence
Building a corpus for learning how to produce
atonal pronouns in the Romanian clitic
sequence
Ciprian-Virgil Gerstenberger
Universitetet i Tromsø, Norway
Learner Language, Learner Corpora Conference
LLLC 2012, 06.10.2012, Oulu, Finland
Outline
Atonal pronouns: Why a special corpus?
Language knowledge: How to build it?
Language production: What are the benefits?
General question
How to deal with soft constraints in language production?
free word order (e.g., in Finnish)
information structure, style?
in-situ vs. extraposed relative clauses (e.g., in German)
clause weight, register?
optional sandhi phenomena (e.g., in Romanian)
genre, register, dialect, sociolect, idiolect?
Specific question
What triggers optional realizations of Romanian atonal pronouns?
(1) a. Te rog să îl faci! [Please, do it!]
    b. Te rog să-l faci!
(2) a. Știu că îi scrii emailuri. [I know that you write him/her emails.]
    b. Știu că-i scrii emailuri.
(3) a. Hai să ne apucăm de treabă! [Let's start working!]
    b. Hai să ne-apucăm de treabă!
? ? ?
(External) Sandhi: Joining
Epenthesis in English: a car vs. an old car
Elision in French: la fille [the girl] vs. l'église [the church]
Elision in Romanian: Tu îl vezi. vs. Tu-l vezi. [You see him/it.]
Sandhi can be marked graphically, but it does not have to be.
Elision in Romanian is always graphically marked!
Sandhi in Romanian: General rule: avoid hiatus
Context: CV VC
C-VC
Mă apuc de treabă. [I start working.]
M-apuc de treabă.
CV-C
Tu îl vezi. [You see him/it.]
Tu-l vezi.
CV̯-VC
Te apuci de treabă. [You start working.]
Te-apuci de treabă.
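The hiatus context behind these three patterns can be located mechanically. The following is a minimal sketch, not the author's implementation: it assumes a simplified orthographic vowel set, and the function names `has_hiatus` and `hiatus_sites` are hypothetical. It only finds word boundaries where a vowel-final word meets a vowel-initial word, which is where one of the sandhi patterns above may apply.

```python
# Simplified orthographic vowel inventory (an assumption for illustration;
# glides and diphthongs are ignored).
VOWELS = set("aeiouăâî")


def has_hiatus(left: str, right: str) -> bool:
    """Return True if `left` ends in a vowel letter and `right` starts with one."""
    left, right = left.lower(), right.lower()
    return bool(left) and bool(right) and left[-1] in VOWELS and right[0] in VOWELS


def hiatus_sites(sentence: str) -> list[tuple[str, str]]:
    """List adjacent word pairs of `sentence` that form a CV VC hiatus context."""
    words = sentence.rstrip(".!?").split()
    return [(a, b) for a, b in zip(words, words[1:]) if has_hiatus(a, b)]
```

Applied to the non-joined examples above, the sketch reports one site per sentence; the already joined variants contain no remaining site.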
Romanian atonal pronouns: Accusative
Number  Person  Type       Gender  Syllabic   Non-syllabic (onset)   Non-syllabic (coda)
Sg      1.      pers/refl  m/f     [mə] mă    [m] m-
        2.      pers/refl  m/f     [te] te    [te̯] te-
        3.      pers       m                  [l] l-                 [l] -l/îl
                           f       [o] o      [o̯] o-
                refl       m/f     [se] se    [s] s-, [se̯] se-
Pl      1.      pers/refl  m/f     [ne] ne    [ne̯] ne-
        2.      pers/refl  m/f     [və] vă    [v] v-
        3.      pers       m                  [i̯] i-                 [j] -i/îi
                           f       [le] le    [le̯] le-
                refl       m/f     [se] se    [s] s-, [se̯] se-
Romanian atonal pronouns: Dative
Number  Person  Type       Syllabic            Non-syllabic (onset)   Non-syllabic (coda)
Sg      1.      pers/refl  [mi] mi             [mi] mi-               [mj] -mi/îmi
        2.      pers/refl  [tsi] ți            [tsi] ți-              [tsj] -ți/îți
        3.      pers       [i] i               [i] i-                 [j] -i/îi
                refl       [ʃi] și             [ʃi] și-               [ʃj] -și/își
Pl      1.      pers/refl  [ni] ni, [ne] ne    [ne̯] ne-
        2.      pers/refl  [vi] vi, [və] vă    [v] v-
        3.      pers       [li] li, [le] le    [le̯] le-
                refl       [ʃi] și             [ʃi] și-               [ʃj] -și/își
Problems from a learner's perspective: Obligatory sandhi
atonal pronouns
M-am apucat de treabă. [I've started to work.]
*Mă am apucat de treabă.
elsewhere
într-un vis de vară [in a summer dream]
*între un vis de vară
Problems from a learner's perspective: Optional sandhi
atonal pronouns
M-apuc de treabă. [I start to work.]
Mă apuc de treabă.
elsewhere
O s-aduc cartea. [I'll bring the book.]
O să aduc cartea.
Problems from a learner's perspective: Hyphenated non-reduced (= syllabic) forms
as phonological hosts
Ți-l cumperi. [You buy it (for yourself).]
Să nu mi-ți pierzi timpul cu așa ceva! [Don't lose your time with such things.]
in postverbal position
Du-te acasă! [Go home!]
as phonological hosts in postverbal position
Cumpără-ți-l! [Buy it!]
Problems from a learner's perspective: Understanding: What kind of hyphen is it?
hyphen as unreliable indicator for reduced forms
Ți-ai cumpărat cartea. [You've bought the book!]
Ți-l cumperi. [You buy it.]
Ți-o cumperi. [You buy it.]
Să-ți cumperi cartea! [Buy the book!]
Du-te acasă! [Go home!]
Du-te-acasă! [Go home!]
Cumpără-ți-l! [Buy it (for yourself)!]
Cumpără-l! [Buy it!] (hyphen: non-syllabic AND post-verbal)
Cumpără-ți cartea! [Buy the book!] (hyphen: non-syllabic AND post-verbal)
Legend: gray = syllabic atonal pronoun; black = reduced atonal pronoun
Hyphen types: non-syllabic; post-verbal; non-syllabic AND post-verbal
Problems from a learner's perspective: Understanding: Which phonological form is it?
grapheme-phoneme ambiguity
Cumpără-ți-l! [Buy it!] / Ți-l cumperi. [You buy it!] [tsi]
Cumpără-ți cartea! [Buy the book!] / Îți cumperi cartea. [You buy the book.] [tsj]
Ți-ai cumpărat cartea. [You've bought the book.] [tsi]
Problems from a learner's perspective: Production: To hyphenate or not to hyphenate?
obligatory or optional hyphenation?
if optional, reduced or non-reduced form?
The choice issue: Well-balanced mixture of joined vs. non-joined forms
defining well-balancedness?
domain of well-balancedness: clause, sentence, paragraph, text?
counting only optional or both obligatory and optional instances?
alignment, parallelism?
Trebuie s-o faci și s-o dregi! [You have to do it and to mend it!]
Trebuie să o faci și să o dregi!
Trebuie să o faci și s-o dregi!
Trebuie s-o faci și să o dregi!
Different rhythm! A matter of style?
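The four variants above are the full set of choices for two optional sandhi sites, so the number of candidate realizations doubles with every additional site. A minimal sketch of that enumeration follows; the function name `realizations` and the template notation with `{}` slots are assumptions made for illustration only.

```python
from itertools import product


def realizations(template: str, sites: list[tuple[str, str]]) -> list[str]:
    """Fill each `{}` slot in `template` with every (non-joined, joined) choice."""
    return [template.format(*choice) for choice in product(*sites)]


# Two optional sites, each with a non-joined and a joined variant.
sites = [("să o", "s-o"), ("să o", "s-o")]
variants = realizations("Trebuie {} faci și {} dregi!", sites)
```

With two sites the list has four entries; a corpus annotation that records all optional variants for a context needs exactly this kind of expansion.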
The choice issue: Speech rate
Alexandra Popescu (2003) Morphophonologische Phänomene des
Rumänischen, PhD thesis, University of Düsseldorf, 2003
Optimality-Theoretic model:
reduced forms always win in faster speech rate
non-reduced forms always win in normal speech rate
Popescu (2003), Ex. (21), p. 160
The choice issue: Speech rate (cont.)
Alexandra Popescu (2003) Morphophonologische Phänomene des
Rumänischen, PhD thesis, University of Düsseldorf, 2003
speech rate is relative: no experimental setup
speech rate vs. number of syllables per time unit?
what about rhythm?
Emil Boc, du-te-acasă / Și apucă-te de coasă!
Emil Boc, go home/ And start scything!
The choice issue: Speech rate (cont.)
Alexandra Popescu (2003) Morphophonologische Phänomene des
Rumänischen, PhD thesis, University of Düsseldorf, 2003: (p. 179)
the OT model fails to account for all presented data
"It is unclear, however, why the candidate with the full vowel [ɨ] can win in normal
speech alongside candidate c. with the full vowel [i], although under the ranking
so far it is worse than the candidate with the full vowel." (translated from German)
The choice issue: Mode, register, style
Maria Iliescu (1975) Pentru o sistematizare a predării pronumelui personal
neaccentuat românesc (la studenții străini), in Limba Română 24, 1975
"În limba literară îngrijită se preferă pronume nelegate"
[in well-groomed literary style, non-bound pronouns are preferred]
"În stilul beletristic formele enclitice apar mai des"
[in beletristic style, enclitic forms occur more often]
fuzzy formulations: "are preferred", "occur more often"
how to define well-groomedness?
how many styles to define?
Usage-based approach: Corpus-driven solution
Observation: Some optional reduced atonal pronouns occur far more often than their non-reduced counterparts.
Jürgen Bredemeier (1976) Strukturbeschränkungen im Rumänischen. Studien zur
Syntax der prä- und postverbalen Pronomina, TBL Verlag Gunter Narr, 1976
Why?
How often? Look into relevant data!
Web as Corpus?
"Du-te-acasă!"
No fine-tuning possible!
Web as Corpus
offering a wide range of usage-based instances of everything
improvements (e.g., semantic web) are not (yet) useful for the current research issue
even simple but relevant distinctions are not possible without a massive data cleanup (diacritics, hyphens, misspellings, sloppy formulations, etc.)
Far too expensive at the moment!
Use existing corpora: Odense Grammatically Annotated Corpus of Romanian Business
Revista pe care ați realizat-o mi-a atras atenția [The magazine you produced caught my attention]
annotation and preprocessing changed the original string
missing atonal pronouns and auxiliaries, dangling hyphens
Not of much use!
What to do?
Build a special corpus!
General ideas
account for specific phenomena (encountered instances plus all optional variants)
provide additional necessary linguistic annotation (part-of-speech)
add accessible, relevant information (spoken, written, genre, etc.)
enable unification of specific annotated data with other layers (syntax, semantics, information structure)
keep the original string in place
use copyright-free data as much as possible
Experimental data set: Europarl Corpus
Romanian part of the Europarl Corpus
parallel corpus extracted from the proceedings of the European
Parliament
original purpose: Statistical Machine Translation (SMT)
freely available
compared to Google data, much cleaner
yet, still a huge amount of cleanup work
Data evaluation
size after the first cleanup and sentence splitting with the default tools:
224417 inc_europarl_ro.sent.txt
size after removing foreign sentences and correcting diacritics:
223622 europarl_ro.sent.xml
pseudo-sentences, formulaic sentences (parliament meetings)
Usable data for the research question
search for lines with at least one hyphen: 56155
unique instances: 53897
Filter irrelevant hyphen occurrences!
Search for the non-reduced pronominal forms!
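The selection steps above (hyphenated lines, unique instances, filtering irrelevant hyphens) can be sketched as follows. This is an illustrative sketch only: the sample sentences are made up, the function names are hypothetical, and the word-internal pattern is a crude assumed heuristic for separating clitic-style hyphens from spaced dashes, not the filter actually used for the corpus.

```python
import re

# A hyphen with a word character on both sides, e.g. "să-l" (assumed heuristic).
WORD_INTERNAL_HYPHEN = re.compile(r"\w-\w")


def lines_with_hyphen(lines):
    """Keep lines that contain at least one hyphen."""
    return [line for line in lines if "-" in line]


def unique_in_order(lines):
    """Drop repeated lines while keeping first-occurrence order."""
    return list(dict.fromkeys(lines))


def candidate_lines(lines):
    """Keep lines whose hyphen sits inside a word, dropping spaced dashes."""
    return [line for line in lines if WORD_INTERNAL_HYPHEN.search(line)]


sample = [
    "Te rog să-l faci!",
    "Te rog să-l faci!",
    "Raportul - adoptat ieri - este clar.",
    "Vă mulțumesc.",
]
hyphenated = lines_with_hyphen(sample)
unique = unique_in_order(hyphenated)
candidates = candidate_lines(unique)
```

A real filter would also have to exclude hyphenated compounds and, as the slide notes, a separate search is needed for the non-reduced forms, which contain no hyphen at all.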
Language knowledge: The small universe of atonal pronouns in Romanian
local phenomenon
relatively small number of forms
modelling any possible combination (even non-grammatical ones, aka mal-rules in error modelling)
exhaustive modelling
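Because the inventory is small, every combination can be listed and labelled, including the non-licensed ones. The sketch below is a toy illustration under stated assumptions: it covers only the 1st person plural dative forms `ni` and `ne`, optionally followed by accusative `le`, and it derives the licensing test from the rule on rightmost open syllables with nucleus [i] given later in the talk. The names `is_licensed` and `enumerate_sequences` and the label strings are hypothetical.

```python
from itertools import product

DATIVE_1PL = ["ni", "ne"]   # the two syllabic variants of the 1st person plural dative
FOLLOWING = [None, "le"]    # no further pronoun, or a following accusative "le"


def is_licensed(dative, following):
    """Toy rule: `ne` only when rightmost, `ni` only when another pronoun follows."""
    rightmost = following is None
    return dative == "ne" if rightmost else dative == "ni"


def enumerate_sequences():
    """Label every combination as licensed or as a mal-rule entry."""
    table = {}
    for dative, following in product(DATIVE_1PL, FOLLOWING):
        sequence = dative if following is None else f"{dative} {following}"
        label = "licensed" if is_licensed(dative, following) else "mal-rule"
        table[sequence] = label
    return table


sequence_table = enumerate_sequences()
```

Keeping the non-licensed combinations in the table, rather than discarding them, is what later allows specific feedback on learner errors.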
Language knowledge: Example: 1. pers, Sg, Acc
Annotation run: Current state
pattern + context-testing functions current annotation state
add all other optional forms licensed by the given context
Annotation run: Intended state
Part-of-speech information needed!
Part-of-speech annotation: Current state
whole corpus POS-tagged using
http://www.racai.ro/webservices/TextProcessing.aspx
Towards the final format: Steps to do
transform the MULTEXT POS annotation into an XML format
unify the annotation of optional sandhi with the pos annotation
... and then?
... starts the real linguistic fun! Using the whole potential of the linguistic annotation
Is there a significant difference between the occurrences of să îmi vs. să-mi and, e.g., că îți vs. că-ți?
taking more context into account (item before subjunction + item after the atonal pronoun) and counting the syllables of the extended context?
What about the rhythm changes in the context (cf. the huge amount and variation of reduced forms in Romanian poetry)?
include stylometric measurements
What triggers the choice of a specific surface form?
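The first question above is a standard contingency-table question. As a minimal sketch, assuming made-up counts (the numbers below are hypothetical and not taken from the corpus), a Pearson chi-square statistic for a 2x2 table can be computed in plain Python and compared against the critical value 3.841 for one degree of freedom at the 0.05 level.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column needs at least one observation")
    return n * (a * d - b * c) ** 2 / denominator


# Critical value of the chi-square distribution, df = 1, alpha = 0.05.
CRITICAL_0_05_DF1 = 3.841

# Hypothetical counts, for illustration only.
counts = {"să îmi": 120, "să-mi": 880, "că îți": 60, "că-ți": 140}

statistic = chi_square_2x2(
    counts["să îmi"], counts["să-mi"], counts["că îți"], counts["că-ți"]
)
differs = statistic > CRITICAL_0_05_DF1
```

With real corpus counts, the same comparison would indicate whether the preference for the joined form depends on the subjunction; small expected counts would call for an exact test instead.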
Extending the linguistic playground: Annotating more (copyright-free) data
Romanian part of the JRC-Acquis Multilingual Parallel Corpus
DEX Dicționarul Explicativ al Limbii Române
Romanian Wikipedia
articles (elaborated, well-formulated text)
comments (informal, more personal)
Copyright-free data is shareable data!
Natural Language Generation vs. Language Learning
sharing the need to produce well-formed, situationally adequate natural language utterances
Why not share the knowledge as well?
Why not the resources, too?
Sharing data is not like sharing a slice of bread,
rather like Jesus' bread and fish miracle!
Machine vs. human: Transferability of constraint formulation from NLG to LL
Is the constraint formalization from NLG transferable to the LL domain?
Yes! Linearization and surface realization have to be applied to perceivable entities.
no room to generate partially empty strings
no room to linearize traces or empty categories
Example of constraint formulation in NLG: Obligatory sandhi in the sequence of atonal pronouns
Rule: The rightmost item in the atonal pronoun sequence cannot be an open syllable with nucleus [i].
Assuming the base form [ni] ni:
Is it the rightmost atonal pronoun in the sequence?
1. yes → change from [ni] ni to [ne] ne
   Is there on the left an item to obligatorily attach to?
   1.1 yes (e.g., [ne] ne [a] a dat)
       attach → [ne̯a] ne-a dat
   1.2 no (e.g., [ne] ne [dai̯] dai)
       done → [ne dai̯] ne dai
2. no (e.g., [ni] ni [le] le [dai̯] dai)
   done → [ni le dai̯] ni le dai
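The decision procedure above translates almost directly into code. The following is a minimal sketch, not the author's generator: the function name `realize_ni` and its two parameters are hypothetical, only the orthographic form is produced, and the attachment host is passed in explicitly. Following the slide's examples, the sketch assumes that the host is the auxiliary that comes after the pronoun.

```python
from typing import Optional


def realize_ni(next_pronoun: Optional[str], host: Optional[str]) -> str:
    """Surface form of the base form `ni` inside an atonal pronoun sequence.

    next_pronoun: the atonal pronoun following `ni`, or None if `ni` is rightmost.
    host: an item `ni` must obligatorily attach to (e.g. auxiliary "a"), or None.
    """
    if next_pronoun is not None:
        # Branch 2: not rightmost, the base form stays as it is.
        return f"ni {next_pronoun}"
    # Branch 1: rightmost, so the open syllable with nucleus [i] is replaced.
    form = "ne"
    if host is not None:
        # Branch 1.1: obligatory attachment, marked by a hyphen.
        return f"{form}-{host}"
    # Branch 1.2: nothing to attach to.
    return form


examples = [
    f"{realize_ni(None, 'a')} dat",
    f"{realize_ni(None, None)} dai",
    f"{realize_ni('le', None)} dai",
]
```

The three entries of `examples` correspond to the three outcomes listed above: ne-a dat, ne dai, ni le dai.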
Optional sandhi phenomena: Exploiting the specific language model
analyse the context
consult the specific language model
give hints to students about the most appropriate form to choose
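The three steps above can be sketched as a lookup in a frequency table. Everything in this sketch is an assumption made for illustration: the table `MODEL`, its keys, and its counts are hypothetical stand-ins for what the annotated corpus would provide, and `hint` is a made-up function name.

```python
from collections import Counter

# Hypothetical language model: (context word, pronoun) -> counts of surface forms.
# Real values would come from the annotated corpus.
MODEL = {
    ("să", "îl"): Counter({"să-l": 90, "să îl": 10}),
}


def hint(context: str, pronoun: str):
    """Return the most frequent surface form and its relative frequency, or None."""
    counts = MODEL.get((context, pronoun))
    if not counts:
        return None
    form, occurrences = counts.most_common(1)[0]
    return form, occurrences / sum(counts.values())
```

Returning the relative frequency together with the form lets an exercise phrase the hint as a preference rather than as a rule, which matches the optional nature of the phenomenon.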
Further possible applications: Exploiting specific language resources
design and implementation of different types of language learning exercises for training atonal pronouns
specific feedback on production error types, enabled by the mal-rule-like coding of non-licensed forms
enriching existing analysis tools (parsers) with specific information
Human vs. machine: NLG too much of a technique, too little of a science
Using NLG techniques for LL: rara avis
Karin Harbusch et al. (2009) Computing Accurate Grammatical Feedback in a Virtual
Writing Conference for German-Speaking Elementary-School Children: An Approach
Based on Natural Language Generation, CALICO Journal, 26(3), 2009
Using LL research insights for NLG
NLG too much of a Fiat!-domain: from the very beginning
NLG paying very little attention to surface phenomena such as
language variation or even orthography
modelling human language production: a real plus for NLG
Conclusions
motivating the need for special corpora for learning how to make decisions in cases of optional surface realization
reporting on the cumbersome process of building resources for special phenomena
stressing the need for sharing resources and insights between fields with similar goals
underlining the benefits of sharing resources between NLG and LL with respect to the realization of atonal pronouns in Romanian
Share resources!