building a corpus for learning how to produce atonal pronouns in the romanian clitic sequence

Upload: lunorip

Post on 03-Apr-2018


  • 7/28/2019 Building a corpus for learning how to produce atonal pronouns in the Romanian clitic sequence


    Building a corpus for learning how to produce atonal pronouns in the Romanian clitic sequence

    Ciprian-Virgil Gerstenberger

    Universitetet i Tromsø, Norge

    Learner Language, Learner Corpora Conference

    LLLC 2012, 06.10.2012, Oulu, Finland


    Outline

    Atonal pronouns: Why a special corpus?

    Language knowledge: How to build it?

    Language production: What are the benefits?


    General question

    How to deal with soft constraints in language production?

    free word order (e.g., in Finnish): information structure, style?

    in-situ vs. extraposed relative clauses (e.g., in German): clause weight, register?

    optional sandhi phenomena (e.g., in Romanian): genre, register, dialect, sociolect, idiolect?


    Specific question

    What triggers optional realizations of Romanian atonal pronouns?

    (1) a. Te rog să îl faci! [Please, do it!]
        b. Te rog să-l faci!

    (2) a. Știu că îi scrii emailuri. [I know that you write him/her emails.]
        b. Știu că-i scrii emailuri.

    (3) a. Hai să ne apucăm de treabă! [Let's start working!]
        b. Hai să ne-apucăm de treabă!

    ? ? ?


    (External) sandhi
    Joining

    Epenthesis in English: a car vs. an old car

    Elision in French: la fille [the girl] vs. l'église [the church]

    Elision in Romanian: Tu îl vezi. vs. Tu-l vezi. [You see him/it.]

    Sandhi can be marked graphically, but it doesn't have to be.

    Elision in Romanian is always graphically marked!


    Sandhi in Romanian
    General rule: avoid hiatus (CV VC)

    C-VC
        Mă apuc de treabă. [I start working.]
        M-apuc de treabă.

    CV-C
        Tu îl vezi. [You see him/it.]
        Tu-l vezi.

    CV-VC (the first vowel becomes non-syllabic)
        Te apuci de treabă. [You start working.]
        Te-apuci de treabă.
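
A minimal sketch of how the hiatus condition could be detected automatically, assuming plain orthographic tokens. The vowel inventory and the helper names `is_hiatus` and `sandhi_sites` are illustrative assumptions, not part of the original slides. A letter-based test only approximates the phonological condition (for instance, a word-final letter i is often non-syllabic).

```python
# Assumed orthographic vowel letters of Romanian (an approximation of the phonology).
VOWELS = set("aăâeiîou")

def is_hiatus(left: str, right: str) -> bool:
    """True if `left` ends in a vowel letter and `right` starts with one (CV VC)."""
    left, right = left.lower(), right.lower()
    return bool(left) and bool(right) and left[-1] in VOWELS and right[0] in VOWELS

def sandhi_sites(tokens):
    """Return index pairs (i, i + 1) of adjacent tokens that form a hiatus."""
    return [(i, i + 1) for i in range(len(tokens) - 1)
            if is_hiatus(tokens[i], tokens[i + 1])]

print(sandhi_sites(["Mă", "apuc", "de", "treabă"]))
print(sandhi_sites(["Tu", "îl", "vezi"]))
```

Both example sentences yield one candidate site, between the first and the second token. Whether sandhi at such a site is obligatory or optional is a separate question that this test does not answer.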


    Romanian atonal pronouns
    Accusative

    Number  Person  Type       Gender  Syllabic   Non-syllabic (onset)  Non-syllabic (coda)
    Sg      1.      pers/refl  m/f     [mə] mă    [m] m-
            2.      pers/refl  m/f     [te] te    [te̯] te-
            3.      pers       m                  [l] l-                [l] -l/îl
                               f       [o] o      [o̯] o-
                    refl       m/f     [se] se    [s] s-, [se̯] se-
    Pl      1.      pers/refl  m/f     [ne] ne    [ne̯] ne-
            2.      pers/refl  m/f     [və] vă    [v] v-
            3.      pers       m                  [i̯] i-                [j] -i/îi
                               f       [le] le    [le̯] le-
                    refl       m/f     [se] se    [s] s-, [se̯] se-


    Romanian atonal pronouns
    Dative

    Number  Person  Type       Syllabic           Non-syllabic (onset)  Non-syllabic (coda)
    Sg      1.      pers/refl  [mi] mi            [mi] mi-              [mj] -mi/îmi
            2.      pers/refl  [tsi] ți           [tsi] ți-             [tsj] -ți/îți
            3.      pers       [i] i              [i] i-                [j] -i/îi
                    refl       [ʃi] și            [ʃi] și-              [ʃj] -și/își
    Pl      1.      pers/refl  [ni] ni, [ne] ne   [ne̯] ne-
            2.      pers/refl  [vi] vi, [və] vă   [v] v-
            3.      pers       [li] li, [le] le   [le̯] le-
                    refl       [ʃi] și            [ʃi] și-              [ʃj] -și/își


    Problems from a learner's perspective
    Obligatory sandhi

    atonal pronouns
        M-am apucat de treabă. [I've started to work.]
        *Mă am apucat de treabă.

    elsewhere
        într-un vis de vară [in a summer dream]
        *între un vis de vară


    Problems from a learner's perspective
    Optional sandhi

    atonal pronouns
        M-apuc de treabă. [I start to work.]
        Mă apuc de treabă.

    elsewhere
        O s-aduc cartea. [I'll bring the book.]
        O să aduc cartea.


    Problems from a learner's perspective
    Hyphenated non-reduced (= syllabic) forms

    as phonological hosts
        Ți-l cumperi. [You buy it (for yourself).]
        Să nu mi-ți pierzi timpul cu așa ceva! [Don't lose your time with such things.]

    in postverbal position
        Du-te acasă! [Go home!]

    as phonological hosts in postverbal position
        Cumpără-ți-l! [Buy it!]


    Problems from a learner's perspective
    Understanding: What kind of hyphen is it?

    the hyphen is an unreliable indicator of reduced forms

        Ți-ai cumpărat cartea. [You've bought the book!]
        Ți-l cumperi. [You buy it.]
        Ți-o cumperi. [You buy it.]
        Să-ți cumperi cartea! [Buy the book!]
        Du-te acasă! [Go home!]
        Du-te-acasă! [Go home!]
        Cumpără-ți-l! [Buy it (for yourself)!]
        Cumpără--l! [Buy it!]
        Cumpără--ți cartea! [Buy the book!]

    gray = syllabic atonal pronoun; black = reduced atonal pronoun

    hyphen types: non-syllabic; post-verbal; -- = non-syllabic AND post-verbal


    Problems from a learner's perspective
    Understanding: Which phonological form is it?

    grapheme-phoneme ambiguity

        Cumpără-ți-l! [Buy it!] / Ți-l cumperi. [You buy it!]: [tsi]
        Cumpără-ți cartea! [Buy the book!] / Îți cumperi cartea. [You buy the book.]: [tsj]
        Ți-ai cumpărat cartea. [You've bought the book.]: [tsi]


    Problems from a learner's perspective
    Production: To hyphenate or not to hyphenate?

    obligatory or optional hyphenation?

    if optional, reduced or non-reduced form?


    The choice issue
    Well-balanced mixture of joined vs. non-joined forms

    defining well-balancedness?

    domain of well-balancedness: clause, sentence, paragraph, text?

    counting only optional or both obligatory and optional instances?

    alignment, parallelism?

        Trebuie s-o faci și s-o dregi! [You have to do it and to mend it!]
        Trebuie să o faci și să o dregi!
        Trebuie să o faci și s-o dregi!
        Trebuie s-o faci și să o dregi!

    Different rhythm! A matter of style?


    The choice issue
    Speech rate

    Alexandra Popescu (2003) Morphophonologische Phänomene des Rumänischen, PhD thesis, University of Düsseldorf, 2003

    Optimality-Theoretic model:

        reduced forms always win at a faster speech rate
        non-reduced forms always win at a normal speech rate

    Popescu (2003), Ex. (21), p. 160


    The choice issue
    Speech rate (cont.)

    Alexandra Popescu (2003) Morphophonologische Phänomene des Rumänischen, PhD thesis, University of Düsseldorf, 2003

    speech rate is relative: no experimental setup

    speech rate vs. number of syllables per time unit? what about rhythm?

        Emil Boc, du-te-acasă / Și apucă-te de coasă!
        [Emil Boc, go home / And start scything!]


    The choice issue
    Speech rate (cont.)

    Alexandra Popescu (2003) Morphophonologische Phänomene des Rumänischen, PhD thesis, University of Düsseldorf, 2003 (p. 179):

    the OT model fails to account for all presented data

    "It is unclear, however, why the candidate with the full vowel [ɨ] can win in normal speech alongside candidate c. with the full vowel [i], although according to the ranking so far it is worse than the candidate with the full vowel."


    The choice issue
    Mode, register, style

    Maria Iliescu (1975) Pentru o sistematizare a predării pronumelui personal neaccentuat românesc (la studenții străini), in Limba Română 24, 1975

    "În limba literară îngrijită se preferă pronume nelegate"
    [in well-groomed literary style, non-bound pronouns are preferred]

    "În stilul beletristic formele enclitice apar mai des"
    [in belletristic style, enclitic forms occur more often]

    fuzzy formulations: "are preferred", "occur more often"

    how to define well-groomedness?

    how many styles to define?


    Usage-based approach
    Corpus-driven solution

    Observation: Realizations of some optional reduced atonal pronouns occur far more often than their non-reduced counterparts.

    Jürgen Bredemeier (1976) Strukturbeschränkungen im Rumänischen. Studien zur Syntax der prä- und postverbalen Pronomina, TBL Verlag Gunter Narr, 1976

    Why?

    How often? Look into relevant data!


    Web as Corpus?

    "Du-te-acasă!"

    No fine-tuning possible!


    Web as Corpus

    offering a wide range of usage-based instances of everything

    improvements (e.g., semantic web) are not (yet) useful for the current research issue

    even simple but relevant distinctions are not possible without a massive data cleanup (diacritics, hyphens, misspellings, sloppy formulations, etc.)

    Far too expensive at the moment!


    Use existing corpora
    Odense Grammatically Annotated Corpus of Romanian Business

    Revista pe care ați realizat-o mi-a atras atenția [The magazine you produced caught my attention]

    annotation and preprocessing changed the original string

    missing atonal pronouns and auxiliaries, dangling hyphens

    Not of much use!


    What to do?

    Build a special corpus!


    General ideas

    account for specific phenomena (encountered instances plus all optional variants)

    provide additional necessary linguistic annotation (part-of-speech)

    add accessible, relevant information (spoken, written, genre, etc.)

    enable unification of specific annotated data with other layers (syntax, semantics, information structure)

    keep the original string in place

    use copyright-free data as much as possible


    Experimental data set
    Europarl Corpus

    Romanian part of the Europarl Corpus

    parallel corpus extracted from the proceedings of the European Parliament

    original purpose: Statistical Machine Translation (SMT)

    freely available

    compared to Google data, much cleaner

    yet, still a huge amount of cleanup work


    Data evaluation

    size after the first cleanup and sentence splitting with the default tools:
        224417 inc_europarl_ro.sent.txt

    size after removing foreign sentences and correcting diacritics:
        223622 europarl_ro.sent.xml

    pseudo-sentences, formulaic sentences (parliament meetings)
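
One common kind of diacritics correction in Romanian text is replacing the legacy cedilla letters (ş, ţ) with the comma-below letters (ș, ț). The sketch below shows only that step. Whether the cleanup described here included it is an assumption, and the function name `normalize_diacritics` is hypothetical.

```python
# Map legacy cedilla code points to the comma-below letters of standard Romanian.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
})

def normalize_diacritics(line: str) -> str:
    """Return `line` with cedilla letters replaced by comma-below letters."""
    return line.translate(CEDILLA_TO_COMMA)

print(normalize_diacritics("\u015etiu c\u0103-\u0163i scrie."))
```

Without such a normalization, a search for că-ți would miss every instance typed with the cedilla variant of ț.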


    Usable data for the research question

    search for lines with at least one hyphen: 56155

    unique instances: 53897

    Filter irrelevant hyphen occurrences!

    Search for the non-reduced pronominal forms!
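
The selection step above can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the regular expression keeps only hyphens with a word character on both sides (a first filter against dashes used as punctuation), and the name `hyphen_lines` and the sample sentences are invented.

```python
import re

# A hyphen with a word character on both sides, e.g. "să-l" or "ne-apucăm".
WORD_INTERNAL_HYPHEN = re.compile(r"\w-\w")

def hyphen_lines(lines):
    """Return (all matching lines, matching lines without duplicates)."""
    hits = [line.strip() for line in lines if WORD_INTERNAL_HYPHEN.search(line)]
    unique = list(dict.fromkeys(hits))  # removes duplicates, keeps first-seen order
    return hits, unique

sample = [
    "Te rog să-l faci!",
    "Te rog să-l faci!",
    "Un text - cu linie de pauză.",
    "Te rog să îl faci!",
]
hits, unique = hyphen_lines(sample)
print(len(hits), len(unique))
```

The last sample line shows why a second search is needed: the non-reduced variant contains no hyphen at all, so it has to be found through the pronominal forms themselves.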


    Language knowledge
    The small universe of atonal pronouns in Romanian

    local phenomenon

    relatively small number of forms

    modelling any possible combination (even non-grammatical ones, aka mal-rules in error modelling)

    exhaustive modelling
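
Because the inventory is small, exhaustive enumeration is cheap. A minimal sketch, assuming a simplified inventory with one written form per cell (dative syllabic forms and third-person accusative forms from the tables earlier in the talk). The feature labels and the name `all_sequences` are hypothetical, and no grammaticality judgement is attached to the generated candidates.

```python
from itertools import product

# Simplified inventories: one written form per person/number cell.
DATIVE = {"1sg": "mi", "2sg": "ți", "3sg": "i", "1pl": "ni", "2pl": "vi", "3pl": "li"}
ACCUSATIVE = {"3sg.m": "l", "3sg.f": "o", "3pl.m": "i", "3pl.f": "le"}

def all_sequences():
    """Yield every dative + accusative candidate, grammatical or not."""
    for (d_tag, d_form), (a_tag, a_form) in product(DATIVE.items(), ACCUSATIVE.items()):
        yield {"features": (d_tag, a_tag), "form": f"{d_form}-{a_form}"}

sequences = list(all_sequences())
print(len(sequences), sequences[0]["form"])
```

Candidates that turn out to be ungrammatical would not be discarded in this setup but kept and labelled, so that they can serve as mal-rules for recognising learner errors.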


    Language knowledge
    Example: 1pers, Sg, Acc


    Annotation run
    Current state

    pattern + context-testing functions → current annotation state

    add all other optional forms licensed by the given context
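
A minimal sketch of what a pattern plus a context-testing function might look like for a single case, the reduced first-person singular accusative form m-. All names (`REDUCED_M`, `is_optional`, `annotate`) are hypothetical, and the auxiliary list is an assumption used to approximate the obligatory cases (M-am apucat vs. *Mă am apucat).

```python
import re

VOWELS = "aăâeiîou"
# Assumed forms of the perfect auxiliary; before these, sandhi is treated as obligatory.
AUXILIARIES = {"am", "ai", "a", "ați", "au"}

# Pattern: reduced form "m-" attached to the following word.
REDUCED_M = re.compile(r"\b([Mm])-(\w+)")

def is_optional(next_word: str) -> bool:
    """Context test: vowel-initial host that is not a perfect auxiliary."""
    word = next_word.lower()
    return word[0] in VOWELS and word not in AUXILIARIES

def annotate(sentence: str):
    """Return (found form, other forms licensed by the context) per match."""
    results = []
    for match in REDUCED_M.finditer(sentence):
        variants = []
        if is_optional(match.group(2)):
            # Add the non-reduced counterpart, keeping the original capitalisation.
            variants.append(f"{match.group(1)}ă {match.group(2)}")
        results.append((match.group(0), variants))
    return results

print(annotate("M-apuc de treabă."))
print(annotate("M-am apucat de treabă."))
```

A purely string-based test cannot tell the auxiliary am from the homographic main verb, which is one reason why part-of-speech information is needed on top of the patterns.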


    Annotation run
    Intended state

    Part-of-speech information needed!


    Part-of-Speech annotation
    Current state

    whole corpus POS-tagged using http://www.racai.ro/webservices/TextProcessing.aspx


    Towards the final format
    Steps to do

    transform the MULTEXT POS annotation into an XML format

    unify the annotation of optional sandhi with the pos annotation

    ... and then?


    ... starts the real linguistic fun!
    Using the whole potential of the linguistic annotation

    Is there a significant difference between the occurrences of să îmi vs. să-mi and, e.g., că îți vs. că-ți?

    taking more context into account (item before the subjunction + item after the atonal pronoun) and counting the syllables of the extended context?

    What about the rhythm changes in the context (cf. the huge amount and variation of reduced forms in Romanian poetry)?

    include stylometric measurements

    What triggers the choice of a specific surface form?
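
The first question above could be approached with a standard test on a 2x2 contingency table. The sketch below computes Pearson's chi-square statistic without continuity correction. The counts are invented purely for illustration, and the choice of this particular test is an assumption, not something proposed in the talk.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("table has an empty row or column")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

# Invented counts, for illustration only:
#                non-joined   joined
# să + 1sg dat       30         70
# că + 2sg dat       50         50
statistic = chi_square_2x2(30, 70, 50, 50)

# Critical value of the chi-square distribution, 1 degree of freedom, alpha = 0.05.
CRITICAL_VALUE = 3.841
print(round(statistic, 2), statistic > CRITICAL_VALUE)
```

With counts taken from the annotated corpus instead of the invented ones, a statistic above the critical value would indicate that the preference for the joined form differs between the two contexts.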


    Extending the linguistic playground
    Annotating more (copyright-free) data

    Romanian part of the JRC-Acquis Multilingual Parallel Corpus

    DEX Dicționarul Explicativ al Limbii Române

    Romanian Wikipedia

        articles (elaborated, well-formulated text)
        comments (informal, more personal)

    Copyright-free data is shareable data!


    Natural Language Generation vs. Language Learning

    sharing the need to produce well-formed, situationally adequate natural language utterances

    Why not share the knowledge as well?

    Why not the resources, too?

    Sharing data is not like sharing a slice of bread; it is rather like Jesus' miracle of the loaves and fishes!

    Machine vs. human

    Transferability of constraint formulation from NLG to LL

    Is the constraint formalization from NLG transferable to the LL domain?

    Yes! Linearization and surface realization have to be applied to perceivable entities.

    no room to generate partially empty strings

    no room to linearize traces or empty categories

    Example of constraint formulation in NLG

    Obligatory sandhi in the sequence of atonal pronouns

    Rule: The rightmost item in the atonal pronoun sequence cannot be an open syllable with nucleus [i].

    Assuming the base form [ni] ni: Is it the rightmost atonal pronoun in the sequence?

    1. yes → change from [ni] ni to [ne] ne. Is there on the left an item to obligatorily attach to?

    1.1 yes (e.g., [ne] ne [a] a dat) → attach: [ne‿a] ne-a dat

    1.2 no (e.g., [ne] ne [dai̯] dai) → done: [ne dai] ne dai

    2. no (e.g., [ni] ni [le] le [dai̯] dai) → done: [ni le dai] ni le dai
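The decision procedure above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in the talk: the names `LOWERED_FORM`, `OBLIGATORY_HOSTS` and `realize` are hypothetical, the lookup tables contain only the cases shown on the slide, and the attachment step is applied to whichever pronoun ends up rightmost, which goes slightly beyond the single example given.

```python
# Hypothetical lookup tables covering only the slide's examples.
LOWERED_FORM = {"ni": "ne"}   # rightmost open syllable with nucleus [i] -> licensed form
OBLIGATORY_HOSTS = {"a"}      # assumed set of items that require attachment


def realize(pronouns, following):
    """Linearize atonal pronoun base forms followed by the remaining items."""
    if not pronouns:
        return " ".join(following)
    *inner, last = pronouns
    # Step 1: the rightmost pronoun must not surface as an open
    # syllable with nucleus [i]; non-rightmost pronouns stay as they are.
    last = LOWERED_FORM.get(last, last)
    # Steps 1.1 / 1.2: attach with a hyphen only if the next item
    # obligatorily requires it.
    if following and following[0] in OBLIGATORY_HOSTS:
        head, rest = f"{last}-{following[0]}", following[1:]
    else:
        head, rest = last, following
    return " ".join([*inner, head, *rest])


print(realize(["ni"], ["a", "dat"]))   # case 1.1
print(realize(["ni"], ["dai"]))        # case 1.2
print(realize(["ni", "le"], ["dai"]))  # case 2
```

Keeping the rule as data (two small tables) rather than as nested conditions makes it easier to reuse the same constraint for generation and for checking learner input.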

    Optional sandhi phenomena

    Exploiting the specific language model

    analyse the context

    consult the specific language model

    give hints to students about the most appropriate form to choose
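The three steps above could be prototyped with a simple frequency model. The sketch below assumes that a "context" is a pair (item before the pronoun, item after the pronoun) and that the language model is nothing more than counts of observed surface forms per context; both are simplifying assumptions, and `build_model`, `hint` and the sample observations are hypothetical.

```python
from collections import Counter, defaultdict


def build_model(observations):
    """Count surface forms per context from (context, surface_form) pairs."""
    model = defaultdict(Counter)
    for context, form in observations:
        model[context][form] += 1
    return model


def hint(model, context):
    """Return the surface forms seen in this context, ranked by relative frequency."""
    counts = model.get(context)
    if not counts:
        return []  # unseen context: give no hint rather than a guess
    total = sum(counts.values())
    return [(form, n / total) for form, n in counts.most_common()]


# Invented observations, for illustration only.
observations = [(("să", "dai"), "să-mi")] * 3 + [(("să", "dai"), "să îmi")]
model = build_model(observations)
print(hint(model, ("să", "dai")))
```

A real model would need richer context features (syllable counts, rhythm, style), but the interface stays the same: context in, ranked candidate forms out.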

    Further possible applications

    Exploiting specific language resources

    design and implementation of different types of language learning exercises for training atonal pronouns

    specific feedback on production error types, made possible by mal-rule-like coding of non-licensed forms

    enriching existing analysis tools (parsers) with specific information
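Mal-rule-like coding can be pictured as a table that pairs each non-licensed form with the licensed form and an explanation. The sketch below is a hypothetical illustration: `MAL_RULES` and `feedback` are invented names, and the two entries are derived only from the obligatory sandhi rule for ni/ne presented in this talk.

```python
# Hypothetical mal-rule table: (non-licensed form, licensed form, explanation).
MAL_RULES = [
    ("ni dai", "ne dai",
     "The rightmost atonal pronoun cannot be an open syllable with nucleus [i]."),
    ("ne a dat", "ne-a dat",
     "The pronoun must attach to the following item here."),
]


def feedback(learner_input):
    """Return error-specific feedback, or None if no mal-rule matches."""
    # Normalize case and whitespace before comparing.
    normalized = " ".join(learner_input.lower().split())
    for wrong, right, explanation in MAL_RULES:
        if normalized == wrong:
            return f"{explanation} Expected: {right}"
    return None


print(feedback("Ni dai"))
print(feedback("ne dai"))
```

Because each error type has its own entry, the feedback can name the violated constraint instead of merely rejecting the answer.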

    Human vs. machine

    NLG: too much of a technique, too little of a science

    Using NLG techniques for LL: rara avis

    Karin Harbusch et al. (2009). Computing Accurate Grammatical Feedback in a Virtual Writing Conference for German-Speaking Elementary-School Children: An Approach Based on Natural Language Generation. CALICO Journal, 26(3).

    Using LL research insights for NLG

    NLG: too much of a "Fiat!" domain from the very beginning

    NLG paying very little attention to surface phenomena such as language variation or even orthography

    modelling human language production: a real plus for NLG

    Conclusions

    motivating the need for special corpora for learning how to make decisions in cases of optional surface realization

    reporting on the cumbersome process of building resources for special phenomena

    stressing the need for sharing resources and insights between fields with similar goals

    underlining the benefits of sharing resources between NLG and LL with respect to the realization of atonal pronouns in Romanian

    Share resources!
