fieldwork and grammaticography in a digital world · 2019-04-15 · corpus building/extension using...
Fieldwork and Grammaticography in a Digital World
Joshua Wilbur
Freiburg Research Group in Saami Studies • Universität Freiburg
Descriptive Grammars and Typology • University of Helsinki, 28 March 2019

Overview
• background
• fieldwork
• grammaticography
• other advances
• outlook

BACKGROUND (aka: contextualization)

Pite Saami
• Uralic > Finno-Ugric > Saamic
• spoken by ~40 individuals from Arjeplog/Árjepluovve in Swedish Lapland
• aka: Arjeplog-Saami, bidumsámegiella
• nearly all speakers are at least 50 years old
• all speakers are bilingual (Pite Saami and Swedish/Arjeplogsmål)
• no official orthography (yet...), but a working standard
• no media
• Swedish dominates everyday life
• hardly being passed on to younger generations

Pite Saami
larger linguistic studies:
• Halász 1893 (in Hungarian)
• Lagercrantz 1926 (in German)
• Ruong 1943 (in German)
• Lehtiranta 1992 (in Finnish)
• Wilbur 2014 (in English)
• Sjaggo 2015 (in Swedish)
other materials:
• extensive collection of heritage materials (ISOF, Uppsala)
• dictionary (Pite Saami -> Swedish/English) and proposed orthographic rules (2016)
• online lexical database
• online orthographic rules (including spell checker (in beta))
• smartphone app in the works
recent linguistics projects:
• Documentation (2008-2015; materials archived at ELAR and TLA)
• Lexicography (2016)
• Syntactic structures (2016-present)

Pite Saami
-> each fieldwork situation is unique!
• significant aspects of mine include:
– an accessible modern technological infrastructure on-site
– a previous history of linguistics work
– extensive language technology tools for closely-related languages
– messy but extant orthographic “tradition” when I started

FIELDWORK in a digital world

tools for fieldwork
• in the old days: notebook and pencil
• nowadays:
– recording equipment
– laptop
– digital backup capacity (even in the cloud)
– transcription software (ELAN)
– mobile phones
– social media (e.g. for staying in contact, or as a data source)
– grammaticography software (e.g. FLEx for interlinearization)
– language technology…

data collection and fieldwork
• modern, affordable digital recording technologies (especially video) allow fieldworkers to capture not just language, but the entire human event
– more complete documentation, potentially useful beyond linguistics*
why not use:
• body cameras
• drones
• surround-sound microphones
• 360° cameras
• 3-D cameras
...
*cf. Rießler & Wilbur 2017

(re-)collecting old data (heritage harvesting)
• OCR (optical character recognition)
• embedded text (more than just scanning!)
• can be exported (e.g. to ELAN)
• can be part of a corpus*
*cf. Partanen & Rießler 2019

(re-)collecting old data (heritage harvesting)
• HTR (handwritten text recognition)
• embedded text (more than just scanning!)
• can be exported (e.g. to ELAN)
• can be part of a corpus*
• much more complex than OCR, so it currently requires much more training data before it is useful
*cf. Transkribus project (Kahle et al. 2017); also Blokland et al. 2019 for a brief discussion

GRAMMATICOGRAPHY in a digital world

brief history of grammaticography
• 1/3 of the Boasian trilogy…
• Payne 1997, Mosel 2006, Aikhenvald 2015, etc.
• Nordhoff 2008, “Electronic Reference Grammars for Typology: Challenges and Solutions”
• implemented grammars (incorporation in corpus and computational linguistics)

digital tools for grammaticography
• Toolbox, FLEx
• good for concatenative morphology
– play, play-s, play-ed, play-er, play-er-s
• not so good for non-linear morphology
– sing, sing-s, sang, sung
What do you do when non-linear morphology is the default in your language?
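
The contrast between concatenative and non-linear morphology can be illustrated with a small sketch. This is not how Toolbox or FLEx work internally; it is a minimal stem-plus-suffix segmenter, and the function name `segment` and the suffix set are assumptions made for illustration only.

```python
from typing import Optional

# Hypothetical suffix inventory covering the English examples on the slide.
# "ers" is treated as one unit here, although the slide segments it as -er-s.
SUFFIXES = {"", "s", "ed", "er", "ers"}


def segment(form: str, stem: str) -> Optional[str]:
    """Return 'stem-suffix' if form is the stem plus a known suffix, else None."""
    if form.startswith(stem) and form[len(stem):] in SUFFIXES:
        suffix = form[len(stem):]
        return f"{stem}-{suffix}" if suffix else stem
    return None


# Concatenative forms are segmented without trouble.
for form in ["play", "plays", "played", "player", "players"]:
    print(form, "->", segment(form, "play"))

# Stem-internal changes cannot be expressed as stem + suffix.
for form in ["sing", "sings", "sang", "sung"]:
    print(form, "->", segment(form, "sing"))
```

A purely linear segmenter has nothing to say about sang and sung, which is the situation the slide describes for languages where such alternations are the norm.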

digital tools for grammaticography
• Toolbox, FLEx
• other, digital approaches...

juällge ‘foot/leg’
        SG          PL
NOM     juällge     juolge
GEN     juolge      julgij
ACC     juolgev     julgijt
ILL     juallgáj    julgijda
INESS   juolgen     julgijn
ELAT    juolgest    julgijst
COM     julgijna    julgij
ABESS   juolgedak   juolgedaga
ESS     juallgen

4 stem allomorphs: juällg-, juolg-, juallg-, julg-
What do you do when non-linear morphology is the default in your language?
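
To make the four stem allomorphs concrete, the following sketch sorts the paradigm cells of juällge by the stem each form begins with. The forms and stems are copied from the slide; the cell labels such as `NOM.SG` and the helper `group_by_stem` are names chosen here for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Stem allomorphs and paradigm forms as given on the slide.
STEMS = ["juällg", "juolg", "juallg", "julg"]

PARADIGM = {
    "NOM.SG": "juällge", "NOM.PL": "juolge",
    "GEN.SG": "juolge", "GEN.PL": "julgij",
    "ACC.SG": "juolgev", "ACC.PL": "julgijt",
    "ILL.SG": "juallgáj", "ILL.PL": "julgijda",
    "INESS.SG": "juolgen", "INESS.PL": "julgijn",
    "ELAT.SG": "juolgest", "ELAT.PL": "julgijst",
    "COM.SG": "julgijna", "COM.PL": "julgij",
    "ABESS.SG": "juolgedak", "ABESS.PL": "juolgedaga",
    "ESS": "juallgen",
}


def group_by_stem(paradigm: Dict[str, str],
                  stems: List[str]) -> Dict[Optional[str], List[str]]:
    """Map each stem allomorph to the paradigm cells whose form starts with it."""
    groups: Dict[Optional[str], List[str]] = defaultdict(list)
    for cell, form in paradigm.items():
        stem = next((s for s in stems if form.startswith(s)), None)
        groups[stem].append(cell)
    return dict(groups)


groups = group_by_stem(PARADIGM, STEMS)
for stem, cells in groups.items():
    print(f"{stem}-: {', '.join(cells)}")
```

Every cell is matched by one of the four stems, but no single stem covers the paradigm, so a lexicon entry with one invariant stem cannot generate these forms by suffixation alone.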

implemented grammars
• aka “precise” grammars
– self-validating
• computer-processable
– but only borderline human-readable (at least from a traditionalist perspective)
– computational linguists, typically HPSG
• analyze linguistic structures
• implementation --> parse and tag a corpus
cf. new Language Science Press series “Implemented Grammars”
Siegel et al. 2016

implemented grammar (FST/CG) for Pite Saami
• Giellatekno infrastructure (the research group for Saami language technology at the University of Tromsø):
– FST – Finite State Transducer¹
– CG – Constraint Grammar²
• automatic annotations in ELAN…
¹ Beesley & Karttunen 2003; ² Didriksen 2007–2018, Karlsson 1990; Karlsson et al. 1995

implemented grammar (FST/CG) for Pite Saami
infrastructure:
Finite State Transducer (FST) → for analyzing word forms
Constraint Grammar (CG) → for removing ambiguities in FST output
formalism:
• lexc (lexicon, PoS, linear morphology)
• twolc (non-linear morphology)
• cg3 (syntax)
Uses orthographic standard!

implemented grammar (FST/CG) for Pite Saami
infrastructure:
Finite State Transducer (FST) → for analyzing word forms
formalism:
• lexc (lexicon, PoS, linear morphology)
• twolc (non-linear morphology)
Output analyses:

implemented grammar (FST/CG) for Pite Saami
infrastructure:
Finite State Transducer (FST) → for analyzing word forms
input: word form
output: word form, lemma+PoS+Morphology
juällge → juällge  juällge+N+Sg+Nom
julgijd → julgijd  juällge+N+Pl+Acc
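
The input/output behaviour of the analyzer can be mimicked with a lookup table. This is only a stand-in: a real FST is compiled from lexc and twolc sources and covers the whole lexicon, whereas the sketch below hard-codes the two analyses shown on the slide. The names `TOY_ANALYSES`, `Analysis`, `parse` and `analyze` are assumptions for illustration.

```python
from typing import Dict, List, NamedTuple


class Analysis(NamedTuple):
    lemma: str
    pos: str
    morph: List[str]


# Only the two analysis strings shown on the slide; not a real lexicon.
TOY_ANALYSES: Dict[str, List[str]] = {
    "juällge": ["juällge+N+Sg+Nom"],
    "julgijd": ["juällge+N+Pl+Acc"],
}


def parse(analysis: str) -> Analysis:
    """Split 'lemma+PoS+Tag+Tag' into its parts."""
    lemma, pos, *morph = analysis.split("+")
    return Analysis(lemma, pos, morph)


def analyze(word_form: str) -> List[Analysis]:
    """Return all analyses for a word form; unknown forms get an empty list."""
    return [parse(a) for a in TOY_ANALYSES.get(word_form, [])]


print(analyze("julgijd"))
```

Returning a list rather than a single analysis matters, because a word form can have several readings, which is where disambiguation comes in later.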

implemented grammar (FST/CG) for Pite Saami
infrastructure:
Finite State Transducer (FST) → for analyzing word forms
formalism:
• lexc (lexicon, PoS, linear morphology)
juällge  juällge+N+Sg+Nom
julgijd  juällge+N+Pl+Acc

implemented grammar (FST/CG) for Pite Saami
infrastructure:
Finite State Transducer (FST) → for analyzing word forms
formalism:
• twolc (non-linear morphology)
juällge  juällge+N+Sg+Nom
julgijd  juällge+N+Pl+Acc

implemented grammar (FST/CG) for Pite Saami
infrastructure:
Finite State Transducer (FST) → for generating word forms
(it works in both directions)
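
Bidirectionality can be pictured as inverting the analysis relation. The sketch below does this for a two-entry table holding the analyses shown on the slides; a real transducer encodes the relation once and can be applied in either direction without building a second table. The names `ANALYZE`, `GENERATE` and `invert` are hypothetical.

```python
from typing import Dict, List

# Toy analysis relation with the two examples from the slides.
ANALYZE: Dict[str, List[str]] = {
    "juällge": ["juällge+N+Sg+Nom"],
    "julgijd": ["juällge+N+Pl+Acc"],
}


def invert(table: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Swap keys and values of a one-to-many relation."""
    inverted: Dict[str, List[str]] = {}
    for key, values in table.items():
        for value in values:
            inverted.setdefault(value, []).append(key)
    return inverted


# Generation: from lemma plus tags to the surface word form.
GENERATE = invert(ANALYZE)
print(GENERATE["juällge+N+Pl+Acc"])
```

Inverting twice gives back the original table, which is the sense in which analysis and generation are the same relation read in opposite directions.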

implemented grammar (FST/CG) for Pite Saami
infrastructure:
Finite State Transducer (FST) → for analyzing word forms
BUT: how to deal with morphologically ambiguous word forms? (disambiguation)

implemented grammar (FST/CG) for Pite Saami
infrastructure:
Finite State Transducer (FST) → for analyzing word forms
Constraint Grammar (CG) → for removing ambiguities in FST output
formalism:
• lexc (lexicon, PoS, linear morphology)
• twolc (non-linear morphology)
• cg3 (syntax)
example: rules describing dependency between adpositions and genitive case

implemented grammar (FST/CG) for Pite Saami
infrastructure:
Finite State Transducer (FST) → for analyzing word forms
Constraint Grammar (CG) → for removing ambiguities in FST output
formalism:
• lexc (lexicon, PoS, linear morphology)
• twolc (non-linear morphology)
• cg3 (syntax)
output (analyses)

disambiguation example
FST output:
daj       DET+GEN+PL, DET+COM+PL, PRON+GEN+PL, PRON+COM+PL
tjurvij   antler+GEN+PL, antler+COM+PL
nala      onto
gähttjat  look+INF

disambiguation example
daj         tjurvij        nala  gähttjat
da-j        tjurvi-j       nala  gähttja-t
DET-GEN.PL  antler-GEN.PL  onto  look-INF
‘to look at those antlers’ [pit100405b.011]
CG syntactic disambiguation of the FST output:
• postpositions govern genitive NPs
SELECT Gen IF (*1C Po BARRIER NoNP);
• pronouns are not embedded in an NP
REMOVE Pron IF (*1C N BARRIER NPNH);
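
The effect of the two rules on the example can be traced with a simplified sketch. It is not CG-3: the `*1C` context is approximated as "some token further right has the tag in all of its readings", the `BARRIER` conditions are left out because the sets NoNP and NPNH are not defined on the slides, and the tag `Po` for nala is an assumption based on its gloss. All function names are hypothetical.

```python
from typing import List, Set

Cohort = List[Set[str]]  # the readings still possible for one token


def clear_match_to_right(cohorts: List[Cohort], i: int, tag: str) -> bool:
    """Approximate *1C: a later token carries `tag` in every one of its readings."""
    return any(all(tag in reading for reading in cohort)
               for cohort in cohorts[i + 1:])


def select(cohorts: List[Cohort], target: str, context: str) -> None:
    """Keep only readings with `target` where the context holds."""
    for i, readings in enumerate(cohorts):
        keep = [r for r in readings if target in r]
        if keep and len(keep) < len(readings) and clear_match_to_right(cohorts, i, context):
            cohorts[i] = keep


def remove(cohorts: List[Cohort], target: str, context: str) -> None:
    """Drop readings with `target` where the context holds; never drop the last one."""
    for i, readings in enumerate(cohorts):
        keep = [r for r in readings if target not in r]
        if keep and len(keep) < len(readings) and clear_match_to_right(cohorts, i, context):
            cohorts[i] = keep


words = ["daj", "tjurvij", "nala", "gähttjat"]
cohorts: List[Cohort] = [
    [{"Det", "Gen", "Pl"}, {"Det", "Com", "Pl"},
     {"Pron", "Gen", "Pl"}, {"Pron", "Com", "Pl"}],
    [{"N", "Gen", "Pl"}, {"N", "Com", "Pl"}],
    [{"Po"}],   # 'onto'; the tag Po is assumed
    [{"Inf"}],  # 'look', infinitive
]

select(cohorts, "Gen", "Po")   # SELECT Gen IF (*1C Po ...)
remove(cohorts, "Pron", "N")   # REMOVE Pron IF (*1C N ...)

for word, readings in zip(words, cohorts):
    print(word, ["+".join(sorted(r)) for r in readings])
```

After both rules, daj is left with the determiner genitive plural reading and tjurvij with the genitive plural reading, matching the gloss line of the example.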

implemented grammars
pros:
• entirely digital (easy copying, versioning, etc.)
• computer-processable
• can analyze AND generate (useful for practical tools, e.g. teaching apps)
• accuracy can be tested on real empirical data
• prose can be included (as <!-- comments -->)
• further use in other, digital applications...
cons:
• requires significant technical knowhow to learn and to implement
• not very human-readable, especially for non-specialists
– prose is only included as <!-- comments -->
– not ideal for standard average typologists
– not even close to ideal for most non-linguists
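
One simple way to test an implemented grammar against corpus data is to measure coverage, i.e. the share of tokens that receive at least one analysis. Coverage is only a proxy for accuracy, since it says nothing about whether the analyses are correct. The sketch below assumes an analyzer function that returns a list of analyses per token; the toy analyzer, the token list and the placeholder token `xyz` are invented for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# Toy analyzer standing in for the FST; holds only the two slide examples.
TOY_ANALYSES = {
    "juällge": ["juällge+N+Sg+Nom"],
    "julgijd": ["juällge+N+Pl+Acc"],
}


def toy_analyze(token: str) -> List[str]:
    return TOY_ANALYSES.get(token, [])


def coverage(tokens: Iterable[str],
             analyze: Callable[[str], List[str]]) -> Tuple[float, List[str]]:
    """Return the share of analyzed tokens and the list of unanalyzed ones."""
    token_list = list(tokens)
    if not token_list:
        return 0.0, []
    missed = [t for t in token_list if not analyze(t)]
    return 1 - len(missed) / len(token_list), missed


# "xyz" is a placeholder for a token the analyzer does not know.
rate, missed = coverage(["juällge", "julgijd", "xyz"], toy_analyze)
print(f"coverage: {rate:.2f}, unanalyzed: {missed}")
```

The list of unanalyzed tokens is often more useful than the number itself, because it points to gaps in the lexicon or in the rules.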

further use in other, digital applications...
• spell-checkers
• grammar-checkers
• teaching materials (e.g. apps)
…
• in documentary linguistics/endangered language descriptions
– automatic tokenization and annotation for corpora

further use in other, digital applications...
• tier structure in ELAN corpora (Freiburg-style), including annotations for:
– Lemma
– Part of speech
– Morphological categories
– Gloss

further use in other, digital applications...
• tier structure in ELAN corpora (Freiburg-style)
corpus building/extension using a script¹ that:
1. tokenizes the orthographic representation
2. sends each token through FST
3. removes ambiguities using CG
4. adds an English gloss
5. inserts this output into ELAN
benefits:
• saves time
• avoids inconsistencies
• can be updated automatically
¹ cf. Blokland et al. 2015; Gerstenberger et al. 2016; Gerstenberger et al. 2017
More details in talk at 11:30 in room 13 by Blokland, Partanen and Rießler
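
The five steps of the corpus-building script can be sketched as a small pipeline. This is not the script cited in the footnote: the FST and CG steps are replaced by toy stand-ins, the tier names are hypothetical labels for the annotation types listed on the previous slide, and writing an actual ELAN file is left out. The input is two isolated word forms from the slides, not a real utterance.

```python
import re
from typing import Dict, List

# Toy stand-ins for the FST output and the gloss lexicon (slide examples only).
TOY_FST = {
    "juällge": ["juällge+N+Sg+Nom"],
    "julgijd": ["juällge+N+Pl+Acc"],
}
TOY_GLOSSES = {"juällge": "foot/leg"}


def tokenize(text: str) -> List[str]:
    """Step 1: split the orthographic transcription into word tokens."""
    return re.findall(r"\w+", text.lower())


def analyze(token: str) -> List[str]:
    """Step 2: look up all readings (stand-in for the FST)."""
    return TOY_FST.get(token, [])


def disambiguate(readings: List[str]) -> List[str]:
    """Step 3: placeholder for CG; here it simply keeps the first reading."""
    return readings[:1]


def annotate(text: str) -> Dict[str, List[str]]:
    """Steps 4 and 5: add glosses and collect parallel tiers, one value per token."""
    tiers: Dict[str, List[str]] = {
        "word": [], "lemma": [], "pos": [], "morph": [], "gloss": [],
    }
    for token in tokenize(text):
        readings = disambiguate(analyze(token))
        if readings:
            lemma, pos, *morph = readings[0].split("+")
        else:
            lemma, pos, morph = "", "", []
        tiers["word"].append(token)
        tiers["lemma"].append(lemma)
        tiers["pos"].append(pos)
        tiers["morph"].append(".".join(morph))
        tiers["gloss"].append(TOY_GLOSSES.get(lemma, ""))
    return tiers


for tier, values in annotate("juällge, julgijd").items():
    print(f"{tier}: {values}")
```

Because every tier is derived from the transcription, rerunning the pipeline after the grammar or lexicon improves regenerates all annotations, which is the "can be updated automatically" benefit listed above.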

summary of digital grammaticography
requires:
• time to learn the formalism and set up the infrastructure
• understanding of grammatical structures
• string-based representation of language
main benefits:
• can be freely accessible online
• possibility to publish (hopefully getting academic recognition, cf. LangSci Press series)
• export data for use in other tools and disciplines
• spell-checker
• lexicographic materials (including smartphone apps)
• corpus building
• teaching materials
• increased status for the language
• more accessible to other disciplines, e.g. via text search
main drawbacks:
• not terribly human-accessible
• not taught traditionally in General Linguistics programs

OTHER ADVANCES in digital technologies

new language technologies
• automatic segmentation, e.g.:
– Autosegmenteerija 2.0
• Estonian autosegmentation forced-alignment tested on Pite Saami with surprisingly accurate results

new language technologies
• speech recognition, e.g.:
– Common Voice (moz://a), in community development for a number of smaller languages (e.g. Erzya, Komi-Zyrian, ...)

new language technologies
• automatic implemented grammar production
– LinGO Grammar Matrix
http://matrix.ling.washington.edu/customize/matrix.cgi

new speech technologies
• relevant technologies being developed continuously
• leading to a significant increase in efficiency for corpus building
-> better grammatical descriptions

OUTLOOK

outlook
• digital tools can provide powerful advantages for both fieldwork and (especially) grammaticography and documentation
• but: they require knowhow that goes beyond a typical linguist’s training
• I’m not saying this is for everyone, and realistically only parts will be relevant for a few – the point is: Digital technologies should be considered, too!

References
Aikhenvald, Alexandra Y. (2015). The art of grammar. A practical guide. Oxford: Oxford University Press.
Beesley, Kenneth R. & Lauri Karttunen (2003). Finite State Morphology. Stanford: Center for the Study of Language and Information.
Blokland, Rogier, Ciprian Gerstenberger, Marina Fedina, Niko Partanen, Michael Rießler, & Joshua Wilbur (2015). “Language documentation meets language technology”. In: First International Workshop on Computational Linguistics for Uralic Languages, 16th January, 2015, Tromsø, Norway. Proceedings of the workshop. Ed. by Tommi A. Pirinen, Francis M. Tyers, & Trond Trosterud. Septentrio Conference Series 2015:2. Tromsø: The University Library of Tromsø, pp. 8–18.
Blokland, Rogier, Niko Partanen, Michael Rießler, & Joshua Wilbur (2019). “Using computational approaches to integrate endangered language legacy data into documentation corpora. Past experiences and challenges ahead”. In: Proceedings of the Workshop on Computational Methods for Endangered Languages. Vol. 2. Honolulu: Association for Computational Linguistics, pp. 24–30.
Didriksen, Tino (2007–2018). Constraint grammar manual. 3rd version of the CG formalism variant. GrammarSoft ApS.
Gerstenberger, Ciprian, Niko Partanen, Michael Rießler, & Joshua Wilbur (2016). “Utilizing language technology in the documentation of endangered Uralic languages”. In: Northern European Journal of Language Technology 4, pp. 29–47.
Gerstenberger, Ciprian, Niko Partanen, Michael Rießler, & Joshua Wilbur (2017). “Instant annotations. Applying NLP methods to the annotation of spoken language documentation corpora”. In: International Workshop on Computational Linguistics for Uralic languages (IWCLUL 2017). Ed. by Tommi A. Pirinen, Michael Rießler, Trond Trosterud, & Francis M. Tyers. St. Petersburg: Association for Computational Linguistics, pp. 25–36.
Halász, Ignácz (1893). Népköltési gyűjtemény. A Pite Lappmark arjepluogi egyházkerületéből. Vol. 5. Svéd-Lapp Nyelv. Budapest: Magyar tudományos akadémia.
Kahle, Philip, Sebastian Colutto, Günter Hackl, & Günter Mühlberger (2017). “Transkribus. A Service Platform for Transcription, Recognition and Retrieval of Historical Documents”. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Vol. 04, pp. 19–24.
Karlsson, Fred (1990). “Constraint Grammar as a framework for parsing unrestricted text”. In: Proceedings of the 13th International Conference of Computational Linguistics. Ed. by Hans Karlgren. Vol. 3. Helsinki, pp. 168–173.
Karlsson, Fred, Atro Voutilainen, Juha Heikkilä, & Arto Anttila, eds. (1995). Constraint Grammar. A language-independent system for parsing unrestricted text. Natural Language Processing 4. Berlin: Mouton de Gruyter.
Lagercrantz, Eliel (1926). Sprachlehre des Westlappischen nach der Mundart von Arjeplog. Suomalais-Ugrilaisen Seuran Toimituksia 55. Helsinki: Suomalais-Ugrilainen Seura.
Lehtiranta, Juhani (1992). Arjeplogin saamen äänne- ja taivutusopin pääpiirteet. Suomalais-Ugrilaisen Seuran Toimituksia 212. Helsinki: Suomalais-Ugrilainen Seura.
Mosel, Ulrike (2006). “Grammaticography. The art and craft of writing grammars”. In: Catching language. The standing challenge of grammar writing. Ed. by Felix Ameka, Alan Dench, & Nicholas Evans. Trends in linguistics: studies and monographs 167. Berlin: Mouton de Gruyter, pp. 41–68.
Nordhoff, Sebastian (2008). “Electronic Reference Grammars for Typology: Challenges and Solutions”. In: Language Documentation and Conservation 2.2, pp. 296–324.
Partanen, Niko & Michael Rießler (2019). “An OCR system for the Unified Northern Alphabet”. In: International Workshop on Computational Linguistics for Uralic languages (IWCLUL 2019). Tartu: Association for Computational Linguistics, pp. 77–89.
Payne, Thomas E. (1997). Describing morphosyntax. A guide for field linguists. Cambridge: Cambridge University Press.
Rießler, Michael & Joshua Wilbur (2017). “Documenting endangered oral histories of the Arctic. A proposed symbiosis for language documentation and oral history research, illustrated by Saami and Komi examples”. In: Oral history meets linguistics. Ed. by Erich Kasten, Katja Roller, & Joshua Wilbur. Exhibitions and Symposia. Fürstenberg: Kulturstiftung Sibirien, pp. 31–64.
Ruong, Israel (1943). Lappische Verbalableitung dargestellt auf Grundlage des Pitelappischen. Uppsala: Almqvist och Wiksell.
Siegel, Melanie, Emily M. Bender, & Francis Bond (2016). Jacy. An Implemented Grammar of Japanese. CSLI Studies in Computational Linguistics. Stanford: CSLI Publications.
Sjaggo, Ann-Charlotte (2015). Pitesamisk grammatik. En jämförande studie med lulesamiska. Senter for samiske studiers skriftserie 20. Tromsø: Septentrio Academic Publishing.
Wilbur, Joshua (2014). A grammar of Pite Saami. Studies in Diversity Linguistics 5. Berlin: Language Science Press.
Wilbur, Joshua, ed. (2016). Pitesamisk ordbok samt stavningsregler. Samica 2. Freiburg: Albert-Ludwigs-Universität Freiburg.

Gijtov adnet!
gijtov        adnet
gijto-v       adne-t
thank-ACC.SG  have-PL.IMP

Joshua Wilbur
Pite Saami Syntax Project
Freiburg Research Group in Saami Studies
joshua.wilbur@skandinavistik.uni-freiburg.de
with special thanks to Michael Rießler, Niko Partanen, Rogier Blokland and Ciprian Gerstenberger for ideas, collaboration and inspiration