

Data available from LAOS:
Example: LAOS text #341, NLS Ms 34.4.3
Year:

The From Inglis To Scots (FITS) Project and Older Scots phonology
FITS (AHRC grant number AH/L004542/1) is a four-year project at the Angus McIntosh Centre for Historical Linguistics.
Focus: the sound/spelling history of early Scots as evidenced in root morphemes of Germanic origin.
Main RQ: What phonological facts underlie the diversity of spelling attested in Scots of the period 1380-1500?
Main output: a freely available, fully searchable online database which establishes, quantifies and visualises relations between units of sound and their spellings.
Possible user-defined questions:
• What sound(s) did the digraph <ch> represent in 15th-century Scots?
• When and where is theta-hardening ([θ] > [t]) attested in early Scots spellings?
• What are the reflexes of Old English /f/ in 15th-century Scots?

Historical corpus phonology: can it be done?
Variation in non-standardised alphabetic systems, such as those of pre-modern Europe, has long been exploited to reconstruct diachronic and diatopic alternants in phonological histories (e.g. McIntosh 1956; Laing & Lass 2003). However, electronic corpora for the history of language are rarely built with phonological questions in mind. Historical sound substance is mediated by a graphic system which makes it difficult to interpret the basic facts. The building of historical phonological corpora, while possible, requires a fair degree of preliminary analysis in order to establish the potential sound-spelling mappings of the language. While this may be a painstaking first step, it is surprising no bespoke tools have thus far been developed to assist in the process.

The original data set: A Linguistic Atlas of Older Scots (‘LAOS’, Williamson 2008)
• c. 1,250 ‘local documents’ (burgh records, charters, deeds, wills, etc.)
• c. 400,000 words
• Mostly localised and dated 1380-1500
• Diplomatically transcribed and lexico-grammatically tagged

Bibliography
Aitken, A. J. & Caroline Macafee. 2002. The Older Scots vowels: A history of the stressed vowels of Older Scots from the beginnings to the eighteenth century. Edinburgh: The Scottish Text Society.
Alcorn, Rhona, Benjamin Molineaux, Joanna Kopaczyk, Vasilios Karaiskos, Bettelou Los & Warren Maguire. 2017. ‘The emergence of Scots: Clues from Germanic *a reflexes’. In J. Cruickshank and R. McColl Millar (eds.) Before the Storm: Papers from the Forum for Research on the Languages of Scotland and Ulster triennial meeting, Ayr 2015, pp. 1-32. Aberdeen: FRLSU.
CoNE. 2013. A Corpus of Narrative Etymologies from Proto-Old English to Early Middle English and accompanying Corpus of Changes. Compiled by Roger Lass, Margaret Laing, Rhona Alcorn & Keith Williamson [http://www.lel.ed.ac.uk/ihd/CoNE/CoNE.html]. Edinburgh: Version 1.1, 2013-, © The University of Edinburgh.
Maguire, Warren, Rhona Alcorn, Benjamin Molineaux, Joanna Kopaczyk, Daisy Smith, Vasilios Karaiskos & Bettelou Los. Forthcoming. ‘Investigating evidence for final [v]-devoicing in Older Scots’.
McIntosh, Angus. 1956. ‘The analysis of Middle English texts’. Transactions of the Philological Society 55(1), pp. 26-55.
Molineaux, Benjamin, Joanna Kopaczyk, Warren Maguire, Rhona Alcorn, Vasilios Karaiskos & Bettelou Los. 2016. ‘Tracing L-vocalisation in early Scots’. Papers in Historical Phonology 1, pp. 187-217.
Molineaux, Benjamin, Joanna Kopaczyk, Warren Maguire, Rhona Alcorn, Vasilios Karaiskos & Bettelou Los. Forthcoming. ‘An emergent 15c Scots spelling norm: contrastive voicing in dental fricatives’.
Johnston, Paul. 1997. ‘Older Scots phonology and its regional variation’. In Charles Jones (ed.) The Edinburgh history of the Scots language, pp. 47-111. Edinburgh: Edinburgh University Press.
Kopaczyk, Joanna, Benjamin Molineaux, Vasilios Karaiskos, Rhona Alcorn, Bettelou Los & Warren Maguire. 2018. ‘Towards a grapho-phonologically parsed corpus of medieval Scots: Database design and technical solutions’. Corpora 13(2).
Laing, Margaret & Roger Lass. 2003. ‘Tales of 1001 nists: The phonological implications of litteral substitution sets in some thirteenth-century South-West Midland texts’. English Language and Linguistics 7(2), pp. 257-278.
LAOS. 2008. A Linguistic Atlas of Older Scots, Phase 1: 1380-1500. Compiled by Keith Williamson. Retrieved from http://www.lel.ed.ac.uk/ihd/laos1/laos1.html. The University of Edinburgh.

Grapho-phonological parsing: Mapping spellings to sounds
We assume that our source materials were set down by scribes “capable of sophisticated and subtle linguistic analysis” (Laing & Lass 2003: 258), so we expect there to be a systematic connection (albeit not necessarily a one-to-one match) between orthographic choices and underlying sound systems. Each variant spelling of the root-morphemes in the LAOS corpus is broken up into a sequence of graphemic units, preserving their morphological/graphological context.

Each grapheme is then assigned a plausible sound value by triangulating on a number of factors (see Kopaczyk et al. 2018 for details).
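The segmentation step can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not the FITS implementation: the multigraph inventory (`MULTIGRAPHS`), the greedy longest-match strategy and the function names are all hypothetical, and the candidate sounds for <ch> are simply those listed in the poster's <ch> substitution set. Choosing among the candidates is the triangulation step, which is not modelled here.

```python
# Hypothetical inventory of multi-letter graphemic units (an assumption for
# this sketch, not the FITS inventory). Longer units are tried first.
MULTIGRAPHS = ("sch", "quh", "ch", "th")


def segment(spelling: str) -> list[str]:
    """Split a spelling into graphemic units by greedy longest match."""
    units = []
    i = 0
    while i < len(spelling):
        for multigraph in sorted(MULTIGRAPHS, key=len, reverse=True):
            if spelling.startswith(multigraph, i):
                units.append(multigraph)
                i += len(multigraph)
                break
        else:
            # No multigraph starts here, so the single letter is its own unit.
            units.append(spelling[i])
            i += 1
    return units


# Candidate sound values for <ch>, copied from the poster's <ch> set.
CANDIDATES = {"ch": ["x", "ç", "θ", "ʧ", "k", "ð"]}


def candidate_sounds(unit: str) -> list[str]:
    """Return candidate sounds for a unit; units not in the table stay unresolved."""
    return CANDIDATES.get(unit, ["?"])


for form in ("dochter", "nicht", "strencht"):
    units = segment(form)
    print(form, units, [candidate_sounds(u) for u in units])
```

A greedy match is the simplest choice for a sketch; a real parser would also need the morphological and graphological context that the text says is preserved.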

The Medusa: Grapho-phonological sets visualisation*

Geographical pinpointing of attestations

Viewing attestations in context (texts)

Mapping sounds to sources: The diachronic dimension
Since a sizeable amount of well-described data is available for the Germanic sources of Older Scots (Old English, Norse and Middle Dutch), we can identify most of the likely historical antecedents of our target morphemes. This allows us to pinpoint particular diachronic trajectories for sounds and morphemes, helping us also improve the accuracy of our proposed sound values for the Older Scots period. We attempt to match each sound in the Older Scots layer to the attested form in the relevant (usually northern) dialects of Old English, as well as Norse and Middle Dutch. Where there is a mismatch between the source and the corpus form, we propose a change, drawing on existing literature and the general distribution of our attested variants.

What can you do with a grapho-phonologically parsed corpus?
The corpus allows for a fine-grained examination of the phonotactic and morphotactic distribution of individual sound-spelling pairings as well as variation in their values over time, space and text. It further allows users to:
• Select specific sound, orthographic and grammatical environments
• Define temporal and spatial domains for search results
• Trace etymological sources morpheme-by-morpheme and sound-by-sound
• Link etymological sources to corpus attestations via a Corpus of Changes
• Further investigate forms via links to the online Dictionary of the Scots Language and OED
• Access full source texts for context-checking and creating scribal profiles
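The first two search options in the list above amount to filtering attestations on sound, spelling, time and place. The Python sketch below shows one way such a filter could look. Everything in it is hypothetical: the `Attestation` record, the `search` function and its parameters are not the FITS database interface, and the years and locality labels are invented purely so the example runs (only the spellings and their sound values come from the poster's <ch> set).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attestation:
    """One sound-spelling pairing in one text (hypothetical record layout)."""
    form: str
    grapheme: str
    sound: str
    year: int
    locality: str


# Invented toy records: years and localities are placeholders, not LAOS data.
RECORDS = [
    Attestation("dochter", "ch", "x", 1420, "locality-A"),
    Attestation("nicht", "ch", "ç", 1455, "locality-B"),
    Attestation("chalys", "ch", "ʧ", 1490, "locality-A"),
    Attestation("aucht", "ch", "x", 1385, "locality-B"),
]


def search(records, grapheme=None, sound=None, years=None, locality=None):
    """Keep records matching every criterion that is given (None = no filter)."""
    results = []
    for record in records:
        if grapheme is not None and record.grapheme != grapheme:
            continue
        if sound is not None and record.sound != sound:
            continue
        if years is not None and not (years[0] <= record.year <= years[1]):
            continue
        if locality is not None and record.locality != locality:
            continue
        results.append(record)
    return results


# All <ch> spellings read as [x], then everything dated 1400-1460.
print([r.form for r in search(RECORDS, grapheme="ch", sound="x")])
print([r.form for r in search(RECORDS, years=(1400, 1460))])
```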

*Medusa II is under development and will allow mapping of source segments to attestations in our corpus.

We start with individual tokens and establish patterns across the entire data set. As more data is entered in the database, the initial assumptions are reevaluated. Gradually, we establish a network of relationships between the graphemic units and their plausible underlying sounds. We use a bespoke visualisation tool called Medusa.
• Sound substitution sets identify sounds associated with a specific grapheme.
• Graphemic substitution sets identify graphemes associated with a particular sound.
Many sounds and graphemes belong to more than one set.
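Both kinds of set are two views of the same list of grapheme-sound pairings, which a short Python sketch can make concrete. This is an illustration only and not how Medusa is built: the `PAIRINGS` list and `build_sets` function are hypothetical, the <ch> rows follow the poster's <ch> set, and the <th>, <y> and <þ> rows follow the spellings named in the findings and the figure caption.

```python
from collections import defaultdict

# Illustrative (grapheme, sound) pairings; a hypothetical sample, not the
# contents of the FITS database.
PAIRINGS = [
    ("ch", "x"), ("ch", "ç"), ("ch", "θ"), ("ch", "ʧ"), ("ch", "k"), ("ch", "ð"),
    ("th", "θ"), ("y", "ð"), ("þ", "ð"),
]


def build_sets(pairings):
    """Index the pairings in both directions.

    Returns (sound_sets, grapheme_sets):
      sound_sets[grapheme] -> sounds associated with that grapheme
      grapheme_sets[sound] -> graphemes associated with that sound
    """
    sound_sets = defaultdict(set)
    grapheme_sets = defaultdict(set)
    for grapheme, sound in pairings:
        sound_sets[grapheme].add(sound)
        grapheme_sets[sound].add(grapheme)
    return dict(sound_sets), dict(grapheme_sets)


sound_sets, grapheme_sets = build_sets(PAIRINGS)
print("sound substitution set for <ch>:", sorted(sound_sets["ch"]))
print("graphemic substitution set for [ð]:", sorted(grapheme_sets["ð"]))
```

In this sample <ch> appears in the graphemic sets for both [θ] and [ð], which is the overlap the last sentence above describes.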

A Corpus of Changes
Following the example in CoNE (Lass et al. 2013), we give a detailed description of each one of the changes invoked to map the proposed source form to the plausible FITS sound value.
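One way to think about such a link is as a chain of named changes that must carry the source sound to the attested sound. The Python sketch below checks that property. It is a minimal sketch under assumptions: the `Change` record and both functions are hypothetical, each change is reduced to a context-free before/after pair, and the two sample changes are named after processes mentioned on the poster (theta-hardening, final [v]-devoicing) rather than taken from the actual Corpus of Changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """A named sound change, simplified to a context-free before/after pair."""
    name: str
    before: str
    after: str


# Sample changes named after processes mentioned on the poster.
THETA_HARDENING = Change("theta-hardening", "θ", "t")
FINAL_V_DEVOICING = Change("final [v]-devoicing", "v", "f")


def apply_changes(source: str, changes: list[Change]) -> str:
    """Apply the changes in order, failing if one does not fit its input."""
    sound = source
    for change in changes:
        if sound != change.before:
            raise ValueError(f"{change.name} cannot apply to [{sound}]")
        sound = change.after
    return sound


def link_is_consistent(source: str, attested: str, changes: list[Change]) -> bool:
    """True if the chain of changes maps the source sound to the attested one."""
    try:
        return apply_changes(source, changes) == attested
    except ValueError:
        return False


print(apply_changes("θ", [THETA_HARDENING]))
print(link_is_consistent("v", "f", [FINAL_V_DEVOICING]))
print(link_is_consistent("θ", "t", []))
```

An empty chain is consistent only when source and attested sound already match, which corresponds to the case where no change needs to be proposed.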

A graphemic substitution set: [ð] & [θ], morpheme initially

The Corpus of Changes:

What have we found out so far?
1. Our period has few localisable innovations, as might be expected from a relatively new dialect (Alcorn et al. 2017).
2. Changes described elsewhere as quick and complete (such as L-vocalisation) may progress slowly over time, phonological environments and the lexicon (Molineaux et al. 2016).
3. Older Scots as a whole innovated distinct spelling conventions such as <y> for [ð] vs. <th> for [θ] (Molineaux et al. forthcoming).
4. Some changes advanced during our period and later probably reversed (especially in the face of Anglicisation), such as the case of pre-inflectional devoicing of fricatives (Maguire et al. forthcoming).

Proportion of medial <y> (orange), <th> (grey) and <þ> (yellow) for etymological [ð] by decade. Black line = data density.


A sound substitution set: <ch>
Based on all FITS morphemes with <ch>:
• [x] = aucht, dochter, loch …
• [ç] = nicht, echt, hech …
• [θ] = bach, lench, muoch, strencht …
• [ʧ] = chalys, cheike, chekin, cheis …
• [k] = chorn, chynde, chechyne …
• [ð] = worchy, nechtir, skachlaß …

Table view (+ data extraction)

Data capture tool

Grapho-phonological parsing: Corpus annotation for historical phonology

B. MOLINEAUX¹, J. KOPACZYK², V. KARAISKOS¹, D. SMITH¹, W. MAGUIRE¹, R. ALCORN¹ & B. LOS¹
¹ The University of Edinburgh; ² The University of Glasgow

[ð]-morphemes: thus, there, those, thence, etc.

[θ]-morphemes: three, thief, think, thaw, thank, etc.

The FITS Toolbox

spellings

sounds