
ComputEL-3

Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages

Volume 1 (Papers)

February 26–27, 2019
Honolulu, Hawai‘i, USA


©2019 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
acl@aclweb.org

ISBN 978-1-950737-18-5


Preface

These proceedings contain the papers presented at the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages, held in Hawai‘i at Manoa, February 26–27, 2019. As the name implies, this is the third workshop held on the topic—the first meeting was co-located with the ACL main conference in Baltimore, Maryland in 2014, and the second one in 2017 was co-located with the 5th International Conference on Language Documentation and Conservation (ICLDC) at the University of Hawai‘i at Manoa.

The workshop covers a wide range of topics relevant to the study and documentation of endangered languages, ranging from technical papers on working systems and applications, to reports on community activities with supporting computational components.

The purpose of the workshop is to bring together computational researchers, documentary linguists, and people involved with community efforts of language documentation and revitalization to take part in both formal and informal exchanges on how to integrate rapidly evolving language processing methods and tools into efforts of language description, documentation, and revitalization. The organizers are pleased with the range of papers, many of which highlight the importance of interdisciplinary work and interaction between the various communities that the workshop is aimed towards.

We received 34 submissions as papers or extended abstracts. After a thorough review process, 12 of the submissions were selected for this volume as papers (35%), and an additional 7 were accepted as extended abstracts, which appear in Volume 2 of the workshop proceedings. The organizing committee would like to thank the program committee for their thoughtful input on the submissions. We are also grateful to the NSF for funding part of the workshop (award #1550905), and to the Social Sciences and Humanities Research Council (SSHRC) of Canada for supporting the workshop through their Connections Outreach Grant #611-2016-0207. We moreover wish to acknowledge the support of the organizers of ICLDC6, in particular Gary Holton, Brad McDonnell and Jim Yoshioka.

ANTTI ARPPE

JEFF GOOD

MANS HULDEN

JORDAN LACHLER

ALEXIS PALMER

LANE SCHWARTZ

MIIKKA SILFVERBERG


Organizing Committee:

Antti Arppe (University of Alberta)
Jeff Good (University at Buffalo)
Mans Hulden (University of Colorado)
Jordan Lachler (University of Alberta)
Alexis Palmer (University of North Texas)
Lane Schwartz (University of Illinois at Urbana-Champaign)
Miikka Silfverberg (University of Helsinki)

Program Committee:

Oliver Adams (Johns Hopkins University)
Antti Arppe (University of Alberta)
Dorothee Beermann (Norwegian University of Science and Technology)
Emily M. Bender (University of Washington)
Martin Benjamin (Kamusi Project International)
Steven Bird (University of Melbourne)
Emily Chen (University of Illinois at Urbana-Champaign)
Andrew Cowell (University of Colorado Boulder)
Christopher Cox (Carleton University)
Robert Forkel (Max Planck Institute for the Science of Human History)
Jeff Good (University at Buffalo)
Michael Wayne Goodman (University of Washington)
Harald Hammarström (Max Planck Institute for Psycholinguistics, Nijmegen)
Mans Hulden (University of Colorado Boulder)
Anna Kazantseva (National Research Council of Canada)
František Kratochvíl (Palacký University)
Jordan Lachler (University of Alberta)
Terry Langendoen (National Science Foundation)
Krister Lindén (University of Helsinki)
Worthy N. Martin (University of Virginia)
Michael Maxwell (University of Maryland, CASL)
Steven Moran (University of Zurich)
Graham Neubig (Carnegie Mellon University)
Alexis Palmer (University of North Texas)
Taraka Rama (University of Oslo)
Kevin Scannell (Saint Louis University)
Lane Schwartz (University of Illinois at Urbana-Champaign)
Miikka Silfverberg (University of Helsinki)
Richard Sproat (Google)
Nick Thieberger (University of Melbourne / ARC Centre of Excellence for the Dynamics of Language)


Laura Welcher (The Long Now Foundation)
Menzo Windhouwer (KNAW Humanities Cluster — Digital Infrastructure)


Table of Contents

An Online Platform for Community-Based Language Description and Documentation
Rebecca Everson, Wolf Honoré and Scott Grimm . . . . . . . . . . . . 1

Developing without developers: choosing labor-saving tools for language documentation apps
Luke Gessler . . . . . . . . . . . . 6

Future Directions in Technological Support for Language Documentation
Daan van Esch, Ben Foley and Nay San . . . . . . . . . . . . 14

OCR evaluation tools for the 21st century
Eddie Antonio Santos . . . . . . . . . . . . 23

Handling cross-cutting properties in automatic inference of lexical classes: A case study of Chintang
Olga Zamaraeva, Kristen Howell and Emily M. Bender . . . . . . . . . . . . 28

Finding Sami Cognates with a Character-Based NMT Approach
Mika Hämäläinen and Jack Rueter . . . . . . . . . . . . 39

Seeing more than whitespace — Tokenisation and disambiguation in a North Sámi grammar checker
Linda Wiechetek, Sjur Nørstebø Moshagen and Kevin Brubeck Unhammer . . . . . . . . . . . . 46

Corpus of usage examples: What is it good for?
Timofey Arkhangelskiy . . . . . . . . . . . . 56

A Preliminary Plains Cree Speech Synthesizer
Atticus Harrigan, Antti Arppe and Timothy Mills . . . . . . . . . . . . 64

A biscriptual morphological transducer for Crimean Tatar
Francis M. Tyers, Jonathan Washington, Darya Kavitskaya, Memduh Gökırmak, Nick Howell and Remziye Berberova . . . . . . . . . . . . 74

Improving Low-Resource Morphological Learning with Intermediate Forms from Finite State Transducers
Sarah Moeller, Ghazaleh Kazeminejad, Andrew Cowell and Mans Hulden . . . . . . . . . . . . 81

Bootstrapping a Neural Morphological Analyzer for St. Lawrence Island Yupik from a Finite-State Transducer
Lane Schwartz, Emily Chen, Benjamin Hunt and Sylvia L.R. Schreiner . . . . . . . . . . . . 87


Conference Program

Tuesday, February 26th, 2019

08:30–09:00 Imin Conference Center, Asia Room / Arrival, coffee and chat

09:00–09:15 Opening remarks

First morning: Tools and processes for language documentation and description, Corpus creation

09:15–09:45 An Online Platform for Community-Based Language Description and Documentation
Rebecca Everson, Wolf Honoré and Scott Grimm

09:45–10:15 Developing without developers: choosing labor-saving tools for language documentation apps
Luke Gessler

10:15–10:45 Future Directions in Technological Support for Language Documentation
Daan van Esch, Ben Foley and Nay San

10:45–11:15 Break

11:15–11:45 Towards a General-Purpose Linguistic Annotation Backend
Graham Neubig, Patrick Littell, Chian-Yu Chen, Jean Lee, Zirui Li, Yu-Hsiang Lin and Yuyan Zhang

11:45–12:15 OCR evaluation tools for the 21st century
Eddie Antonio Santos


Tuesday, February 26th, 2019 (continued)

12:15–12:45 Building a Common Voice Corpus for Laiholh (Hakha Chin)
Kelly Berkson, Samson Lotven, Peng Hlei Thang, Thomas Thawngza, Zai Sung, James Wamsley, Francis M. Tyers, Kenneth Van Bik, Sandra Kübler, Donald Williamson and Matthew Anderson

12:45–14:15 Lunch

First afternoon: Language technologies – lexical and syntactic

14:15–14:45 Digital Dictionary Development for Torwali, a less-studied language: Process and Challenges
Inam Ullah

14:45–15:15 Handling cross-cutting properties in automatic inference of lexical classes: A case study of Chintang
Olga Zamaraeva, Kristen Howell and Emily M. Bender

15:15–15:45 Applying Support Vector Machines to POS tagging of the Ainu Language
Karol Nowakowski, Michal Ptaszynski, Fumito Masui and Yoshio Momouchi

15:45–16:15 Break

16:15–16:45 Finding Sami Cognates with a Character-Based NMT Approach
Mika Hämäläinen and Jack Rueter

16:45–17:15 Seeing more than whitespace — Tokenisation and disambiguation in a North Sámi grammar checker
Linda Wiechetek, Sjur Nørstebø Moshagen and Kevin Brubeck Unhammer

17:15–20:00 Dinner


Wednesday, February 27th, 2019

09:00–09:30 Arrival, coffee and chat

Second morning: Use (and reuse) of corpora and collections, Language technologies – speech and morphology

09:30–10:00 Using computational approaches to integrate endangered language legacy data into documentation corpora: Past experiences and challenges ahead
Rogier Blokland, Niko Partanen, Michael Rießler and Joshua Wilbur

10:00–10:30 A software-driven workflow for the reuse of language documentation data in linguistic studies
Stephan Druskat and Kilu von Prince

10:30–11:00 Break

11:00–11:30 Corpus of usage examples: What is it good for?
Timofey Arkhangelskiy

11:30–12:00 A Preliminary Plains Cree Speech Synthesizer
Atticus Harrigan, Antti Arppe and Timothy Mills

12:00–12:30 A biscriptual morphological transducer for Crimean Tatar
Francis M. Tyers, Jonathan Washington, Darya Kavitskaya, Memduh Gökırmak, Nick Howell and Remziye Berberova

12:30–14:00 Lunch


Wednesday, February 27th, 2019 (continued)

Second afternoon: Language technologies – speech and morphology

14:00–14:30 Improving Low-Resource Morphological Learning with Intermediate Forms from Finite State Transducers
Sarah Moeller, Ghazaleh Kazeminejad, Andrew Cowell and Mans Hulden

14:30–15:00 Bootstrapping a Neural Morphological Generator from Morphological Analyzer Output for Inuktitut
Jeffrey Micher

15:00–15:30 Bootstrapping a Neural Morphological Analyzer for St. Lawrence Island Yupik from a Finite-State Transducer
Lane Schwartz, Emily Chen, Benjamin Hunt and Sylvia L.R. Schreiner

15:30–16:00 Break

16:00–17:00 Discussions, Looking ahead: CEL-4 and new ACL Special Interest Group


Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages: Vol. 1 Papers, pages 1–5, Honolulu, Hawai‘i, USA, February 26–27, 2019.

An Online Platform for Community-Based Language Description and Documentation

Rebecca Everson
Independent

[email protected]

Wolf Honoré
Yale University

[email protected]

Scott Grimm
University of Rochester

[email protected]

Abstract

We present two pieces of interlocking technology in development to facilitate community-based, collaborative language description and documentation: (i) a mobile app where speakers submit text, voice recordings and/or videos, and (ii) a community language portal that organizes submitted data and provides question/answer boards whereby community members can evaluate/supplement submissions.

1 Introduction

While engagement of language communities, and diverse members thereof, is crucial for adequate language documentation and description, this is often a challenging task given the finite resources of field linguists. We present a technological platform designed to accelerate and make more inclusive the process of documenting and describing languages, with the goal of enabling language communities to become researchers of their own languages and curators of language-based facets of their culture.

We are currently developing two pieces of interlocking technology: (i) a mobile app whose functionality and simplicity is reminiscent of WhatsApp (which is widespread in, for instance, West Africa), through which speakers submit text, voice recordings and/or videos, and (ii) a community language portal that, for each language, organizes and displays submitted data and provides discussion and question/answer boards (in the style of Quora or Stack Exchange) where community members can evaluate, refine, or supplement submissions. Together, these permit field linguists, community educators and other stakeholders to serve in the capacity of “language community coordinator”, assigning tasks that are collaboratively achieved with the community. This technology shares similarities with recent efforts such as

Aikuma (Bird et al., 2014) or Kamusi (Benjamin and Radetzky, 2014); however, the focus is on developing a limited range of functionalities with an emphasis on simplicity, to engage the highest number of community members. This in turn will accelerate the most common tasks facing a language description and/or documentation project, and do so at a technological level at which all community members who are capable of using a mobile phone app may participate.

In this paper, we first exemplify the mobile app via the process of developing a language resource, namely a lexicon. Section 2 contains an overview of the application, Section 3 is a discussion of the different user interfaces, and Section 4 gives the implementation details. Finally, we address extensions currently under development in Section 5.

2 Language Resource Development

Lexica and other resources are developed through the interaction between coordinators and contributors. The coordinator, who could be a linguist and/or community member, puts out queries for information and accepts submissions through a web console written in TypeScript using the React framework. Contributors use the mobile interface, built with React Native, a popular JavaScript framework for mobile development, to add words in both spoken and written form, along with pictures. The accepted submissions are used to automatically generate and update interactive online resources, such as a lexicon.

For lexicon development, the platform is able to accommodate a wide variety of scenarios: monolingual or bilingual, textual and/or audio, along with possible contribution of pictures or video files. Thus, in one scenario a coordinator working on a language for which textual submissions are infeasible could send requests in the form of recorded words soliciting “folk definitions” of those words (Casagrande and Hale, 1967; Laughren and Nash, 1983), that is, the speakers themselves supply the definition of a word, typically producing more words that can be requested by the coordinator. Alternately, semantic domains may serve as the basis of elicitation in the style of Albright and Hatton (2008) or in relation to resources such as List et al. (2016). Built into our approach is the ability to record language variation: for lexicon entries, multiple definitions and forms can be provided by different speakers, which can then be commented and/or voted on, until a definition (or definitions) is generally accepted and variation is properly recorded.
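To make this concrete, the following TypeScript sketch shows one way such variation-aware lexicon entries could be represented; all names are illustrative assumptions, not taken from the platform itself:

```typescript
// Hypothetical shape of a variation-aware lexicon entry.
interface VariantForm {
  form: string;        // written form, and/or a path to a voice recording
  definition: string;  // e.g. a "folk definition" supplied by the speaker
  speakerId: number;   // which speaker contributed this variant
  comments: string[];  // community discussion of the variant
  votes: number;       // votes toward accepting this definition
}

interface LexiconEntry {
  headword: string;
  accepted: VariantForm[]; // definition(s) that are generally accepted
  variants: VariantForm[]; // competing forms/definitions still under review
}
```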

Analogous processes allow speakers to contribute varieties of language-based cultural content: folktales, oral histories, proverbs, songs, videos explaining cultural practices and daily tasks (cooking, sewing, building houses, etc.). Accordingly, this method may be used to develop (i) data repositories and resources for researchers in linguistics and allied fields, especially those touching on studies in language and culture, and (ii) educational materials and other materials that may benefit the community.

3 User Interface

For the purposes of this section, we focus on the task of developing a lexicon on the basis of a predetermined wordlist, such as the SIL Comparative African Word List (Snider and Roberts, 2004), a common scenario for a fieldworker working on an under-described language whose speakers also speak an administrative language such as English or French.

Users of the application fall into three categories: coordinators, contributors, and consumers. A coordinator can be a member of the language community, such as an elder or group of elders, or she might be a field linguist working within the community. Coordinators handle adding new words to translate, assigning words to contributors, and accepting or rejecting submitted translations. Coordinators also have the ability to assign contributors and words to subgroups. This is useful if there are specialized vocabularies specific to only a portion of the community, for example, hunters who use a different vocabulary while hunting. A contributor is a community member

who has been chosen by the coordinators to complete the translation work. Finally, a consumer is anyone who has access to the community language portal that is created from the submissions accepted by the coordinators.

We see the development of the community language portal and the presentation of speakers sharing their language as key for motivating continued contributions. We expect that the varieties of language-based cultural content speakers contribute as part of the documentation activities, e.g., folk tales or oral histories, will be a key motivator for community members to use and contribute to the platform. We provide functionality for consumers to also become contributors. For instance, next to contributed videos, a button allows consumers to contribute their own video. Content so contributed will not be directly accepted to the database, but will require approval from the coordinator, so as to provide a check on inappropriate content.

The following subsections give sample user stories for each type of user. As will be detailed in Section 4, users sign in with a Google account, which provides a straightforward solution for user authentication. In the following user stories we use English and Konni [kma] as our example languages, though the application can work for any pairing.

3.1 Contributor

A new contributor joins the language description and documentation effort. They belong to a Ghanaian community that speaks English and Konni, which is the language they want to describe and document. They open the mobile application and sign in with their Google account credentials. Upon the initial sign-in, the contributor must provide responses to secure consent in accordance with an IRB protocol. Then the contributor is presented with a demographic survey. Once successfully logged in, they see the home screen, which displays their assignments, e.g., a list of semantic prompts or words in English (see Figure 1). They select a word from the list by tapping it. They are then taken to a form with fields for the translation of the word in Konni, a sample sentence using the word in English, and a translation of the sentence in Konni. There are also two fields for audio recordings of the word and sentence in Konni. When the user selects these fields they see a screen with Start and Stop buttons. Pressing Start begins recording using the phone’s microphone, and Stop ends the recording and saves the audio as a WAV file. When they have filled in all of the fields, they press the Submit button at the bottom of the screen and are taken back to the home screen. Now the word that they just translated is green and has a check-mark next to it (see Figure 2). They now also see a button at the bottom of the home screen that reads Submit 1 translation(s). When they are ready to submit their completed translations, they press that button. If their phone is currently connected to the Internet, the translated words will be removed from the list and new words will appear. If not, they can continue translating or wait to enter an area with WiFi.

Figure 1: Application home page.

Figure 2: Word translation page.

3.2 Coordinator

A coordinator wants to review some translations and assign more words to contributors. He starts by opening a web browser and navigating to his documentation project’s web page. After logging in with his Google account credentials he sees that a contributor has just submitted three new translations that need to be reviewed. The coordinator reviews all of the translations and decides that two of them are ready for publishing, but one of them needs a better example sentence. He rejects the translation with a note explaining why it was rejected, and the word is put back into circulation automatically by the system. The two accepted words will no longer be added to users’ word lists unless manually added back into the database.


3.3 Consumer

A consumer who visits the Community Language Portal navigates to a part of the website providing a picture gallery of everyday and cultural artifacts. As she looks through the pictures and their captions, she notices that the word for water pitcher has a different form than is used in her village. She clicks on the ‘Contribute’ button, and is able to either leave a comment about the caption or go through the sign-up process to become a contributor. The coordinator reviews her submission, and, if appropriate, adds her to the contributors, later sending out more invitations to contribute.

4 Implementation Details

Users of the application only see information relevant to their current task, but all user and language data is stored and managed in a remote database. The application communicates with a CherryPy web server, which acts as the glue between these two components. For example, consider the Konni contributor trying to submit some translations. Before doing anything, the user must first sign in. To avoid having to manage passwords we require users to connect to the application using a Google account. Then the application communicates with Google’s Authentication Server using the OAuth 2.0 protocol to retrieve an authentication token that uniquely identifies the user. When the contributor presses Submit, the phone will first store the translation information locally and wait until it enters an area with a strong Internet connection. Once connected, it then sends an HTTP request to the web server containing the translations to be submitted (multiple translations are batched together for efficiency), as well as the authentication token. Upon receiving the request, the server uses the token to verify that the user exists and has the proper permissions. It then adds the translations to the database and queries it for a new set of untranslated words. Finally, the server responds to the application indicating that the submission was successful and containing the new words.
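The client side of this exchange might look roughly like the following TypeScript sketch; the endpoint path, payload fields, and header layout are assumptions for illustration, not the app’s actual API:

```typescript
// Client-side sketch of the batched submission described above.
interface TranslationSubmission {
  wordId: number;
  targetForm: string;     // the word in Konni
  sourceSentence: string; // example sentence in English
  targetSentence: string; // its Konni translation
  wordAudioPath: string;  // locally stored WAV recordings
  sentenceAudioPath: string;
}

async function submitBatch(
  submissions: TranslationSubmission[],
  authToken: string, // OAuth 2.0 token obtained via Google sign-in
): Promise<number[]> {
  // Multiple completed translations are batched into one HTTP request.
  const response = await fetch('https://example.org/api/translations', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${authToken}`,
    },
    body: JSON.stringify(submissions),
  });
  // The server verifies the token, stores the batch, and replies with a
  // fresh set of untranslated word IDs for this contributor.
  return response.json();
}
```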

At the lowest level of the application stack is a MySQL database that is responsible for storing user information and translations. It consists of three tables: words for the words that are to be translated, users for all the application users, and translations for all submitted translations, both reviewed and unreviewed. An entry in words contains the word itself, as well as other grammatical information, such as part-of-speech. The users table contains users’ names, their Google authentication tokens, their roles (contributor or coordinator), and the maximum number of words that can be assigned to them. Each row in translations consists of the word translated into the target language, a sentence containing the word in the source language, the same sentence translated, paths to audio files containing recordings of the word and sentence, and a flag indicating whether the translation has been accepted.

This is a natural division of the data that allows the tables to grow independently of one another (e.g. adding a new user only affects the users table). However, we often want to make queries that depend on information from multiple tables, such as searching for words with no accepted translations that are assigned to fewer than three users. To facilitate these searches we also introduce links between the tables. A word can have multiple translations, but each translation corresponds to exactly one word, so words and translations have a 1-many relation. Similarly, because a user may have many submitted translations, and each translation was submitted by one user, users and translations also have a 1-many relation. On the other hand, a word can be assigned to several users, and a user may have multiple assigned words, so words and users have a many-many relation.
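Putting these pieces together, the schema and relations can be mirrored as in the sketch below. The junction table for the many-many relation and all field names are our assumptions, not taken from the paper:

```typescript
// TypeScript mirror of the three tables plus a hypothetical junction table.
interface Word {
  id: number;
  text: string;          // the word to be translated
  partOfSpeech?: string; // other grammatical information
}

interface User {
  id: number;
  name: string;
  googleToken: string;
  role: 'contributor' | 'coordinator';
  maxAssignedWords: number;
}

interface Translation {
  id: number;
  wordId: number; // each translation corresponds to exactly one word,
  userId: number; // and was submitted by exactly one user
  targetForm: string;
  sourceSentence: string;
  targetSentence: string;
  wordAudioPath: string;
  sentenceAudioPath: string;
  accepted: boolean;
}

interface Assignment {
  wordId: number; // a word can be assigned to several users,
  userId: number; // and a user can have several assigned words
}

// Sketch (column names assumed) of the cross-table query mentioned above:
// words with no accepted translation, assigned to fewer than three users.
const UNDER_ASSIGNED_WORDS_SQL = `
  SELECT w.id
  FROM words w
  LEFT JOIN translations t ON t.word_id = w.id AND t.accepted = TRUE
  LEFT JOIN assignments a ON a.word_id = w.id
  WHERE t.id IS NULL
  GROUP BY w.id
  HAVING COUNT(DISTINCT a.user_id) < 3;
`;
```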

5 Current Developments

With this groundwork laid on the application, we are expanding other aspects of the project. Since one of our primary goals is to engage community members, we are pursuing more ways for them to engage with the software and data. To that end, we are designing a discussion board for raising questions about accepted materials, asking for clarification on words, debates, polls, and so forth. Anyone from the community will be able to use this software. In short, we aim to leverage successful examples of online community building to further language description and documentation.

It is possible that contributors use slightly different lexica within the community. For example, in a community that has a designated group of hunters, the hunters might use different words in the field that community members who stay in the village most of the time don’t know. In this example, a coordinator might want to gather lexical data from both groups, so they should be able to mark which users belong to which sub-communities in the database. At the moment, we have back-end functionality for this sub-grouping of words and users, but no way for a coordinator to interact with this feature from the application. In the meantime, words are assigned to users automatically.

Having a way for a coordinator to assign words to specific users will also be an important feature. It is very likely that contributors will sometimes be working in areas with a lot of background noise, and not everyone will have a phone that can record high-quality audio. Giving coordinators the ability to reassign words to users who they know can record with better sound quality will ensure high-quality data, which can then be used later in linguistic analysis.

References

Eric Albright and John Hatton. 2008. WeSay: A tool for engaging native speakers in dictionary building. Documenting and Revitalizing Austronesian Languages, (1):189.

Martin Benjamin and Paula Radetzky. 2014. Small languages, big data: Multilingual computational tools and techniques for the lexicography of endangered languages. In Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 15–23.

Steven Bird, Florian R. Hanke, Oliver Adams, and Haejoong Lee. 2014. Aikuma: A mobile app for collaborative language documentation. In Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 1–5.

J.B. Casagrande and K.L. Hale. 1967. Semantic relationships in Papago folk-definitions. In Studies in Southwestern Ethnolinguistics, pages 165–193. Mouton, The Hague.

M. Laughren and D. Nash. 1983. Warlpiri dictionary project: Aims, method, organization and problems of definition. Papers in Australian Linguistics, 15(1):109–133.

Johann-Mattis List, Michael Cysouw, and Robert Forkel. 2016. Concepticon: A resource for the linking of concept lists. In Proceedings of the Tenth International Conference on Language Resources and Evaluation, May 23–28, 2016, Portorož, Slovenia, pages 2393–2400.

Keith Snider and James Roberts. 2004. SIL Comparative African Word List (SILCAWL). Journal of West African Languages, 31(2):73–122.


Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages: Vol. 1 Papers, pages 6–13, Honolulu, Hawai‘i, USA, February 26–27, 2019.

Developing without developers: choosing labor-saving tools for language documentation apps

Luke D. Gessler
Department of Linguistics

Georgetown University
[email protected]

Abstract

Application software has the potential to greatly reduce the amount of human labor needed in common language documentation tasks. But despite great advances in the maturity of tools available for apps, language documentation apps have not attained their full potential, and language documentation projects are forgoing apps in favor of less specialized tools like paper and spreadsheets. We argue that this is due to the scarcity of software development labor in language documentation, and that a careful choice of software development tools could make up for this labor shortage by increasing developer productivity. We demonstrate the benefits of strategic tool choice by reimplementing a subset of the popular linguistic annotation app ELAN using tools carefully selected for their potential to minimize developer labor.

1 Introduction

In many domains like medicine and finance, application software has dramatically increased productivity, allowing fewer people to get more done. This kind of labor reduction is sorely needed in language documentation, where there is not nearly enough labor to meet all the demand in the world for language documentation.

There is every reason to think that application software (“apps”) could also help language documentation (LD) practitioners get their work done more quickly. But despite the availability of several LD apps, many practitioners still choose to use paper, spreadsheets, or other generic tools (Thieberger, 2016). Why aren’t practitioners using apps?

1.1 Perennial problems in LD apps

The simplest explanation is that the apps are not helpful enough to justify the cost of learning and adjusting to them. While it is clear by now from progress in natural language processing that it is technically possible to make laborious tasks such as glossing and lexicon management less time-consuming by orders of magnitude, flashy features like these turn out to be only one part of the practitioner’s decision to adopt an app. It seems that other factors, often more mundane but no less important, can and often do nullify or outweigh these benefits in today’s apps. Three kinds of problems can be distinguished which are especially important for LD apps.

First, there is the unavoidable fact that the user will need to make unwanted changes to their workflow to work within an app. By virtue of its structure, an app will always make some tasks prerequisites for others, or at least make some sequences of tasks less easy to perform than others. Ideally, an app designer will be successful in identifying the task-sequences that are most likely to be preferred by their user and ensure that these workflows are supported and usable within the app. But no matter how clever they are, their app will always push some users away from their preferred workflow. For app adoption, this presents both a barrier to entry and a continuous cost (if a user is never able to fully adapt themselves to the app’s structure).

Second, app developers often make simplifying assumptions which sacrifice generality for development speed. For instance, an app’s morphological parser might not be capable of recognizing diacritics as lexically meaningful, which would limit the app’s usability, e.g. for the many languages which represent lexical tone with diacritics.

Third, an app might lack a feature altogether. An app might only work on a particular operating system like Windows, or might not work without an internet connection, both of which might single-handedly make an app unusable. In another vein, it might be critical for a project to have texts be annotated with a syntactic parse, or for all texts to be exportable into a particular format, such as an ELAN (Wittenburg et al., 2006) or FLEx (Moe, 2008) project. If a user cannot perform a crucial kind of linguistic annotation, or move their data into another app which is critically important for them, they are likely to not use the app at all, no matter how good it might be at segmenting text into tokens, morphologically parsing words, or managing a lexical inventory. Indeed, the features which might be absolutely necessary for any given practitioner are in principle at least as numerous as the number of formalisms and levels of analysis used by all linguists, and it is to be expected that an app will be inadequate for a great number of potential users.

1.2 The need for more developer labor

Whatever the exact proportions of these problems in explaining the under-adoption of apps, it seems clear that they all could be solved relatively straightforwardly with more software engineering labor. Much of the hard work of inventing the linguistic theories and algorithms that would enable an advanced, labor-saving LD app has already been done. What remains is the comparatively mundane task of laying a solid foundation of everyday software on which we might establish these powerful devices, and the problem is that in LD, software engineering labor is—as in most places in science and the academy—markedly scarce.

With a few notable exceptions¹, LD apps have largely been made by people (often graduate students) who have undertaken the project on top of the demands of their full-time occupation. Considering how in industry entire teams of highly-skilled full-time software engineers with ample funding regularly fail to create successful apps, it is impressive that LD apps have attained as much success as they have. Nevertheless, most LD software will continue to be produced only in such arid conditions, and so the question becomes one of whether and how it might be possible to create and maintain a successful app even when the only developers will likely be people like graduate students, who must balance the development work with their other duties.

While pessimism here would be understandable, we believe it is possible to create an app that is good enough to be worth using for most practitioners even under these circumstances. While there is still not much that is known with certainty about productivity factors in software engineering (Wagner and Ruhe, 2018), many prominent software engineers believe that development time can be affected dramatically by the tools that are used in the creation of an app (Graham, 2004).

¹ FLEx, ELAN, and Miromaa all have at least one full-time software engineer involved in their development.

Apps depend on existing software libraries, and these libraries differ in dramatic ways: some databases are designed for speed, and others are designed for tolerating being offline; some user interface frameworks are designed for freedom and extensibility, and others double down on opinionated defaults that are good enough most of the time; some programming languages are always running towards the cutting edge of programming language research, and others aim to be eminently practical and stable for years to come. It is obvious that these trade-offs might have great consequences for their users: a given task might take a month with one set of tools, and only a few days with another.

We should expect, then, that the right combination of tools could allow LD app developers to be much more productive. If this is true, then if only we could choose the right set of tools to work with, we might be able to overcome the inherent lack of development labor available for LD apps.

2 Cloning ELAN

To put this hypothesis to the test, we recreated the rudiments of an app commonly used for LD, ELAN (Wittenburg et al., 2006). Choosing an existing app obviated the design process, saving time and eliminating a potential confound. ELAN was chosen in particular because of its widespread use in many areas of linguistics, including LD, and because its user interface and underlying data structures are complicated. We reasoned that if our approach were to succeed in this relatively hard case, we could be fairly certain it could succeed in easier ones, as well.

With an eye to the economic problems of LD app development outlined in section 1.2, we first determined what features we thought an advanced LD app would need to prioritize in order to make development go quickly without compromising on quality. Then, we chose software libraries in accord with these requirements.


2.1 Requirements and tool choices

There were four requirements that seemed most important to prioritize for an LD app. These were: (1) that the app be built for web browsers, (2) that the app work just as well without an internet connection, (3) that expected “maintenance” costs be as low as possible, and (4) that graphical user interface development be minimized and avoided at all costs.

2.1.1 Choice of platform: the browser

15 years ago, the only platform that would have made sense for an LD app was the desktop. These days, there are three: the desktop, the web browser, and the mobile phone. While the mobile phone is ubiquitous and portable, it is constrained by its form factor, making it unergonomic for some kinds of transcription and annotation. The desktop platform has the advantage of not expecting an internet connection, and indeed the most widely used LD apps such as FLEx and ELAN have been on the desktop, but it is somewhat difficult to ensure that a desktop app will be usable on all operating systems, and the requirement that the user install something on their machine can be prohibitively difficult.

The web browser resolves both of these difficulties: no installation is necessary, since using an app is as easy as navigating to a URL, and the problem of operating system support is taken care of by web browser developers rather than app developers.

2.1.2 Offline support: PouchDB

The notable disadvantage of web browsers compared to these other platforms, however, is that the platform more or less assumes that you will have an internet connection. The simple problem is that language documentation practitioners will often not have a stable internet connection, and if an app is unusable without one, they will never use the app at all.

Fortunately, there are libraries that can enable a browser app to be fully functional even without an internet connection. The main reason why a typical browser app needs an internet connection is that the data that needs to be retrieved and changed is stored in a database somewhere on the other end of a network connection. But there are databases that can be installed locally, removing the need for an uninterrupted internet connection.

Figure 1: PouchDB imitates a traditional database, but it exists alongside the app on a user’s local machine rather than on a remote server. When an internet connection becomes available, PouchDB can sync with the remote server and other clients, meaning that sharing and securely backing up data is still possible.

The most advanced database for this purpose is PouchDB. PouchDB is a browser-based implementation of the database system CouchDB. This means that PouchDB acts just like a normal database would, except it is available locally instead of over a network connection. And when an internet connection does become available, PouchDB is able to share changes with a remote instance of CouchDB, which then makes them available to other collaborators, as seen in figure 1. PouchDB and CouchDB were specially designed to make this kind of operation easy, and retrofitting the same behavior with a more traditional choice of tools would be extremely time consuming.
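A minimal sketch of this offline-first pattern, using PouchDB’s public API; the database name and remote CouchDB URL are placeholders:

```typescript
import PouchDB from 'pouchdb';

// The local database lives in the browser's own storage.
const local = new PouchDB('ewan-projects');

// All reads and writes go to the local database, so the app works offline.
async function saveProject(project: { _id: string; name: string }) {
  await local.put(project);
}

// When a connection is available, live two-way replication keeps the local
// database in sync with a remote CouchDB (and thus with other clients).
local
  .sync('https://couch.example.org/ewan-projects', { live: true, retry: true })
  .on('change', info => console.log('synced batch:', info.direction))
  .on('error', err => console.error('sync error:', err));
```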

An added benefit of adopting this database strategy is that it dramatically reduces the need for client-server communication code. Normally, an entire layer of code is necessary to package, transmit, and unpackage data between the client and the server, which most of the time amounts to nothing more than boilerplate. With this approach, all of that is taken care of by the replication protocol which allows PouchDB instances to sync with CouchDB instances, and the server code that is required by a small set of cases (e.g. triggering a notification email) can still exist alongside this architecture.

We are not the first to realize PouchDB’s potential for writing an LD app: FieldDB (aka LingSync) (Dunham et al., 2015) uses PouchDB.


2.1.3 Minimizing maintenance: ClojureScript

Software requires some degree of modification as time goes on. Almost all software relies on other software, and when breaking changes occur in a dependency, an app must also have its code modified to ensure that it can continue operating with the latest version of its dependency. For instance, much of the Python community is still migrating their code from Python 2 to Python 3, an error-prone process that has consumed many thousands of developer-hours worldwide.

This kind of maintenance is, at present, quite common for browser apps: web browsers, JavaScript, and JavaScript frameworks are evolving at a quick pace that often requires teams to constantly modify their code and compilers, and sometimes to even abandon a core library altogether in favor of a better one, which incurs a massive cost as the team excises the old library and stitches in the new one. These maintenance costs are best avoided if possible, as they contribute nothing valuable to the user: in the best case, after maintenance, no change is discernible; and if the best case is not achieved, the app breaks.

We surveyed programming languages that can be used for developing browser applications and were impressed by the programming language called ClojureScript. ClojureScript is a Lisp-family functional programming language that compiles to JavaScript. The language is remarkably stable: ClojureScript written in 2009 still looks essentially the same as ClojureScript written in 2018. The same cannot often be said for JavaScript code.

A full discussion of the pros and cons of ClojureScript is beyond the scope of this paper, but it might suffice to say that ClojureScript offers very low maintenance costs and powerful language features at the cost of a strenuous learning process. Unlike Python, Java, or JavaScript, ClojureScript is not object-oriented and does not have ALGOL-style syntax, making it syntactically and semantically unfamiliar. This is a serious penalty, as it may reduce the number of people who could contribute to the app’s development. However, this may turn out to be a price worth paying, and as we will see in section 2.3.2, this disadvantage can be mitigated somewhat by allowing users and developers to extend the app using more familiar programming languages.

Figure 2: Trafikito, an app that uses Material-UI to support mobile and desktop clients.

2.1.4 Minimizing GUI development: Material-UI

Creating user interface components for the browser from scratch is extremely time consuming. In the past several years, however, numerous comprehensive and high-quality open source component libraries for the browser have been released. Most parts of an LD app can be served well by these off-the-shelf components, making it an obvious choice to outsource as much UI development as possible to them.

The only concern would be that these libraries might be abandoned by their maintainers. While this is always a possibility, the communities that maintain the most prominent among these libraries are very active, and in the event that their current maintainers abandoned them, it seems likely that the community would step up to take over maintenance.

We chose to work with Material-UI, a library of components for Facebook’s React UI framework that adheres to Google’s Material Design guidelines. Beyond providing a set of high-quality components, with the v1.0 release of Material-UI, all components have been designed to work well on mobile clients, which has the potential to make an app developed with Material-UI usable on mobile phones at no extra development cost for the app developer, an incredible timesaving benefit. (See figure 2.)
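For illustration, a small form assembled from stock @material-ui/core components might look like the following TypeScript (TSX) sketch; it is not EWAN’s actual code:

```tsx
import * as React from 'react';
import Button from '@material-ui/core/Button';
import TextField from '@material-ui/core/TextField';

// A project-creation form built entirely from off-the-shelf components.
export function NewProjectForm(props: { onCreate: (name: string) => void }) {
  const [name, setName] = React.useState('');
  return (
    <form onSubmit={e => { e.preventDefault(); props.onCreate(name); }}>
      {/* Stock components are responsive, so the same form works in
          desktop and mobile browsers with no extra development. */}
      <TextField
        label="Project name"
        value={name}
        onChange={e => setName(e.target.value)}
      />
      <Button type="submit" color="primary">Create</Button>
    </form>
  );
}
```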

2.2 Results

Using these tools, we were able to create a browser app that implements a subset of ELAN in about 3 weeks of development time². Our implementation process basically confirmed our hypothesis: that our careful choice of tools made developers more productive, and vastly simplified the implementation of certain features that are critical to LD apps.

Figure 3: A screenshot of ELAN, above, and EWAN, below, for a given sample project.

ELAN is an app that allows users to annotate a piece of audio or video along multiple aligned “tiers”, each of which represents a single level of analysis for a single kind of linguistic or non-linguistic modality. For instance, in the example project in figure 3 (shown in both ELAN and our reimplementation, EWAN), for one of the speakers, there is a tier each for the speaker’s sentence-level utterance, the speaker’s utterance tokenized into words, IPA representations of those words, and the parts of speech for those words; and there is another tier for describing the speaker’s gesticulations. Users are able to configure tiers so that e.g. a child tier containing parts of speech may not have an annotation that does not correspond to a word in the parent tier, and the app coordinates the UI so that the annotation area always stays in sync with the current time of the media player.
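As a rough illustration of the data structures involved (our names, not ELAN’s internal terminology), a tiered annotation document and its parent-child invariant can be modeled like this:

```typescript
// Illustrative model of tiered, time-aligned annotation.
interface Annotation {
  id: string;
  value: string;     // e.g. a word, an IPA form, or a part-of-speech tag
  startMs: number;   // alignment to the media timeline
  endMs: number;
  parentId?: string; // on child tiers: the annotation this one refines
}

interface Tier {
  id: string;            // e.g. "utterance", "words", "ipa", "pos"
  parentTierId?: string; // child tiers may only annotate spans of the parent
  annotations: Annotation[];
}

// The constraint described above: every annotation on a child tier must
// correspond to an annotation on its parent tier.
function checkParentLinks(child: Tier, parent: Tier): boolean {
  const parentIds = new Set(parent.annotations.map(a => a.id));
  return child.annotations.every(
    a => a.parentId !== undefined && parentIds.has(a.parentId),
  );
}
```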

We implemented ELAN’s behavior to a point where we felt we had been able to evaluate our hypothesis with reasonable confidence. In the end, this amounted to implementing the following features: (1) a basic recreation of the ELAN interface, (2) importing and exporting ELAN files, (3) read-only display and playback of projects, (4) editing of existing annotations, and (5) completely offline operation with sync capabilities. Most notably missing from this list is the ability to create new tiers and annotations, but we did not feel that implementing these features would have significantly changed our findings.

The creation of the interface with Material-UI was straightforward. The only UI components that involved significant custom development were the tier table and the code that kept the table and the video in sync: everything else, including the forms that allowed users to create new projects and import existing projects, was simply made with the expected Material-UI components.

ClojureScript’s XML parsing library made importing and exporting ELAN project files a straightforward matter. Internally, EWAN uses the ELAN project file format as its data model, so there was no additional overhead associated with translating between EWAN and ELAN formats. To ensure the validity of the data under any changes that might be made in EWAN, we used ClojureScript’s cljs.spec library to enforce invariants³. cljs.spec is one language feature among many that we felt made ClojureScript a good choice for enhancing developer productivity: cljs.spec allows developers to declaratively enforce data model invariants, whereas in most other languages more verbose imperative code would have to be written to enforce these invariants.

² Our app can be seen at https://lgessler.com/ewan/ and our source code can be accessed at https://github.com/lgessler/ewan.
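EWAN itself does this in ClojureScript, but since .eaf files are plain XML, the import step can be sketched in TypeScript with the browser’s built-in DOMParser. The TIER, TIER_ID, and ANNOTATION_VALUE names below come from the EAF format; the function itself is illustrative, not EWAN’s implementation:

```typescript
// Read tier IDs and their annotation values from an ELAN .eaf document.
function listTiers(eafXml: string): Map<string, string[]> {
  const doc = new DOMParser().parseFromString(eafXml, 'application/xml');
  const tiers = new Map<string, string[]>();
  for (const tier of Array.from(doc.querySelectorAll('TIER'))) {
    const id = tier.getAttribute('TIER_ID') ?? '';
    const values = Array.from(tier.querySelectorAll('ANNOTATION_VALUE'))
      .map(v => v.textContent ?? '');
    tiers.set(id, values);
  }
  return tiers;
}
```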

Projects were stored in PouchDB, and all the expected benefits were realized: an internet connection is no longer needed. Indeed, on the demo deployment of the app, there is not even a remote server to accompany it: once the app has been downloaded, users may use the app in its entirety without an internet connection. With ease, we were able to set up an instance of CouchDB on a remote server and sync our EWAN projects with it, even in poor network conditions.

2.3 Extensions

At the point in development where we stopped, there were features which we were close enough to see with clarity, even though we did not implement them. These are features we expect would be easily within reach if a similar approach were taken to creating an LD app.

2.3.1 Real-time collaboration

Google Docs-style real-time collaboration, where multiple clients collaborate in parallel on a single document, is well supported by our choice of PouchDB. Implementation is not entirely trivial, but the hard part—real-time synchronization without race conditions or data loss—is handled by CouchDB’s replication protocol. It is debatable how useful this feature might be in an LD app, but if it were needed, it would be easily within reach.

2.3.2 User scripts with JavaScript

Power users delight in the opportunity to use a scripting language to tackle tedious or challenging tasks. Because this app was built in the web browser, we already have at our disposal a language with excellent support that was designed to be easy for users to pick up and write small snippets with: JavaScript.

³ An invariant in this context is some statement about the data that must always be true. For example, one might want to require that an XML element referenced by an attribute in another element actually exists.

Figure 4: Examples of what a JavaScript scripting interface for EWAN might look like. An API implemented in ClojureScript with functions intended for use in JavaScript that can modify EWAN’s data structures is made available in a global namespace, and power users can use the scripting functions to perform bulk actions and customize their experience.

Since ClojureScript compiles to JavaScript and is designed for interoperation, it is very easy to make such a system available. As shown in figure 4, a special module containing user-facing functions that are meant to be used from a JavaScript console could be created. This could be used for anything from bulk actions to allowing users to hook into remote NLP services (e.g. for morphological parsing, tokenization, etc.).
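A hypothetical shape for such a console-facing API, in the spirit of figure 4; none of these names are EWAN’s actual API, they only illustrate the idea:

```typescript
// Functions the ClojureScript app could expose on a global for console use.
interface EwanScriptingApi {
  getAnnotations(tierId: string): { id: string; value: string }[];
  updateAnnotation(id: string, newValue: string): void;
}

// Assume the app has attached the API to window.ewan.
const ewan = (window as unknown as { ewan: EwanScriptingApi }).ewan;

// A power user could then run a bulk action from the browser console,
// for example lower-casing every annotation on the "words" tier:
for (const a of ewan.getAnnotations('words')) {
  ewan.updateAnnotation(a.id, a.value.toLowerCase());
}
```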

The possibilities are endless with a user-facing API, and a thoughtfully constructed scripting API could do much to inspire and attract users. It’s worth noting that this feature was made possible by our choice of platform: had we not been using the browser, we would have had either to implement an entire scripting language or to embed an existing one (like Lua) into the app.

2.3.3 Extensions

As noted in section 1.1, there will always be formalisms or levels of analysis that will not have robust support in an app. Normally, this pushes users who really need such support to either use another app or revert to a more generic medium.

To try to tackle this issue, an app structured like EWAN could be enriched with APIs to allow users to customize deeper parts of the app, as shown in figure 5. Suppose that a user is a semanticist who has designed a new semantic parsing formalism, and wants to be able to annotate texts with it. This would require the creation of at least a new UI. If the user is comfortable with writing JavaScript, the user could use these extension APIs to implement support for their own formalism, provided these extension APIs are powerful and approachable enough for use by novice programmers. The user could then not only benefit themselves, but also share the extension with the community.

Figure 5: A high-level representation of EWAN’s API layers and how UI and business logic layers could be extended if extension functions at the DB and business logic layers were made available.

If this were done well, a major problem driving the continued fragmentation of the LD app landscape would be solved, and the LD community could perhaps begin to converge on a common foundation for common tasks in LD.

3 Conclusion

Because of the economic conditions that are endemic to software development in language documentation, app developers cannot reasonably hope to succeed by taking software development approaches unmodified from industry, where developer labor is much more abundant. We have argued these economic conditions are the major reason why LD apps have not realized their full potential, and that the way to solve this problem is to be selective in the tools that are used in making LD apps. We have shown that choice of tools does indeed affect how productive developers can be, as demonstrated by our experience recreating a portion of the app ELAN.

It is as yet unclear whether our choices were right: both our choices in which requirements to prioritize, and in which tools to use to address those requirements. Offline support seems secure as a principal concern for LD apps, but perhaps it might actually be the case that it is better to choose a programming language that is easy to learn rather than easy to maintain. What we hope is clear, however, is that choice of tools can affect developer productivity by orders of magnitude, and that the success of language documentation software will be decided by how foresighted and creative its developers are in choosing tools that will minimize their labor.

Acknowledgments

Thanks to Mans Hulden, Sarah Moeller, Anish Tondwalkar, and Marina Bolotnikova for helpful comments on an earlier version of this paper.



Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages: Vol. 1 Papers, pages 14–22, Honolulu, Hawai‘i, USA, February 26–27, 2019.

Future Directions in Technological Support for Language Documentation

Daan van Esch
Google
[email protected]

Ben Foley
The University of Queensland,
ARC Centre of Excellence for the Dynamics of Language
[email protected]

Nay San
Stanford University,
Australian National University,
ARC Centre of Excellence for the Dynamics of Language
[email protected]

Abstract

To reduce the annotation burden placed on linguistic fieldworkers, freeing up time for deeper linguistic analysis and descriptive work, the language documentation community has been working with machine learning researchers to investigate what assistive role technology can play, with promising early results. This paper describes a number of potential follow-up technical projects that we believe would be worthwhile and straightforward to do. We provide examples of the annotation tasks for computer scientists; descriptions of the technological challenges involved and the estimated level of complexity; and pointers to relevant literature. We hope providing a clear overview of what the needs are and what annotation challenges exist will help facilitate the dialogue and collaboration between computer scientists and fieldwork linguists.

1 The Transcription and Annotation Challenge

Language documentation projects typically yield more data than can be reasonably annotated and analyzed by the linguistic fieldworkers that collect the data. For example, transcribing one hour of audio recordings can take 40 hours or more (Durantin et al., 2017), and potentially even longer where the language is severely under-studied or under-described. To reduce the annotation burden placed on linguistic fieldworkers, freeing up time for deeper linguistic analysis and descriptive work, the language documentation community has been working with machine learning (ML) researchers to investigate what role technology can play. As a result, today enterprising fieldwork linguists can use a number of toolkits to help accelerate the annotation process by automatically proposing candidate transcriptions. Such toolkits include CoEDL Elpis (Foley et al., 2018), Persephone (Adams et al., 2018; Michaud et al., 2018), and SPPAS (Bigi, 2015).

2 A Technological Development Roadmap

These toolkits are already showing promising results across a number of languages. Arguably, however, their current incarnation only scratches the surface of what is technically possible, even today. In this paper, we aim to describe a number of potential extensions to these systems, which we believe would be reasonably straightforward to implement, yet which should significantly improve the quality and usefulness of the output. Specifically, we look at the annotation steps in the language documentation process beyond just phonemic or orthographic transcription, and describe what assistive role technology can play for many other steps in the language documentation process, such as automatically proposing glossing and translation candidates.

To be clear, this paper does not actually describe complete systems that have been implemented. Rather, our goal is to describe technical projects that we believe would be worthwhile and straightforward to do. We provide examples of the annotation tasks for computer scientists; descriptions of the technological challenges involved and the estimated level of complexity; and pointers to relevant literature on both the computational and the linguistic side. Our hope is that this overview of what the needs are and what annotation challenges exist will help facilitate the dialogue and collaboration between computer scientists and fieldwork linguists.

Our higher-level vision is for a toolkit that lets fieldwork linguists process data at scale, with limited technical knowledge needed on the side of the user. This toolkit should make it as easy as possible to apply the most relevant well-known techniques in natural language processing and machine learning to any language. This will likely involve providing a simple, pre-configured graphical user interface; for the purposes of this paper, however, we focus on the underlying technology.

For our purposes, we assume there is an audio corpus where orthographic transcriptions are available for at least some subset of the recordings. If no orthographic transcriptions exist at all, then there are a number of techniques around audio-only corpus analysis that can be applied, such as those explored in the Zero-Resource Speech Challenges (Dunbar et al., 2017), but those approaches are outside of the scope of this paper.

We also assume that the corpus at hand was created in a relatively ubiquitous tool for language documentation, e.g. ELAN (Max Planck Institute for Psycholinguistics; Brugman and Russel, 2004), Praat (Boersma and Weenink, 2001), or Transcriber (DGA, 2014), for which there would be a ready-to-use conversion tool to convert the corpus into a common format that can be understood by the toolkit. Finally, we assume that the linguist has access to a tool to create keyboards for use on a desktop/laptop system, such as Keyman (SIL), so an appropriate orthography can be used, and that the linguist has access to an input method for the International Phonetic Alphabet (IPA).

Broadly, we see two areas where automation toolkits for language documentation can still make significant headway: automatic analysis of existing data in the corpus, and automatic processing of unannotated data. To facilitate the discussion below, Table 1 briefly walks through a reasonably comprehensive set of annotations for an audio recording. Specifically, for each word in our Nafsan example utterance, Table 1 includes:

• a speaker label to indicate which of the speakers in the recording spoke this word

• the orthographic form, or simply spelling, of this word

• an interlinear gloss, which we will describe in further detail below for readers who are unfamiliar with this practice

• a part-of-speech tag, e.g. using the conventions from (Petrov et al., 2012)

• a phonemic transcription, e.g. in the International Phonetic Alphabet

In addition, our example annotation shows a translation of the entire utterance into a single language.1

3 Analysis of Existing Data

3.1 Text Analytics

In terms of automatic analysis of existing data, there appears to be a significant amount of low-hanging fruit around text-based analytics. Once a language worker has put the effort into preparing a corpus for use in a toolkit, by normalizing and consolidating files, the data becomes suitable for automatic analysis to obtain corpus-wide metrics. For example, a toolkit could easily emit the following metrics on the orthographic transcriptions in the corpus (a short code sketch follows the list):

• the total number of words observed2

• the number of unique words observed

• a frequency-ranked wordlist, with word counts

• the average number of words in each utterance, and a histogram of words per utterance
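To make this concrete, here is a minimal Python sketch of how a toolkit might compute these metrics. It assumes, per footnote 2, a whitespace-separated orthography, and that the transcriptions have already been loaded as a list of strings, one per utterance.

import statistics
from collections import Counter

def corpus_metrics(utterances):
    # Each utterance is one orthographic transcription string.
    token_lists = [u.split() for u in utterances]
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {
        'total_words': sum(counts.values()),
        'unique_words': len(counts),
        'frequency_ranked_wordlist': counts.most_common(),
        'mean_words_per_utterance':
            statistics.mean(len(toks) for toks in token_lists),
        'words_per_utterance_histogram':
            Counter(len(toks) for toks in token_lists),
    }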

These metrics can also be broken down by speaker, if multiple speakers are present in the corpus and speaker annotations are available in the metadata. In that case, a toolkit could also automatically compute statistics such as the pointwise mutual information score for the usage of each word by each speaker, or by speaker group, for example if multiple varieties are represented in the corpus. This may point to interesting inter-speaker or inter-variety lexical frequency differences that merit further investigation. Per-speaker lexical analyses could also be used to inform fieldwork elicitation adjustments, e.g. when some words were not elicited from a particular speaker in the data set. For a literature review describing a large number of automatic analysis options in this space, see (Wieling and Nerbonne, 2015).

1 Fieldwork linguists can, naturally, choose to create this translation in any language. In our example, for expository purposes, we used English.

2 We use "words" here to mean specifically orthographic units separated by whitespace in the target-language orthography; if such separation is not part of the orthography, then more complex approaches are needed, as is the case when users wish to run these analyses based on lexical roots (after morphological analysis).

Speaker     Spelling   Gloss                Part-of-Speech   Phonemes
SPEAKER1    waak       pig                  NOUN             wak
SPEAKER1    nen        that                 DET              nan
SPEAKER1    i=         3SG.REALIS.SUBJECT   PRON             i
SPEAKER1    pas        chase                VERB             pas
SPEAKER1    =ir        3PL.OBJECT           PRON             ir

Table 1: Annotations for an example recording in Nafsan (ISO 639 erk), DOI 10.4225/72/575C6E9CC9B6A. Here, the English translation would be "That pig chased them."

Frequency-ranked wordlists in particular can also be helpful for quality assurance purposes: words at the bottom of the list may occur only as one-off hapax legomena, but they may also be misspelled versions of more frequent words. Reviewing a frequency-ranked list may help find orthographic inconsistencies in the data set. In fact, an automatic toolkit could also emit additional information around annotation quality, such as:

• a frequency-ranked character list, with pointers to utterances that contain rarely used characters (on the assumption that they may be data entry mistakes)

• words that seem to be misspellings of another word, based on edit distance and/or context
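As an illustration of the second check, the sketch below pairs rare words with frequent words that are a small edit distance away, reusing the word counts from the earlier sketch. The thresholds are arbitrary placeholders that a real toolkit would expose as settings.

from collections import Counter
from nltk import edit_distance

def suspect_misspellings(counts: Counter,
                         max_dist=1, rare_max=2, common_min=10):
    # Rare words within max_dist edits of a frequent word are
    # candidate data-entry errors, surfaced for human review.
    frequent = [w for w, c in counts.items() if c >= common_min]
    for word, count in counts.items():
        if count <= rare_max:
            near = [f for f in frequent
                    if f != word and edit_distance(word, f) <= max_dist]
            if near:
                yield word, count, near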

For datasets in languages where comprehensive, clean wordlists are available, a transcription tool with built-in spell-checking or word-completion features would benefit users. If no clean wordlist exists yet, a frequency-based wordlist is a good start and can be human-curated. Going beyond just orthographic words, finite-state technologies may also be needed, especially for morphologically complex languages where stems may have more variations than could ever be covered by a simple wordlist, as is frequently described in the literature; two relevant recent surveys are (Mager et al., 2018; Littell et al., 2018).

If interlinear glosses are available in the corpus, the toolkit could also easily emit a list of all words and their glosses, highlighting words with multiple glosses, as these may (but do not necessarily) represent data entry inconsistencies; the same is true for part-of-speech tags.

Where phoneme annotations are present, a forced alignment with the orthographic transcription may be carried out to surface any outliers, as these may represent either orthographic or phonemic transcription errors (Jansche, 2014). Of course, a phoneme frequency analysis could also be produced and may yield interesting insights.

It should also be straightforward, though the linguistic value of this would remain to be determined, to apply an unsupervised automatic morphemic analyzer like Morfessor (Virpioja et al., 2013).
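Morfessor 2.0 ships with a Python API, so wiring it into a toolkit could be as simple as the sketch below; the corpus filename and the word being segmented are placeholders, not part of any real data set.

import morfessor

io = morfessor.MorfessorIO()
# One whitespace-tokenized utterance per line; the path is a placeholder.
train_data = list(io.read_corpus_file('transcriptions.txt'))

model = morfessor.BaselineModel()
model.load_data(train_data)
model.train_batch()

# Segment a (hypothetical) surface form into candidate morphs.
morphs, cost = model.viterbi_segment('example')
print(morphs)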

3.2 Visualization

Language documentation toolkits will benefit from integration with information visualization tools. Visual data analysis can give a linguist a general overview of the nature of a corpus, and narrow in on data to support new, detailed insights. For example, a broad view of a language collection can be given by visualizing metadata from the corpus, representing the activity of the documentation process as a heatmap of the recording dates/times/participants. Date and time visualizations could show patterns within recording locations and events, indicating how long it took to create the corpus, and how old the data is.

Visualizations showing which time spans within a corpus have been annotated on particular annotation tiers would allow a linguist to quickly obtain a sense of the state of transcription. A representation of transcription state showing what has been transcribed, the types of existing transcriptions, with summaries of metadata, will assist the language worker to understand the shape of their parallel layered data, and develop well-constructed corpora. This information could help prevent unexpected data loss when moving from complex multi-tier transcription to other modes of representation (Thieberger, 2016).

Detailed visualizations can also represent linguistic features of the data, giving the linguist on-demand access to explore a variety of phenomena, e.g. using ANNIS (Krause and Zeldes, 2016), such as the compositional structure of linguistic units (hierarchical data), associations between words (relational data) and alternative word choices (Culy and Lyding, 2010).

Finally, a toolkit could output a high-level overview of statistics and visualizations for a given corpus as a landing page: a description of the collection for online publishing, or to facilitate discovery of the data within an archive (Thieberger, 2016).

4 Extending Automatic Processing of Unannotated Data

4.1 Audio Analysis

Generally, part of the data set at hand will be missing at least some of the annotation layers in our example above. At the most basic level, it may not be known what parts of an audio collection contain human speech, what parts contain other types of sound (e.g. only nature or car noise), or are even just quiet. A speech vs. non-speech classifier, known as a voice activity detector or VAD (Ramirez et al., 2007), should be built into the toolkit. One open-source VAD is the WebRTC Voice Activity Detector, with an easy-to-use Python wrapper (Wiseman, 2016). Beyond voice activity detection, speaker identification or diarization may also be necessary. In this regard, the work recently done as part of the First DIHARD Speech Diarization Challenge (Church et al., 2018) is particularly relevant for fieldwork recordings, which are challenging corpora for these technologies.
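As a sketch of how little glue code a built-in VAD would need, the py-webrtcvad wrapper mentioned above classifies short frames of audio. The filename below is a placeholder, and the recording is assumed to be 16-bit mono PCM at one of the sample rates the detector supports (8, 16, 32, or 48 kHz).

import wave
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness mode: 0 (least) to 3 (most)

with wave.open('recording.wav', 'rb') as w:
    rate = w.getframerate()
    samples_per_frame = int(rate * 0.030)  # 30 ms frames
    t = 0.0
    while True:
        frame = w.readframes(samples_per_frame)
        if len(frame) < samples_per_frame * 2:  # 2 bytes per 16-bit sample
            break
        if vad.is_speech(frame, rate):
            print('speech at %.2f s' % t)
        t += 0.030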

Practically, entirely unannotated data sets are rare: usually at least the language of the data set is known. Where it isn't, or where multiple languages exist within a given recording, it will also be helpful to be able to identify the language spoken in each part of the audio, based on the recording alone, though this is a hard technical challenge (Gonzalez-Dominguez et al., 2014). As a first step, multiple existing recognizers could be executed, and their output could be analyzed to determine which one produces the most likely transcript, using an approach like that by (Demsar, 2006).

4.2 Automatic Phonemic and Orthographic Transcription

Once the location(s) and the language(s) of speech within the target audio recordings are known, the relevant audio fragments can be processed by speech recognition systems such as Elpis (Foley et al., 2018) or Persephone (Adams et al., 2018) in order to produce automatic machine hypotheses for phonemic and orthographic transcriptions. Today's Elpis and Persephone pipelines are maturing, and are relatively complete packages, but they could be made easier to use for people without backgrounds in speech science.

4.2.1 Pronunciations

For example, Elpis currently requires the linguist to provide a machine-readable grapheme-to-phoneme mapping. However, these may already be available from current libraries (Deri and Knight, 2016), derivable from existing databases like Ausphon-Lexicon (Round, 2017), or present in regional language collections such as the Australian Pronunciation Lexicon (Estival et al., 2014) and Sydney Speaks (CoEDL, 2018). These resources could be integrated directly into the toolkit. Elpis currently also offers no support for handling not-a-word tokens like numbers (van Esch and Sproat, 2017), but this could be supported using a relatively accessible grammar framework where the number grammar can be induced using a small number of examples (Gorman and Sproat, 2016).
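Such a grapheme-to-phoneme mapping is often just a finite lookup table applied longest-match-first. The toy mapping below is entirely hypothetical and for illustration only; a real table would come from the linguist or from the resources cited above.

# Hypothetical toy mapping; multi-character graphemes must be
# matched before single characters, hence the longest-match loop.
G2P = {'ng': 'ŋ', 'aa': 'aː', 'p': 'p', 'a': 'a', 's': 's'}

def to_phonemes(word):
    graphemes = sorted(G2P, key=len, reverse=True)
    out, i = [], 0
    while i < len(word):
        for g in graphemes:
            if word.startswith(g, i):
                out.append(G2P[g])
                i += len(g)
                break
        else:
            raise ValueError('no mapping for %r' % word[i:])
    return out

print(to_phonemes('paas'))  # ['p', 'aː', 's']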

4.2.2 Text Corpora

Toolkits should also facilitate easy ingestion of additional text corpora, which is particularly useful for training large-vocabulary speech recognition systems. For some endangered languages, text corpora such as Bible translations or Wikipedia datasets may be available to augment low-volume data sets coming in from language documentation work. Of course, the content of these sources may not be ideal for use in training conversational systems, but when faced with low-data situations, even out-of-domain data tends to help. As more text data becomes available, it would also be good to sweep language model parameters (such as the n-gram order) automatically on a development set to achieve the best possible result.
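Such a sweep is a few lines of code around any language-model library. The sketch below uses NLTK's n-gram models purely as one possible backend (not necessarily what Elpis or Persephone would use), and assumes train_sents and dev_sents are lists of token lists held out from the corpus; out-of-vocabulary handling is glossed over here.

from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

def dev_perplexity(order, train_sents, dev_sents):
    # Train an order-n model and score it on held-out sentences.
    train_grams, vocab = padded_everygram_pipeline(order, train_sents)
    lm = KneserNeyInterpolated(order)
    lm.fit(train_grams, vocab)
    dev_grams = [g for s in dev_sents
                 for g in ngrams(pad_both_ends(s, n=order), order)]
    return lm.perplexity(dev_grams)

# Example sweep over n-gram orders 2..5:
# best_order = min(range(2, 6),
#                  key=lambda n: dev_perplexity(n, train_sents, dev_sents))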

4.2.3 Audio Quality Enhancement

In terms of audio quality, it is worth pointing out that audio recorded during fieldwork is often made in noisy conditions, with intrusive animal and bird noises, environmental sounds, or air-conditioners whirring. Noisy corpora may become easier to process if automatic denoising and speech enhancement techniques are integrated into the pipeline. Loizou (2013) provides a recent overview of this particular area.

4.2.4 Multilingual Models

To further improve the quality of the output produced by the automatic speech recognition (ASR) systems within toolkits such as Elpis and Persephone, multilingual acoustic models can be trained on data sets from multiple languages (Besacier et al., 2014; Toshniwal et al., 2018). This could allow so-called zero-shot model usage, meaning the application of these acoustic models to new languages, without any training data at all in the target language. Of course, the accuracy of such an approach would depend on how similar the target language is to the languages that were included in the training data for the multilingual model. Another approach is simply to reuse a monolingual acoustic model from a similar language. Either way, these models can be tailored towards the target language as training data becomes available through annotation.

4.2.5 Automatic Creation of Part-of-Speech Tags and Interlinear Glosses

Beyond automatically generating hypotheses for phonemic and orthographic transcriptions, it would technically also be relatively straightforward to produce candidate part-of-speech tags or interlinear glosses automatically. Both types of tags tend to use a limited set of labels, as described by e.g. the Universal Part-of-Speech Tagset documentation (Petrov et al., 2012) and the Leipzig Glossing Rules (Comrie et al., 2008). Where words occur without a part-of-speech tag or an interlinear gloss in the corpus (for example, because a linguist provided an orthographic transcription but no tag, or because the orthographic transcription is an automatically generated candidate), a tag could be proposed automatically.

Most straightforwardly, this could be done by re-using an existing tag for that word, if the word was already tagged in another context. Where multiple pre-existing interlinear glosses occur for the same surface form, it would be possible to contextually resolve these homographs (Mazovetskiy et al., 2018). For part-of-speech tagging, many mature libraries exist, such as the Stanford Part-of-Speech Tagger (Toutanova et al., 2003). However, in both cases, given the limited amount of training data available within small data sets, accuracy is likely to be low, and it may be preferable to simply highlight the existing set of tags for a given word for a linguist to resolve. Surfacing these cases to a linguist may even bring up an annotation quality problem that can be fixed (if, in fact, there should have been only one tag for this form).
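The tag-reuse baseline just described fits in a dozen lines. The sketch below proposes the most frequent previously seen tag for a word and flags homographs (forms with more than one attested tag) for human review.

from collections import Counter, defaultdict

def build_tag_lexicon(tagged_corpus):
    # tagged_corpus: iterable of (word, tag) pairs already in the corpus.
    lexicon = defaultdict(Counter)
    for word, tag in tagged_corpus:
        lexicon[word][tag] += 1
    return lexicon

def propose_tag(word, lexicon):
    tags = lexicon.get(word)
    if not tags:
        return None, False          # unseen form: leave to the linguist
    best, _ = tags.most_common(1)[0]
    needs_review = len(tags) > 1    # several tags attested: surface it
    return best, needs_review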

Deriving interlinear glosses or part-of-speech tags for new surface forms that have not previously been observed could be significantly harder, depending on the target language. Achieving reasonable accuracy levels will require the use of morphological analyzers in many languages, e.g. as in (Arkhangeskiy et al., 2012). In polysynthetic languages in particular, development of such analyzers would be a challenging task (Mager et al., 2018; Littell et al., 2018), though see (Haspelmath, 2018) for a cautionary note on the term polysynthetic.

4.2.6 Automatic Machine Translation (MT)

Many language documentation workflows involve the translation of sentences from the target-language corpus into some other language, such as English, to make the content more broadly accessible. In effect, this creates a parallel corpus which can be used for training automatic machine translation (MT) systems. These MT systems could then, in turn, be used to propose translation candidates for parts of the corpus that are still lacking these translations: this may be because the linguist did not yet have time to translate their transcriptions, or because only machine-generated hypotheses are available for the transcriptions (in which case the cascading-error effect typically causes the accuracy of translations to be lower). Such candidate translations would still require human post-editing, but for limited-domain applications, machine translations may be sufficiently accurate to accelerate the annotation process. Of course, where other parallel corpora exist (e.g. bilingual story books, or religious material like the Bible), these can also be used as training data. In addition, any existing bilingual word lexicons (like dictionaries) can also help in building machine translation systems, as in (Klementiev et al., 2012).

Many high-quality open-source machine translation toolkits already exist, e.g. SockEye (Hieber et al., 2017), TensorFlow (Luong et al., 2017), and Moses (Koehn et al., 2007). Phrase-based machine translation still tends to yield better results than neural machine translation (NMT) for small data sets (Ostling and Tiedemann, 2017), but multilingual NMT models seem promising for cases where similar languages exist that do have large amounts of training data (Johnson et al., 2017; Gu et al., 2018). Recent advances even allow the creation of speech-to-translated-text models (Bansal et al., 2018) and enable the use of existing translations to enhance ASR quality (Anastasopoulos and Chiang, 2018).

5 Putting structured data and models to work more broadly

A toolkit designed to facilitate transcription and annotation efforts can benefit language workers and language communities beyond just the annotation output. Structured data sets used in these toolkits are ripe for conversion into formats needed for various other tasks, such as dictionary creation, e.g. Electronic Text Encoding and Interchange (TEI, 2018), or for dictionary publishing pipelines, as in Yinarlingi (San and Luk, 2018). Exported data could be formatted for import into a range of dictionary/publishing applications such as lexicon apps for mobile devices, or the ePub format (Gavrilis et al., 2013) for publishing on a wide range of devices such as smartphones, tablets, computers, or e-readers. Automatically generated pronunciations, word definitions, part-of-speech information and example sentences could easily be included. For difficult work such as automatic lemmatization, the output could be presented in a simple interface for human verification/correction before publication (Liang et al., 2014).

For languages to thrive in the digital domain, some technologies are considered essential: a standardized encoding; a standardized orthography; and some basic digital language resources, such as a corpus and a spell checker (Soria, 2018). If toolkits make it easy to create structured data sets and models for these languages, then these resources can also be applied outside the fields of language documentation, lexicography, and other linguistic research fields.

For example, Australia has a vibrant community-owned Indigenous media broadcasting sector. For these users, there is potential to re-use the ASR models to generate closed captions (subtitles) for the broadcast material. ASR transcription and translation technologies could be used to enrich the output of these Indigenous community media broadcasters, which would yield a significant improvement in the accessibility of broadcast content. Another option would be to facilitate the creation of smartphone keyboards with predictive text in the target languages, using toolkits like Keyman (SIL).

Language archives like PARADISEC could also benefit by applying these models, once created, to existing content for the same language that may have been contributed by other linguists, or that may have come in through other avenues. It may even be possible for these archives to offer web-based services to language communities, e.g. to allow them to use machine translation models for a given language in a web-based interface. Linguists could archive their models alongside their corpus; for some collections, the published models may inherit the same access permissions as the training data used in their creation, while some models may be able to be published under less restrictive conditions.

5.1 Data Set Availability

In general, to support the development of language technologies like the ones we described earlier, it is critical that software engineers are aware of, and have access to, data sets reflecting the diversity of endangered languages. Already, data sets are available for a range of languages, and in formats suitable for ASR, text-to-speech synthesis (TTS) and other tasks, on the Open Speech and Language Resources website (Povey, 2018). The addition of ML-ready data sets for more endangered languages would enable software engineers, whose primary concern may not be the language documentation process, to be involved in developing and refining speech tools and algorithms. However, access protocols and licenses in place for existing data sets can prohibit experimental development. Recording corpora specifically for the purposes of enabling ML experiments would avoid potentially lengthy negotiations of access rights.

6 Reaching more languages

Extending the reach of a toolkit beyond the (typically) few languages which are used when building and testing is critical. Applying technology to more diverse use cases encourages a tool to become more robust. With more use cases, a community of support can grow. A community of users, contributors and developers around a toolkit is important to encourage potential participants (Foley, 2015), and to reduce the burden of support on individual developers.

Being proactive in community discussions to publicize the abilities of tools, providing in-person access to training workshops, publishing online support material and tutorials, and ensuring tools have high-quality documentation for different levels of users are all important (albeit labor-intensive) methods of promoting and encouraging language technology to reach more languages.

7 Conclusion

Speech and language technology toolkits, designed specially for users without backgrounds in speech science, have the potential for significant impact for linguists and language communities globally. Existing toolkits can be enhanced with richer feature sets, and connection with workflow processes, helping language documentation workers. These toolkits enable language documentation workers to focus on deeper linguistic research questions, by presenting the results of automated systems in ways that let language experts easily verify and correct the hypotheses, yielding annotation speed-ups. At the same time, making these technologies and the structured data they help produce more widely available would benefit language communities in many ways.

Most of the technologies described here are readily available and easily implemented, while others are still highly experimental in their application and potential benefits for Indigenous languages. With language technology as a whole making rapid progress, and with an increasing amount of dialogue between fieldwork linguists and computer scientists, it is an exciting time to be working on computational methods for language documentation, with many advances that look to be within reach for the near future.

Acknowledgments

We would like to thank all the participants in the Transcription Acceleration Project (TAP), run by the Centre of Excellence for the Dynamics of Language, for many fruitful discussions on how technology could be applied in the language documentation context to assist fieldwork linguists. We would like to add a special word of thanks to Nick Thieberger, who provided the Nafsan example, and who offered valuable input on a draft version of this paper. We would also like to thank Zara Maxwell-Smith and Nicholas Lambourne for all of their insightful suggestions and contributions.

References

Oliver Adams, Trevor Cohn, Graham Neubig, Hilaria Cruz, Steven Bird, and Alexis Michaud. 2018. Evaluating phonemic transcription of low-resource tonal languages for language documentation. In Proceedings of LREC 2018.

Antonis Anastasopoulos and David Chiang. 2018. Leveraging translations for speech transcription in low-resource settings. CoRR, abs/1803.08991.

Timofey Arkhangeskiy, Oleg Belyaev, and Arseniy Vydrin. 2012. The creation of large-scale annotated corpora of minority languages using UniParser and the EANC platform. In Proceedings of COLING 2012: Posters, pages 83–92. The COLING 2012 Organizing Committee.

Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2018. Low-resource speech-to-text translation. CoRR, abs/1803.09164.

Laurent Besacier, Etienne Barnard, Alexey Karpov, and Tanja Schultz. 2014. Automatic speech recognition for under-resourced languages: A survey. Speech Communication, 56:85–100.

Brigitte Bigi. 2015. SPPAS: Multi-lingual approaches to the automatic annotation of speech. The Phonetician, 111–112(ISSN 0741-6164):54–69.

Paul Boersma and David Weenink. 2001. Praat, a system for doing phonetics by computer. Glot International, 5:341–345.

Hennie Brugman and Albert Russel. 2004. Annotating multi-media/multi-modal resources with ELAN. In Proceedings of LREC 2004.

Kenneth Church, Christopher Cieri, Alejandrina Cristia, Jun Du, Sriram Ganapathy, Mark Liberman, and Neville Ryant. 2018. The First DIHARD Speech Diarization Challenge. https://coml.lscp.ens.fr/dihard/index.html.

CoEDL. 2018. Sydney Speaks Project. http://www.dynamicsoflanguage.edu.au/sydney-speaks.

Bernard Comrie, Martin Haspelmath, and Balthasar Bickel. 2008. The Leipzig glossing rules: Conventions for interlinear morpheme-by-morpheme glosses.

Chris Culy and Verena Lyding. 2010. Visualizations for exploratory corpus and text analysis. In Proceedings of the 2nd International Conference on Corpus Linguistics CILC-10, pages 257–268.

Janez Demsar. 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30.

Aliya Deri and Kevin Knight. 2016. Grapheme-to-phoneme models for (almost) any language. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 399–408.

DGA. 2014. TranscriberAG: A tool for segmenting, labeling and transcribing speech. http://transag.sourceforge.net.

Ewan Dunbar, Xuan-Nga Cao, Juan Benjumea, Julien Karadayi, Mathieu Bernard, Laurent Besacier, Xavier Anguera, and Emmanuel Dupoux. 2017. The zero resource speech challenge 2017. CoRR, abs/1712.04313.

Gautier Durantin, Ben Foley, Nicholas Evans, and Janet Wiles. 2017. Transcription survey.

Daan van Esch and Richard Sproat. 2017. An expanded taxonomy of semiotic classes for text normalization. In Proceedings of Interspeech 2017.

Dominique Estival, Steve Cassidy, Felicity Cox, and Denis Burnham. 2014. AusTalk: An audio-visual corpus of Australian English. In Proceedings of LREC 2014.

Ben Foley. 2015. Angkety map digital resource report. Technical report.

Ben Foley, Josh Arnold, Rolando Coto-Solano, Gautier Durantin, T. Mark Ellison, Daan van Esch, Scott Heath, Frantisek Kratochvíl, Zara Maxwell-Smith, David Nash, Ola Olsson, Mark Richards, Nay San, Hywel Stoakes, Nick Thieberger, and Janet Wiles. 2018. Building speech recognition systems for language documentation: The CoEDL Endangered Language Pipeline and Inference System. In Proceedings of the 6th International Workshop on Spoken Language Technologies for Under-resourced Languages (SLTU 2018).

Dimitris Gavrilis, Stavros Angelis, and Ioannis Tsoulos. 2013. Building interactive books using EPUB and HTML5. In Ambient Media and Systems, pages 31–40, Cham. Springer International Publishing.

Javier Gonzalez-Dominguez, Ignacio Lopez-Moreno, and Hasim Sak. 2014. Automatic language identification using long short-term memory recurrent neural networks. In Proceedings of Interspeech 2014.

Kyle Gorman and Richard Sproat. 2016. Minimally supervised number normalization. Transactions of the Association for Computational Linguistics, 4:507–519.

Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O.K. Li. 2018. Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 344–354. Association for Computational Linguistics.

Martin Haspelmath. 2018. The last word on polysynthesis: A review article. Linguistic Typology, 22:307–326.

Felix Hieber, Tobias Domhan, Michael Denkowski, David Vilar, Artem Sokolov, Ann Clifton, and Matt Post. 2017. Sockeye: A toolkit for neural machine translation. CoRR, abs/1712.05690.

Martin Jansche. 2014. Computer-aided quality assurance of an Icelandic pronunciation dictionary. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland. European Language Resources Association (ELRA).

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viegas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.

Alexandre Klementiev, Ann Irvine, Chris Callison-Burch, and David Yarowsky. 2012. Toward statistical machine translation without parallel corpora. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL '12, pages 130–140, Stroudsburg, PA, USA. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In The Annual Conference of the Association for Computational Linguistics.

Thomas Krause and Amir Zeldes. 2016. ANNIS3: A new architecture for generic corpus query and visualization. Digital Scholarship in the Humanities, 31(1):118–139.

Y. Liang, K. Iwano, and Koichi Shinoda. 2014. Simple gesture-based error correction interface for smartphone speech recognition, pages 1194–1198.

Patrick Littell, Anna Kazantseva, Roland Kuhn, Aidan Pine, Antti Arppe, Christopher Cox, and Marie-Odile Junker. 2018. Indigenous language technologies in Canada: Assessment, challenges, and successes. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2620–2632. Association for Computational Linguistics.

Philipos C. Loizou. 2013. Speech Enhancement: Theory and Practice, 2nd edition. CRC Press, Inc., Boca Raton, FL, USA.

Minh-Thang Luong, Eugene Brevdo, and Rui Zhao. 2017. Neural machine translation (seq2seq) tutorial. https://github.com/tensorflow/nmt.

Manuel Mager, Ximena Gutierrez-Vasques, Gerardo Sierra, and Ivan Meza. 2018. Challenges of language technologies for the indigenous languages of the Americas.

Gleb Mazovetskiy, Kyle Gorman, and Vitaly Nikolaev. 2018. Improving homograph disambiguation with supervised machine learning. In LREC.

Alexis Michaud, Oliver Adams, Trevor Anthony Cohn, Graham Neubig, and Severine Guillaume. 2018. Integrating automatic transcription into the language documentation workflow: Experiments with Na data and the Persephone toolkit.

Robert Ostling and Jorg Tiedemann. 2017. Neural machine translation for low-resource languages. CoRR, abs/1708.05729.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Dan Povey. 2018. Open speech and language resources. http://www.openslr.org.

The Language Archive, Max Planck Institute for Psycholinguistics. ELAN. https://tla.mpi.nl/tools/tla-tools/elan/.

J. Ramirez, J. M. Gorriz, and J. C. Segura. 2007. Voice activity detection: Fundamentals and speech recognition system robustness. In Robust Speech Recognition and Understanding. I-Tech Education and Publishing.

Erich R. Round. 2017. The AusPhon-Lexicon project: Two million normalized segments across 300 Australian languages.

Nay San and Ellison Luk. 2018. Yinarlingi: An R package for testing Warlpiri lexicon data. https://github.com/CoEDL/yinarlingi.

SIL. Keyman: Type to the world in your language. https://keyman.com/.

Claudia Soria. 2018. Digital Language Survival Kit. http://www.dldp.eu/en/content/digital-language-survival-kit.

TEI. 2018. Guidelines for electronic text encoding and interchange. http://www.tei-c.org/P5/.

Nick Thieberger. 2016. Language documentation tools and methods summit report. https://docs.google.com/document/d/1p2EZufVIm2fOQy4aIZ100jnZjfcuG-Bb9YKiXMggDRQ.

Shubham Toshniwal, Tara N. Sainath, Ron J. Weiss, Bo Li, Pedro J. Moreno, Eugene Weinstein, and Kanishka Rao. 2018. Multilingual speech recognition with a single end-to-end model. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4904–4908.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, NAACL '03, pages 173–180, Stroudsburg, PA, USA. Association for Computational Linguistics.

Sami Virpioja, Peter Smit, Stig-Arne Gronroos, and Mikko Kurimo. 2013. Morfessor 2.0: Python implementation and extensions for Morfessor Baseline. Technical report.

Martijn Wieling and John Nerbonne. 2015. Advances in dialectometry. Annual Review of Linguistics, 1(1):243–264.

John Wiseman. 2016. Python interface to the WebRTC Voice Activity Detector. https://github.com/wiseman/py-webrtcvad.


Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages: Vol. 1 Papers, pages 23–27, Honolulu, Hawai‘i, USA, February 26–27, 2019.

OCR evaluation tools for the 21st century

Eddie Antonio Santos

Canadian Indigenous languages technology project, National Research Council Canada
Alberta Language Technology Lab, University of Alberta, Edmonton, Canada

[email protected]

Abstract

We introduce ocreval, a port of the ISRI OCR Evaluation Tools, now with Unicode support. We describe how we upgraded the ISRI OCR Evaluation Tools to support modern text processing tasks. ocreval supports producing character-level and word-level accuracy reports, supporting all characters representable in the UTF-8 character encoding scheme. In addition, we have implemented the Unicode default word boundary specification in order to support word-level accuracy reports for a broad range of writing systems. We argue that character-level and word-level accuracy reports produce confusion matrices that are useful for tasks beyond OCR evaluation, including tasks supporting the study and computational modeling of endangered languages.

1 Introduction

Optical Character Recognition (OCR) is the process of detecting text in an image, such as a scan of a page, and converting the written language into a computer-friendly text format. While OCR is well-established for majority languages (Tesseract Langdata), the same cannot be said for endangered languages (Hubert et al., 2016). When developers create OCR models for low-resource languages, they must have a tool for evaluating the accuracy of their models. This will aid in making decisions on how to model the language, and how to tweak parameters in order to accurately recognize text in that language, especially when labelled materials are scarce.

A further complication encountered when developing OCR models for endangered languages is that of orthography: a language may have many orthographies, or may not have a standardized orthography at all. And if any orthographies exist, they likely contain characters beyond the non-accented Latin alphabet. IPA-influenced glottal stop (ʔ) and schwa (ə) characters have a tendency to sneak into new orthographies. Historically, computer software provides poor support for characters outside the range of those characters commonly found in Western European majority language orthographies.

The ISRI OCR Evaluation Tools were last released in 1996 by the University of Nevada, Las Vegas's Information Science Research Institute (Rice and Nartker, 1996). Despite their age, the tools continue to be useful to this day for the task of evaluating new OCR models (Hubert et al., 2016). Its accuracy and wordacc tools are of particular utility. However, being somewhat dated, it has limited and awkward support for non-European orthographies. Since the tools' 1996 release, the Unicode standard has gradually improved support for non-European characters, introducing many new scripts and characters. However, the ISRI OCR Evaluation Tools have not been upgraded since.

In this paper, we present an updated version of the ISRI OCR Evaluation Tools, which we named ocreval. Our contributions include:

• Port the ISRI OCR Evaluation Tools onto modern macOS and Ubuntu Linux systems;

• Add support for UTF-8, enabling the analysis of 1,112,064 unique characters;

• Implement the default Unicode word boundary specification to evaluate word accuracy.

We describe in more depth what we have changed and why; and we postulate how ocreval can be used to support tasks unrelated to OCR that may benefit people working in endangered language study and tool development. Before continuing, we would be remiss not to mention OCRevalUAtion (Carrasco, 2014), an alternative to the toolkit presented in this paper.


2 How does ocreval extend the ISRI OCR Evaluation Tools?

ocreval is a port of the ISRI OCR Evaluation Tools to modern systems. This means that much of the codebase and functionality is the same as the codebase written in 1996; however, the original code presents many incompatibilities with current systems (usage of older programming language features and reliance on out-of-date source code libraries). When we downloaded the ISRI OCR Evaluation Tools for the first time, its C programming language source code would not compile (translate source code into a usable application). After a few modifications to the source code, we were able to compile and run the ISRI OCR Evaluation Tools on our systems.

The second problem is how text was represented in the system. The ISRI OCR Evaluation Tools were written in the early days of Unicode, when Unicode was strictly a 16-bit character set. As such, the ISRI OCR Evaluation Tools can handle the first 65,536 characters defined in the Unicode standard; however, as of Unicode 11.0 (Unicode Consortium, 2018), there are 1,112,064 possible characters total.1 Only 137,374 (12.33%) characters are defined in the Unicode standard, with the rest of the code space reserved for future use. This means that the ISRI OCR Evaluation Tools can only represent half of the characters defined in the modern-day Unicode standard. Among the characters that are not representable in the ISRI OCR Evaluation Tools are characters in recently created writing systems, such as the Osage alphabet (Everson et al., 2014), as well as historical writing systems (Cuneiform, Linear B, Old Italic). Thus, ocreval extends the ISRI OCR Evaluation Tools by internally representing characters using the UTF-32 character encoding form, allowing for the analysis of characters from writing systems both new and old.
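The code-space arithmetic behind the 1,112,064 figure (see also footnote 1 below) can be checked in a few lines of Python:

total_code_points = 0x110000           # U+0000 through U+10FFFF: 1,114,112
surrogates = 0xE000 - 0xD800           # U+D800 through U+DFFF: 2,048
print(total_code_points - surrogates)  # 1112064 encodable characters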

2.1 UTF-8 support

The most significant hurdle remaining is how Unicode characters are input and output by ocreval. As of October 2018, UTF-8 is the most common character encoding scheme on the web, used for 92.4% of websites (W3Techs). This is not surprising, as UTF-8 represents all 1,112,064 characters possible in the Unicode standard in a way that is backwards-compatible with 7-bit ASCII. The ISRI OCR Evaluation Tools predate the widespread adoption of UTF-8; as such, they assume that any text input is encoded in ISO-8859-1 or "Latin-1" encoding. Latin-1, intended for use with Western European languages, only encodes about 192 printable characters, most of which are Latin alphabetic characters; as such, Latin-1 is usually inadequate for encoding endangered and minority languages. The ISRI OCR Evaluation Tools do support 16-bit Unicode (65,536 characters); however, they use an ad hoc format called "Extended ASCII". The ISRI OCR Evaluation Tools bundled the uni2asc and asc2uni tools to convert to and from the Extended ASCII format and the (now outdated) 16-bit UCS-2 character encoding. Anybody wishing to use characters outside the range of Latin-1 would have to first use uni2asc before using their documents with any of the available tools. If their original documents were encoded in UTF-8, they would have to do two conversions: first, convert from UTF-8 to UCS-2 using a tool such as iconv; then, convert the UCS-2 file into Extended ASCII using uni2asc.

In a time when UTF-8 exists and is readily available, we find that Extended ASCII is too much of a hassle. As such, we modified the file reading and writing routines used in all utilities provided in ocreval to open files in the UTF-8 character encoding scheme exclusively, obviating the need to convert into an ad hoc format such as Extended ASCII. Since all input and output is done in UTF-8, we removed the now-redundant asc2uni and uni2asc utility programs from ocreval.

1 There are 1,114,112 values defined in the Unicode 11.0 code space; however, 2,048 of these are surrogate characters which cannot encode a character by themselves; hence, we removed these characters from the total.

2.2 Unicode word segmentation

One of the ISRI OCR Evaluation Tools's most useful utilities is wordacc. As its name implies, wordacc computes the recognition accuracy for entire words, rather than for single characters alone. However, what constituted a "word" was rather narrowly defined in the previous version. Originally, a "word" was a series of one or more consecutive characters in the range of lowercase ASCII characters (U+0061–U+007A), or lowercase Latin-1 characters (U+00DF–U+00F6, U+00F8–U+00FF). Input would have to be converted to lowercase before calculating word accuracy. Any other characters were not considered to be part of a word, and hence would not appear in word accuracy summaries or confusion matrices. This narrow definition of a "word" limits the usefulness of wordacc, even for Latin-derived scripts.

To broaden its usefulness, we changed how ocreval finds words in text by adopting Unicode's default word boundary specification (Unicode Standard Annex #29). This specifies a reasonable default for finding the boundaries between words and other words, and between words and non-word characters. Then, to find words proper, we extract character sequences between boundaries that start with a "word-like" character, such as letters, syllables, digits, and symbols that frequently appear as parts of words. We also considered private-use characters to be "word-like" characters so that if a language's orthography is not yet encoded or is poorly supported in Unicode, its characters can be represented using private-use characters, which are set aside by the Unicode standard as allocated code points to represent any character the user requires.

A caveat is that this algorithm will not work for all writing systems. Scripts like Thai, Lao, and Khmer, which do not typically have any spaces separating words, will not be handled properly. As such, this is not a "one-size-fits-all" solution, and may need to be tailored depending on the language.
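For readers who want to experiment with UAX #29 word boundaries outside of ocreval, ICU exposes the same default rules. The Python sketch below is a rough equivalent of the word extraction described above, using str.isalnum() as a simplified stand-in for ocreval's richer "word-like" test (which, as noted, also admits private-use characters).

from icu import BreakIterator, Locale  # provided by the PyICU package

def words(text, locale='en'):
    # Yield segments between UAX #29 default word boundaries that
    # contain at least one word-like character.
    bi = BreakIterator.createWordInstance(Locale(locale))
    bi.setText(text)
    start = 0
    for end in bi:
        segment = text[start:end]
        if any(ch.isalnum() for ch in segment):
            yield segment
        start = end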

2.3 Installing ocreval

In all cases, the user must have a basic understanding of the command line interface of their computer. However, we have tried to document and streamline the process when possible.

ocreval can be installed on a modern macOS system using the Homebrew package manager:

$ brew tap eddieantonio/eddieantonio
$ brew install ocreval

On Ubuntu Linux, the tools can be installed from source code. After downloading the zip archive,2 install all system dependencies and extract the contents of the source code archive. Within the newly created directory, issue the make command from the command line:

$ sudo apt update
$ sudo apt install build-essential \
      libutf8proc-dev unzip
$ unzip ocreval-master.zip
$ cd ocreval-master/
$ make
$ sudo make install

2 https://github.com/eddieantonio/ocreval/archive/master.zip

ocreval can also be installed on Windows 10 within the Ubuntu app, obtainable in the Microsoft app store. Copy the downloaded zip archive into the Ubuntu subsystem, then follow the same steps as the Ubuntu Linux install.

2.4 The importance of open source

Software is open source when its source code is publicly accessible, under a license that permits anyone to make modifications and share the modifications with others. We have released ocreval as open source for many reasons: it maximizes the number of people that can benefit from the tool by making it freely accessible; hosting the software on the collaborative coding platform GitHub allows people around the world to share enhancements to ocreval for everybody's benefit; and the ISRI OCR Evaluation Tools were originally released as open source.

ocreval is maintained on GitHub at https://github.com/eddieantonio/ocreval. On GitHub, changes to the source code are transparent and publicly visible. Contributions are welcome in the form of issue reports and pull requests. Issue reports alert us to any bugs or inadequacies found in the currently published version of the software; pull requests allow volunteers to write suggested source code enhancements to share with the rest of the community. Creating an issue report or a pull request both require a GitHub account. We welcome anyone who wants to contribute to join in; no contribution is too small!

The ISRI OCR Evaluation Tools were released under an open source licence. The significance of this cannot be overstated. If the original source code was not available on (the now defunct) Google Code repository, this paper would not have been possible. As such, our contributions are also released under an open source license, in the hopes that they may also be useful in 20 years' time.

3 How can ocreval help endangered languages?

Hubert et al. (2016) have already used ocreval to evaluate the accuracy of OCR models for Northern Haida, a critically endangered language spoken in Western Canada. Unicode support was integral to representing and evaluating the idiosyncratic orthography specific to the corpus in question (which differs from modern-day Haida orthographies).

Previously, we mentioned that among the ISRI OCR Evaluation Tools' most useful utilities are accuracy and wordacc. We see these utilities in ocreval as much more general-purpose tools, not only limited to the evaluation of OCR output. Fundamentally, these tools compare two text files, producing a report of which character sequences are misclassified or "confused" for other character sequences. This concept is useful beyond the evaluation of OCR models.

One possible non-OCR application could be to study language change. Parallel texts, whether prose or a word list, could be fed into accuracy. One part of the input pair would be representative of language spoken or written in the modern day, and the other part of the input would be a historical example, reconstructed text, or text from a distinct dialect. accuracy will then produce minimal "confusions", or sequences of commonly misclassified sequences. Since accuracy prints statistics for how often a confusion is found, accuracy inadvertently becomes a tool for reporting systematic changes, along with how often the effect is attested. Preliminary work by Arppe et al. (2018) used ocreval to compare Proto-Algonquian to contemporary Plains Cree. Using an extensive database with over ten thousand pairings of morphologically simple and complex word stems (mapping each modern Cree word form to the corresponding reconstructed Proto-Algonquian form), they found that the contents of the confusion matrix matched quite closely with the posited historical sound change rules. In addition, the confusion matrix can be used to quantify how often a particular rule applies in the overall Cree vocabulary. ocreval's added Unicode support facilitates this use case, as sound change rules are hardly ever representable in only Latin-1 characters.
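The underlying idea, counting which substrings systematically differ between aligned pairs of strings, is easy to prototype. The sketch below is not ocreval's actual algorithm, only a rough Python approximation of it built on standard-library sequence alignment.

from collections import Counter
from difflib import SequenceMatcher

def confusions(pairs):
    # pairs: iterable of (historical form, modern form) strings,
    # e.g. drawn from a parallel word list.
    counts = Counter()
    for old, new in pairs:
        sm = SequenceMatcher(None, old, new, autojunk=False)
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            if op != 'equal':
                counts[(old[i1:i2], new[j1:j2])] += 1
    return counts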

Lastly, throughout this paper we have assumed that corpora are encoded in UTF-8, in which case using ocreval should be straightforward. However, this is not always true of minority language resources, even when they are encoded digitally. One way minority language texts may have been encoded is by "font-encoding" or "font-hacking": a specially designed font overrides the display of existing code points, as opposed to using the Unicode code points assigned to that particular orthography. This may be because the font was designed for pre-Unicode systems, or because Unicode lacked appropriate character assignments when the font was developed. For example, in place of the ordinary character "©", the font will render "ǧ", and in place of "¬", the font will render "ĺ". For these cases, ocreval alone is insufficient; an external tool such as convertextract (Pine and Turin, 2018) must be used.
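A minimal sketch of the kind of remapping involved, assuming the font's substitution table is known; in practice convertextract is the robust solution, and the two mappings below are just the examples given above.

# Map legacy code points to the characters the hacked font actually displayed.
FONT_TO_UNICODE = str.maketrans({
    '\u00a9': '\u01e7',  # '©' was rendered as 'ǧ'
    '\u00ac': '\u013a',  # '¬' was rendered as 'ĺ'
})

def decode_font_hacked(text: str) -> str:
    return text.translate(FONT_TO_UNICODE)

print(decode_font_hacked('©a¬a'))  # -> 'ǧaĺa'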

4 Conclusion

We presented ocreval, an updated version of the ISRI OCR Evaluation Tools. ocreval provides a suite of tools for evaluating the accuracy of optical character recognition (OCR) models; however, we postulate that the evaluation tools generalize to other tasks. Our contributions include porting the old codebase so that it works on modern systems; adding support for the ubiquitous UTF-8 character encoding scheme, as well as internally representing characters using the full Unicode code space; and implementing the default Unicode word boundary specification, which facilitates finding words in a variety of non-Latin texts. We released ocreval online as open-source software, with the intent to make our work freely accessible and to encourage contributions from the language technology community at large.

Acknowledgments

The author wishes to thank Isabell Hubert for involving us in this project; the author is grateful to Antti Arppe, Anna Kazantseva, and Jessica Santos for reviewing early drafts of this paper.

References

Antti Arppe, Katherine Schmirler, Eddie Antonio Santos, and Arok Wolvengrey. 2018. Towards a systematic and quantitative identification of historical sound change rules: Algonquian historical linguistics meets optical character recognition evaluation. Working paper presented at the Alberta Language Technology Lab seminar, November 19, 2018. http://altlab.artsrn.ualberta.ca/wp-content/uploads/2019/01/OCR_and_proto-Algonquian_nonanon.pdf.

Rafael C. Carrasco. 2014. An open-source OCR evaluation tool. In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, pages 179–184. ACM. https://github.com/impactcentre/ocrevalUAtion.

Michael Everson, Herman Mongrain Lookout, and Cameron Pratt. 2014. Final proposal to encode the Osage script in the UCS. Technical report, ISO/IEC JTC1/SC2/WG2. Document N4619.

Homebrew. 2009. Homebrew: The missing package manager for macOS. https://brew.sh/. (Accessed on 10/22/2018).

Isabell Hubert, Antti Arppe, Jordan Lachler, and Eddie Antonio Santos. 2016. Training & quality assessment of an optical character recognition model for Northern Haida. In LREC.

Aidan Pine and Mark Turin. 2018. Seeing the Heiltsuk orthography from font encoding through to Unicode: A case study using convertextract. Sustaining Knowledge Diversity in the Digital Age, page 27.

Stephen V. Rice and Thomas A. Nartker. 1996. The ISRI analytic tools for OCR evaluation. UNLV/Information Science Research Institute, TR-96-02.

Tesseract Langdata. 2018. tesseract-ocr/langdata: Source training data for Tesseract for lots of languages. https://github.com/tesseract-ocr/langdata. (Accessed on 10/22/2018).

The Unicode Consortium, editor. 2018. The Unicode Standard, Version 11.0.0. The Unicode Consortium, Mountain View, CA.

Unicode Standard Annex #29. 2018. Unicode text segmentation. Edited by Mark Davis. An integral part of The Unicode Standard. http://unicode.org/reports/tr29/. (Accessed on 10/22/2018).

W3Techs. 2018. Usage statistics of character encodings for websites, October 2018. https://w3techs.com/technologies/overview/character_encoding/all. (Accessed on 10/22/2018).

27

Page 40: Proceedings of the 55th Annual Meeting of the Association for … · 2019. 8. 6. · Karol Nowakowski, Michal Ptaszynski, Fumito Masui and Yoshio Momouchi 15:45–16:15 Break 16:15–16:45

Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages: Vol. 1 Papers, pages 28–38, Honolulu, Hawai‘i, USA, February 26–27, 2019.

Handling cross-cutting properties in automatic inference of lexical classes: A case study of Chintang

Olga Zamaraeva∗
University of Washington
Department of Linguistics
olzama@uw.edu

Kristen Howell∗
University of Washington
Department of Linguistics
kshowell@uw.edu

Emily M. Bender
University of Washington
Department of Linguistics
ebender@uw.edu

Abstract

In the context of the ongoing AGGREGATION project, concerned with inferring grammars from interlinear glossed text (IGT), we explore the integration of morphological patterns extracted from IGT data with inferred syntactic properties in the context of creating implemented linguistic grammars. We present a case study of Chintang, in which we put emphasis on evaluating the accuracy of these predictions by using them to generate a grammar and parse running text. Our coverage over the corpus is low because the lexicon produced by our system only includes intransitive and transitive verbs and nouns, but it outperforms an expert-built, oracle grammar of similar scope.

1 Introduction

Machine-readable grammars are useful for linguistic hypothesis testing via parsing and treebanking (selecting the best parse for each sentence) because they represent internally coherent models and can be explored automatically (Bender et al., 2011, 2012a; Fokkens, 2014; Müller, 1999, 2016). Multilingual grammar engineering frameworks such as CoreGram (Müller, 2015) and the LinGO Grammar Matrix (Bender et al., 2002, 2010) facilitate the development of machine-readable grammars by providing shared analyses, but still require significant human effort to select the appropriate analyses for a given language. To partially automate this process, the AGGREGATION project takes advantage of the stored analyses in the Grammar Matrix and the linguistic information encoded in Interlinear Glossed Text (IGT). While at this stage the project efforts are mostly experimental in nature and focus on evaluating grammars obtained in this way, there have already been successful collaborations with documentary linguists, which in at least one case led to insights into the language's morphological patterns (Zamaraeva et al., 2017).

The IGT data format, widely used by linguists, is well suited to inference tasks because it features detailed morpheme-by-morpheme annotation and translations into high-resource languages (Xia and Lewis, 2007; Bender et al., 2013). However, the inference processes required are heterogeneous. On the morphological side, an inference system identifies position and inflectional classes. On the syntactic side, an inference system uses syntactic generalizations to identify broad characteristics and defines lexical classes according to syntactic properties. The challenge we address here is in integrating the two. In this paper, we integrate a system that identifies morphological patterns in IGT data with one that predicts syntactic properties, to define lexical classes that encode both morphological and syntactic features. We evaluate by replicating the case study of Bender et al. (2014), in which they automatically produced separate grammar fragments based on morphotactic and syntactic information and evaluated their system on a corpus of Chintang. We compare their results to our integrated system, which includes both morphotactic and syntactic properties.1

∗ The first two authors made equal contributions.
1 Our code and sample data are available here: https://git.ling.washington.edu/agg/repro/computel3-ctn

2 Chintang

Chintang (ISO 639-3: ctn) is a Sino-Tibetan language spoken by ∼5000 people in Nepal (Bickel et al., 2010; Schikowski et al., 2015). Here we summarize the characteristics of the language that are directly relevant to this case study.

Tool           | Task
SIL Toolbox    | Export original IGT data to plain text
Xigt           | Convert data into a robust data model with associated processing package
INTENT         | Enrich the data: add phrase structure and POS tags to the translation line and project them to the source language line
Inference code | Create grammar specification: case system, case frames of verbs
MOM            | Create grammar specification: inflection and position classes
Grammar Matrix | Create grammar on the basis of the specifications
LKB            | Run the grammar on held-out sentences
[incr tsdb()]  | Treebank: inspect the parses for correctness

Table 1: AGGREGATION pipeline; bold (the Inference code and MOM rows) indicates this paper's contribution

The relative order of the verb and its core arguments (hereafter 'word order') in Chintang is described as free by Schikowski et al. (2015), in that all verb and argument permutations are valid in the language, with the felicity of these combinations being governed primarily by information structure. Although Schikowski et al. (2015) note that no detailed analysis has been carried out regarding what other factors condition word order, they say that SV, APV and AGTV are the most frequent orders that they observe.2 Dropping of core arguments is also common in the language (Schikowski et al., 2015).

The case system follows an ergative-absolutive pattern, with some exceptions (Stoll and Bickel, 2012; Schikowski et al., 2015). In ergative-absolutive languages, the subject of an intransitive verb has the same case marking as the most patient-like argument of a transitive verb, typically referred to as absolutive. The most agent-like argument of a transitive verb has a distinct case marking, usually called ergative. In Chintang, ergative case is marked with an overt marker while absolutive is zero-marked. A number of exceptions to the ergative-absolutive pattern arise due to valence-changing operations (such as reflexive and benefactive). Other exceptions include variable ergative marking on first and second person pronouns and an overt absolutive marker on the pronoun sa 'who'.3

Chintang's flexible word order, scarcity of overt case marking, and frequent argument dropping introduce challenges for grammar inference. First, the variety of phrase structure rules required to accommodate free word order, in addition to argument optionality, introduces potential ambiguity into any implemented grammar. In some cases, this ambiguity is legitimate (in that multiple parses map to multiple semantic readings), but in other cases it may be an indication of underconstrained rules. Second, the lack of overt absolutive case marking in Chintang, together with common pronoun dropping, results in relatively few overt case morphemes in the corpus for a syntactic inference script to use.

2 S=subject, P=patient, G=goal, T=theme, A=agent.
3 For a much more detailed account of the case frames of various verb types, see Schikowski et al. 2015.

3 AGGREGATION

The AGGREGATION project4 is dedicated to automating the creation of grammars from IGT.5 Table 1 presents all the tools involved in the pipeline, with information on which task each performs. We elaborate on the pieces of the pipeline below and encourage the reader to refer back to this table as needed to track what each component is.

As part of the AGGREGATION project, Bender et al. (2014) present the first end-to-end study of grammar inference from IGT, extracting broad syntactic properties (namely, word order, case system and case frame) and morphological patterns, and testing the coverage of the resulting grammars on held-out data. They used the methodology of Bender et al. (2013) for syntactic inference and that of Wax (2014) for morphological inference. However, they left integrating the two, and creating grammars which benefit from both types of information, to future work.

Like Bender et al. (2014), we take advantage of the Grammar Matrix customization system (Bender et al., 2002, 2010), which creates precision grammars that emphasize syntactically accurate coverage and attempt to minimize ambiguity.

4 http://depts.washington.edu/uwcl/aggregation/
5 We are aware of one project with similar goals: TypeGram (see e.g. Hellan and Beermann, 2014), couched in HPSG as well. Our pipeline places fewer expectations on the IGT annotation, inferring phrase structure, part of speech, and valence automatically.


It uses stored analyses for particular phenomena to create a grammar based on a user's specification of linguistic properties. These specifications are recorded in a 'choices file'. We follow the methodology of Bender et al. (2014) in formatting the output of our inference systems as a choices file, so that it can be directly input to the Grammar Matrix for grammar customization.

Unlike Bender et al. (2014), we take advantage of the Xigt data model (Goodman et al., 2015). This extensible and flexible format encodes the information in IGT data in such a way that relations between bits of information, such as the connection between a morpheme and its gloss, can be easily identified. Data encoded with Xigt is compatible with the INTENT system for enriching IGT by parsing the English translation and projecting that information onto the source language text (Georgi, 2016). Where Bender et al. (2014) use the methodology of Xia and Lewis (2007), in this work we use INTENT. We also use updated, Xigt-compatible versions of morphological and lexical class inference (Zamaraeva, 2016; Zamaraeva et al., 2017) and case system inference (Howell et al., 2017).
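As a toy illustration of the kind of explicit morpheme-gloss linking that Xigt provides (this is not the actual Xigt API; its data model, with corpora, tiers and alignment expressions, is much richer):

from dataclasses import dataclass

@dataclass
class Morpheme:
    form: str   # segmented morpheme, e.g. 'khatt'
    gloss: str  # aligned gloss, e.g. 'take'
    pos: str    # word-level POS tag projected onto the morpheme

def align(morphs: str, glosses: str, pos: str):
    """Pair a morpheme-segmented word with its gloss, hyphen by hyphen."""
    m, g = morphs.split('-'), glosses.split('-')
    assert len(m) == len(g), 'misaligned IGT'
    return [Morpheme(a, b, pos) for a, b in zip(m, g)]

print(align('khatt-e', 'take-ind.pst', 'v'))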

4 Methodology

Our goal is to maximize the information that we can learn about a language, both morphologically and syntactically, in order to produce grammars that parse strings with minimal ambiguity. We present a methodology that analyzes morphological and syntactic information independently and then creates lexical classes that share the information from both analyses.

4.1 Morphotactic inference with MOM

MOM is a system which infers morphotactic graphs from IGT. Developed originally by Wax (2014) to infer position classes, it was updated to work with the Xigt data model by Zamaraeva (2016) and to infer inflectional classes by Zamaraeva et al. (2017). Below we summarize the main inference algorithm shared across these different versions. As with most work using MOM and the Grammar Matrix, we target the morpheme-segmented line, and assume that the grammars produced will eventually need to be paired with a morphophonological analyzer to map to surface spelling or pronunciation.

[Figure 1: a graph whose nodes are the stems am, port, cap, fac and the suffixes -are, -o, -io, -ere, with edges connecting each stem to the suffixes it occurs with]

Figure 1: Sample graph MOM will initially build on the example training data consisting of the Latin verbs am-are ('to love'), port-are ('to carry'), port-o ('I carry'), cap-io ('I take'), fac-io ('I do'), and fac-ere ('to do').

[Figure 2: the compressed graph, with merged stem nodes {am, port} and {cap, fac} and merged suffix nodes {-are, -o} and {-ere, -io}]

Figure 2: Sample MOM output, after compressing the graph in Figure 1 with a 50% overlap value.

MOM starts by reading in IGT that has been enriched with information about each morpheme, including a part of speech (POS) tag for the word it belongs to, as well as whether it is an affix or a stem. Then MOM iterates over the items with the relevant POS tag and builds a graph whose nodes are stems and affixes and whose edges are input relationships. For example, if MOM sees the Latin verbs am-are ('to love'), port-are ('to carry'), port-o ('I carry'), cap-io ('I take'), fac-io ('I do'), and fac-ere ('to do') in the training data, it will create nodes and edges as depicted in Figure 1. Finally, after the original graph has been constructed, MOM recursively merges nodes whose edge overlap is above a user-provided threshold (e.g. 50%; see Figure 2), in order to discover position classes for affixes and inflectional classes for stems.
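A simplified sketch of the merge step, assuming that 'edge overlap' is measured as the Jaccard overlap of two nodes' edge sets; MOM's actual criterion and merge order may differ. On the Latin toy data, it reproduces the classes in Figure 2.

def overlap(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def merge_classes(edges, threshold=0.5):
    """Greedily merge nodes whose edge sets overlap at or above threshold."""
    classes = [({node}, set(out)) for node, out in edges.items()]
    merged = True
    while merged:
        merged = False
        for i in range(len(classes)):
            for j in range(i + 1, len(classes)):
                if overlap(classes[i][1], classes[j][1]) >= threshold:
                    classes[i] = (classes[i][0] | classes[j][0],
                                  classes[i][1] | classes[j][1])
                    del classes[j]
                    merged = True
                    break
            if merged:
                break
    return classes

# Stem -> suffix edges from the toy data in Figure 1.
stems = {'am': {'-are'}, 'port': {'-are', '-o'},
         'cap': {'-io'}, 'fac': {'-io', '-ere'}}
for nodes, suffixes in merge_classes(stems):
    print(sorted(nodes), '->', sorted(suffixes))
# ['am', 'port'] -> ['-are', '-o']
# ['cap', 'fac'] -> ['-ere', '-io']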

4.2 Case and Valence Inference

Whereas MOM creates lexical classes based on morphotactics, the inference system described in this section creates lexical classes for verbs based on their valence, in two steps: we first infer the overall case system of the language, and second infer whether each verb in the training data is transitive or intransitive. We then create lexical classes that specify the argument structure and the case requirements on those arguments.


We begin by predicting the overall case system, using the methodology developed by Bender et al. (2013) and updated by Howell et al. (2017). We reimplement Bender et al.'s gram method, which counts the case grams in the corpus and uses a heuristic to assign nominative-accusative, ergative-absolutive, split-ergative or none to the language based on the relative frequencies of the case grams. We apply one change to this system: because the split-ergative prediction does not give any information regarding the nature of the split (e.g. whether it is split according to features of nouns or verbs), we map the split-ergative prediction (which the system makes for Chintang) to ergative-absolutive.

To determine the transitivity of verbs in the corpus, we take advantage of the English translation in the IGT. The dataset, which has been partially enriched by INTENT (Georgi, 2016),6 includes parse trees and POS tags for the English translation. Furthermore, the Chintang words in the dataset are annotated for part of speech by the authors of the corpus. To infer transitivity using this information, we first do a string comparison between the gloss of the Chintang verb and each word in the language line to identify the English verb that it corresponds to. If no match is found, for example when the verb is glossed with have but the English translation contains get instead, the verb is skipped. However, if a match is found, we traverse the English parse tree to check whether the V is sister to an NP. If so, the verb is classified as transitive; otherwise it is classified as intransitive. We exclude passive sentences from consideration; however, under this algorithm, verbs that take a verbal complement are classified as intransitive. We leave further fine-tuning in this respect to future work. Finally, once we have the case system and the transitivity for each verb, we assign the verb's case frame according to a heuristic specific to the case system. In the case of ergative-absolutive, we specify ergative case on the subject of transitive verbs and absolutive on the subject of intransitive verbs and the object of transitive verbs.

6 For this particular dataset, INTENT was not able to produce bilingual alignments. This means that we were not able to take advantage of word-to-word mappings.
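A sketch of the transitivity test, assuming the English translation's parse is available as an NLTK tree with Penn Treebank-style labels; the real system reads INTENT's enrichment out of the Xigt data instead.

from nltk import Tree

def is_transitive(parse, verb):
    """True if a VB* node containing `verb` has an NP sister in the parse."""
    for sub in parse.subtrees():
        kids = list(sub)
        for i, kid in enumerate(kids):
            if (isinstance(kid, Tree) and kid.label().startswith('VB')
                    and verb in kid.leaves()):
                siblings = kids[:i] + kids[i + 1:]
                return any(isinstance(s, Tree) and s.label() == 'NP'
                           for s in siblings)
    return False

tree = Tree.fromstring(
    '(S (NP (DT The) (NN brother)) (VP (VBD took) (NP (PRP it))))')
print(is_transitive(tree, 'took'))  # True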

4.3 Extensions to Morphological Analysis

We modified the MOM system by extending the data structure that it expects as input with fields for transitivity and case frame, and by making it check these fields every time it considers which lexical class to assign to an item (we describe this process in more detail in §4.4).

We added functionality so that MOM infers case lexical rules for nouns. Next, we added functions to collapse all homonym stems in each class into one stem with a disjunctive predication (e.g. stem bekti, predication _youngest-son-or-youngest-male-sibling_n_rel; see the sketch below).7 Furthermore, we improved the way MOM deals with stems that occur bare in the data. Previously, MOM put all bare stems into a lexical class which could not merge with inflecting classes later, even if the same stem occurred with affixes in other portions of the data. Fixing this problem led to fewer lexical classes in the grammar, which means better coverage potential. Finally, we made the necessary additions to MOM's output format so that all relevant information for each lexical class, including case features and transitivity values, is encoded into a valid choices file that can be customized by the Grammar Matrix.
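A sketch of the homonym-collapsing step on the bekti example above; the string manipulation of predications is illustrative, not MOM's actual code.

from collections import defaultdict

def collapse_homonyms(entries):
    """(orthography, predication) pairs -> one entry per orthography."""
    by_orth = defaultdict(list)
    for orth, pred in entries:
        by_orth[orth].append(pred)
    collapsed = {}
    for orth, preds in by_orth.items():
        if len(preds) == 1:
            collapsed[orth] = preds[0]
        else:
            # predications look like '_gloss_n_rel'; join the glosses with
            # '-or-' and keep the shared part-of-speech suffix
            glosses = [p.strip('_').rsplit('_', 2)[0] for p in preds]
            suffix = preds[0].rsplit('_', 2)[1:]
            collapsed[orth] = '_' + '-or-'.join(glosses) + '_' + '_'.join(suffix)
    return collapsed

entries = [('bekti', '_youngest-son_n_rel'),
           ('bekti', '_youngest-male-sibling_n_rel')]
print(collapse_homonyms(entries))
# {'bekti': '_youngest-son-or-youngest-male-sibling_n_rel'}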

4.4 Integrating Inference Systems

We integrated the systems described in §4.2 and §4.3 such that lexical types are defined according to the output of both systems. We expect grammars produced in this fashion to outperform either of the kinds of grammars produced by Bender et al. (2014), because our integrated verb classes contain both morphological and valence constraints, rather than leaving one of these categories underspecified.

We run the syntactic inference system before MOM, so that transitivity and case frame are included in MOM's input. While MOM builds lexical classes based on morphology, it also checks transitivity and case frame for compatibility before adding an item to a lexical class. Verbs for which there is no case frame prediction are classified in a 'dummy' category, rather than being thrown out, in order to maximize the morphological patterns that can be detected by MOM. That is, we want to include these verbs in the graph during the morphological inference process.8 However, without constraints on their valence, their inclusion in the final grammar would result in spurious ambiguity. Therefore, we allow these verbs in the grammar, but change their orthography to contain a non-Chintang string, so that they are effectively excluded from the working grammar unless they are merged into a verb class with valid valence constraints.

7 This follows the conventions of Minimal Recursion Semantics; see Copestake et al. 2005 and Flickinger et al. 2014.
8 As will be mentioned in §5.2.4, we also allow MOM to merge verbs with no valence information into lexical classes with valence constraints, based on morphological patterns.


Grammar spec     | Origin              | Description
oracle           | Bender et al. 2012b | Expert-constructed
baseline         | Bender et al. 2014  | Full-form lexicon, free word order (WO), no case
mom-default-none | Bender et al. 2014  | MOM-inferred lexicon, free WO, no case
ff-auto-gram     | Bender et al. 2014  | Full-form lexicon, V-final WO, erg-abs case system
integrated       | This paper          | MOM-inferred lexicon, case frames, free WO, erg-abs

Table 2: Grammars used in evaluation


5 Case Study

So far we have described a methodology that extracts both syntactic and morphological information from IGT data and outputs this information in a format compatible with the Grammar Matrix customization system. Now we present a case study of this system on a Chintang dataset, described in §5.1. We describe our comparison grammars as well as the output of our system (§5.2) and the grammar development and parsing process (§5.3), before presenting the results in §6 and the discussion in §7.

5.1 Dataset

For our experiment, we use the same subset of the Chintang Language Resource Program (CLRP)9 corpus as Bender et al. (2014), which contains 10862 instances of Interlinear Glossed Text. These IGT comprise narratives and other recorded speech that were transcribed, translated and glossed by the CLRP. Example (1) illustrates the thorough glossing of IGT in this corpus. However, it is noteworthy, especially for the purposes of inference, that syntactic characteristics that do not correspond to an overt morpheme (such as sg and abs) are not glossed in this data. We use the same train (8863 IGT), dev (1069 IGT) and test (930 IGT) splits as Bender et al. (2014).

(1) unisaŋa                      khatte       mo       kosiʔ     moba
    u-nisa-ŋa                    khatt-e      mo       kosi-iʔ   mo-pe
    3sposs-younger.brother-erg.a take-ind.pst dem.down river-loc dem.down-loc
    'The younger brother took it to the river.' [ctn] (Bickel et al., 2013)

9 https://www.uzh.ch/clrp/

section=word-order
word-order=free
has-dets=no
...
section=case
case-marking=erg-abs
erg-abs-erg-case-name=erg
erg-abs-abs-case-name=abs
case1_name=loc
case2_name=dat
...

...
verb1029_valence=intrans
verb1029_stem2_pred=_cry_v_rel
verb1029_stem2_orth=ratt
...
verb-pc6_inputs=verb-pc1, verb1029
verb-pc6_lrt1_lri1_orth=-a
verb-pc6_lrt1_lri1_inflecting=yes

Figure 3: Excerpts from a choices file

We use a version of the training portion of the data that has been converted to the Xigt data model. The converter skipped 169 IGT, resulting in a slightly smaller training set than that used by Bender et al. (2014). This dataset was enriched with English parses and part-of-speech tags using INTENT (Georgi, 2016).

5.2 Grammar Specification

The output of the morphotactic and syntactic inference scripts described earlier in this section is encoded in a 'choices file', illustrated in Figure 3, which is the suitable input to the Grammar Matrix customization system (Bender et al., 2010). In this subsection, we describe the choices files we developed for the baseline, the oracle and our inference system, as well as the comparison choices files from the 2014 experiment.10

5.2.1 Oracle Choices File

Our first point of comparison is a manually constructed 'oracle' choices file, from Bender et al. 2012b.

10 For this comparison, we regenerated the baseline and the oracle grammars from their choices files. For mom-default-none and ff-auto-gram we worked from the same parsing results analyzed in Bender et al. 2014.


This choices file was developed by importing CLRP's Toolbox lexicon and defining the rest of the specifications by hand, based on linguistic analysis. As a result, the grammar produced by this choices file is expected to have very high precision, with moderate recall. It specifies the word order as verb-final (corresponding to the most frequent word orders observed by Schikowski et al. (2015)) and the case system as ergative-absolutive, and it defines both subject and object dropping as possible for all verbs. It also includes hand-specified lexical rules based on the analysis in Schikowski 2012.

Because the Grammar Matrix only included simple transitive and intransitive verbs at the time this choices file was developed, only those two classes were defined. Their case frames were specified by hand such that the intransitive verb class has an absolutive subject and the transitive verb class has an ergative subject and an absolutive object. Finally, because the Grammar Matrix did not support adjectives, adverbs, and many other lexical categories, the resulting grammar is not expected to parse any sentence containing those lexical categories. Conveniently, the limitations of the grammar defined by Bender et al. (2012b) make it a good comparison to the grammar we are able to produce using only the inference described in this paper.11

5.2.2 Baseline Choices File

As a second point of comparison, we use a choices file that is designed to create a working grammar with a sufficient lexicon but that is otherwise naïve with respect to the particular grammatical characteristics of Chintang. We take this choices file from Bender et al. (2014), but make some modifications. The lexicon was extracted according to the methodology of Xia and Lewis (2007), defining bare-bones classes for nouns, verbs and determiners. Because the inference system we use in this paper does not consider determiners, we removed them from the baseline choices file. Finally, the baseline predicts free word order, no case system and argument dropping for subjects and objects of all verbs, because these choices are expected to result in the highest coverage (though low precision) for any language.12

11 We do still have to add some default choices to our inferred grammar, as described in §5.2.4.
12 It is incidental that these word order and argument dropping predictions are arguably valid analyses for this language.

5.2.3 Bender et al. 2014 choices files

Our final comparison is to the results for the choices files created by Bender et al. (2014) that are most comparable to our present work. We look at their mom-default-none and ff-auto-gram grammars, which represent the two types of inference that we have integrated.

The mom-default-none choices file uses a lexicon produced by MOM and the default choices for word order (free), case system (none), and argument optionality (all arguments are optional). The ff-auto-gram choices file uses a lexicon of full word forms, as in the baseline, but with case frames for each verb as observed in the data. Its word order (V-final) and case system (ergative-absolutive) were inferred using syntactic inference.

5.2.4 Integrated Inference Output

We produced the inferred choices file by extracting the lexicon, case system and morphological rules, as described in §4.1, §4.2 and §4.3. We ran the inference system on the training data, debugging and selecting an overlap value by parsing the dev data with the resulting grammars. Our system predicted ergative-absolutive case and a robust lexicon and set of morphological rules. We used the default word order (free) and default argument optionality choices (any verb can drop its subject or object) from Bender et al. 2014. We also added the necessary choices for the 'dummy verb' category described in §4.4.

Choosing the overlap value. We ran MOM with 10 overlap values from 0.1 to 1.0. Then we parsed sentences from the development set to identify the grammar that optimizes both coverage and ambiguity (a sketch of this sweep is given after example (2) below). For these data, the optimum overlap value was 0.2.13 We inspected the differences in coverage and verified that they fall into three categories. First, some lexical entries which originally lacked a valence frame (the inference script was not able to assign one) successfully merged into lexical classes which did have a valid valence frame.14 Second, some entries happened to merge into only an intransitive class in one grammar but only into a transitive class in another.

13 The 0.1 grammar had marginally lower ambiguity (2.76 fewer parses on average), but it parsed two fewer sentences.
14 We assume that if a verb without a known valence frame has similar morphotactic possibilities to a verb with a known valence frame, it has the same valence frame.


Finally, upon merging into lexical classes, lexical entries gained access to morphological patterns which were unseen in the training data. This happens with both noun and verb lexical entries. For example, not all of the grammars we produced were able to produce a parse tree for (2), because they did not merge the stem sil with a class that is compatible with the prefix u-.

(2) u-sil-u-set-kV-ce
    3sa-bite.and.pull.out-3p-destr.tr-3p-ind.npst-3nsp
    'They snatch and kill them' [ctn] (Bickel et al., 2013)
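The overlap-value selection described in §5.2.4 amounts to a small grid search. In this sketch, evaluate is a hypothetical stand-in for the full MOM + Grammar Matrix + LKB pipeline, returning (sentences parsed, average readings) on the dev data; the toy numbers mirror the 0.1-versus-0.2 tie-break in footnote 13.

def sweep_overlap(evaluate, values=None):
    """Pick the overlap value maximizing dev coverage; break ties by ambiguity.

    evaluate(v) -> (sentences_parsed, average_readings) for the grammar
    produced with overlap value v.
    """
    values = values or [x / 10 for x in range(1, 11)]
    return max(values, key=lambda v: (evaluate(v)[0], -evaluate(v)[1]))

# Toy numbers: 0.1 is marginally less ambiguous but parses two fewer
# sentences, so 0.2 wins on coverage.
fake = {0.1: (30, 88.80), 0.2: (32, 91.56)}
print(sweep_overlap(lambda v: fake.get(v, (0, 0.0))))  # 0.2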

5.3 Grammar customization and parsing

After producing the choices files, we used the Grammar Matrix customization system15 to produce customized grammars. We loaded these into the LKB parsing software (Copestake, 2002) and used [incr tsdb()] (Flickinger and Oepen, 1998) to create stored profiles of syntactic and semantic parses. We treebanked these parses using [incr tsdb()] to identify correct parses, that is, parses that produce the desired semantic representation for the sentence. In particular, we checked to make sure that the predicate-argument structure matched what was indicated in the gloss, but did not require information such as negation, person and number, or tense and aspect (all morphologically marked in Chintang), as our system does not yet attempt to extract these.

6 Results

To put the results of parsing the Chintang strings in context, we first describe the produced grammars in terms of their size. Table 3 reports the size of the lexicon and the number of affixes of each grammar. The oracle grammar's lexicon includes the imported Toolbox lexicon from CLRP, so it includes many more stems than the others. The baseline and ff-auto-gram lexicons include full-form lexical entries,16 while the grammar produced by our integrated system has a lexicon of stems extracted from the training data. mom-default-none only did morphological analysis on verbs; for nouns it includes full-form entries.

15 svn://lemur.ling.washington.edu/shared/matrix/trunk at revision 41969
16 Note that the ff-auto-gram grammar only included verbs for which a case frame could be predicted.

The oracle grammar has a number of morphological rules for nouns and verbs that were hand-crafted, while the lexical rules of mom-default-none and integrated were extracted from the training data.

The results are reported in Table 4. 'Lexical coverage' is the number of sentences for which the grammar could produce an analysis (via a full-form lexical entry or morphological rules) for every word form. These numbers are quite small because there are no lexical entries for categories other than nouns and verbs. 'Parsed' shows the number of sentences in the test data that received some spanning syntactic analysis, and 'correct' the number of items for which there was a correct parse, according to manual inspection and treebanking. Finally, we report the number of readings, which shows the degree of ambiguity in the grammar. Our integrated system had the highest coverage as well as the highest correct coverage, but also the most ambiguity.

7 Discussion

We expect integrated to have broader coverage than ff-auto-gram and baseline because it includes morphological rules allowing it to generalize to unseen entries. We also expect integrated to have higher precision (a higher proportion of correct parses) than mom-default-none because, unlike mom-default-none, it integrates case frames, which can rule out spurious analyses. We also expect it to have higher coverage than mom-default-none because it includes inferred morphological rules for nouns (in addition to verbs). Though the absolute numbers are small, these predictions are borne out by the data in Table 4.

To get a better sense of the differences between the systems, we performed an error analysis. We looked at all the parsed items (not just the treebanked ones) to get a broader view into the behavior of the grammars. This section provides the results of our analysis as well as some exploration of the higher ambiguity found with integrated.

7.1 Error Analysis

As expected (and as with the other grammars), the parsing errors for integrated are due to either lexical or syntactic failures. For 825 items, the parser did not succeed in lexical analysis. In principle, this includes both missing stem or affix forms and failures due to the grammar's inability to construct the morphological pattern even though all morphemes are found in the grammar.


Choices file     | # verb entries | # noun entries | # verb affixes | # noun affixes
oracle           | 899            | 4750           | 233            | 36
baseline         | 3005           | 1719           | 0              | 0
ff-auto-gram     | 739            | 1724           | 0              | 0
mom-default-none | 1177           | 1719           | 262            | 0
integrated       | 911            | 1755           | 220            | 76

Table 3: Amount of lexical information in each choices file

Choices file     | lexical coverage (%) | parsed (%) | correct (%) | readings
oracle           | 116 (12.5)           | 20 (2.2)   | 10 (1.1)    | 1.35
baseline *       | 38 (4.1)             | 15 (1.6)   | 8 (0.9)     | 27.67
ff-auto-gram     | 18 (1.9)             | 4 (0.4)    | 2 (0.2)     | 5.00
mom-default-none | 39 (4.2)             | 16 (1.7)   | 3 (0.3)     | 10.81
integrated       | 105 (11.3)           | 32 (3.4)   | 15 (1.6)    | 91.56

* We report slightly different results for lexical coverage and average readings for the baseline than Bender et al. (2014) because we removed determiners from the choices file.

Table 4: Results on 930 held-out sentences

We examined a sample of 50 such items and found only instances of missing stems and affixes, and no failed combinatorics. The remaining 73 errors are accounted for on the syntactic level. These break into three categories: (1) both a V/VP and an NP could be formed, but the NP had a case marker incompatible with the verb's case frame (e.g. locative; 6 items of this kind in total);17 (2) a sentence did not contain a word which could be analyzed as a verb by the grammar (only as a noun, e.g. a sentence fragment; 23 in total); (3) the sentence was complex, i.e., contained more than one verb. This third category was the most common (44 in total), as the grammar does not include subordination or coordination rules.

We also compared our results on the held-out data to the baseline and the oracle grammars. While integrated outperforms both baseline and oracle, baseline and oracle each parse some sentences that integrated does not.

Integrated vs. oracle. Comparison between oracle and integrated yields 130 differing results. Most of these are due to differences in lexical analysis. In particular, there are 55 items which fail due to lexical analysis in integrated but fail due to syntactic analysis in oracle. In all of these cases, our grammar lacked a lexical entry that the oracle grammar had; this is expected, as the oracle lexicon is based on a different source. There are 38 items for which oracle cannot provide a lexical analysis and integrated can, but fails at the syntactic stage. Most of these are due to missing stems and affixes in the oracle grammar, but for one item oracle actually lacks the required affix ordering, which integrated picks up from the training data. In addition, there are 11 cases where oracle fails at lexical analysis and integrated succeeds at both lexical and syntactic analysis. In two of those eleven cases, integrated outperforms oracle due to the robustness of its morphological rules, not its lexicon. In contrast, there are 3 items on which integrated fails lexically and oracle gives a parse, all due to lexicon differences. In 6 cases, oracle fails at the syntactic stage where integrated succeeds. Of these, one was rejected in treebanking,18 two are true wins due to morphotactic inference for nouns, two are because oracle only has a noun entry for something which integrated picked up as a verb, and one is parsed by integrated because it admits head-initial word orders while oracle insists on V-final. Conversely, oracle can parse one item for which integrated can perform lexical analysis but fails to parse: the sentence is cor cor, and oracle has both a verb and a noun entry for the word cor while integrated does not.

17 What is missing here is a grammar rule that handles e.g. locative NPs functioning as modifiers.

Integrated vs. baseline. The difference between integrated and baseline is mainly due to lexical coverage.

18 If none of the parses has a structure and a semantic representation that are meaningful with respect to the translation, all parses for the item are rejected.


[Figure 4: two parse trees for example (3), one analyzing mai-yuŋ-th-a-k-e as an intransitive verb with cuwa as its subject, the other as a transitive verb with cuwa as its object and a dropped argument]

Figure 4: Analyses like the ones above (for (3)) use homonymous intransitive (left) and transitive (right) verb entries, the latter with a dropped argument. Together with the homonymous verb position classes, this produces ambiguity.

The baseline grammar, featuring full-form lexical entries, can lexically analyze 6 items that integrated cannot, while integrated analyzes 22 items that baseline cannot. Neither wins at syntactic analysis.19

7.2 Ambiguity

integrated produces noticeably more trees per sentence, on average, than either the oracle grammar or the baseline grammar (see the 'readings' column in Table 4). The baseline's low ambiguity is not striking, because it only parses a few syntactically and morphologically simple sentences. Still, on some items the baseline grammar produces hundreds of trees. The oracle grammar clearly has less ambiguity. The main reason for the greater ambiguity of integrated is simple combinatorics. We infer noun and verb inflectional and position classes, and often end up with homophonous affixes as well as homophonous transitive and intransitive entries for verbs. For example, the verb yuŋ, meaning 'sit', 'be' or 'squat', is associated with both a transitive and an intransitive entry in integrated. Figure 4 illustrates two of the trees that this grammar finds for (3).

(3) cuwa  mai-yuŋ-th-a-k-e
    water neg-be.there-neg-pst-ipfv-ind.pst
    'Was there water?'

These homophonous lexical resources combine with the argument optionality rules, word order flexibility and the actual complexity of Chintang morphotactics to create a large search space for the parser.

19 A few of these differences are due to the fact that baseline includes pronouns and other word categories (such as interjections) as 'nouns' in the lexicon. We built integrated assuming that only items marked as nouns go in. In future work, we will include pronouns (but not interjections).

7.3 Future work

The default setting in MOM is to produce morphological rules which are optional. Furthermore, MOM does not yet infer non-inflecting lexical rules. This means that uninflected forms are passed through to the syntax without being associated with the morphosyntactic or morphosemantic information that the zero in the paradigm actually reflects. In future work, we will explore how to automatically posit such zero-marked rules, including how to make sure that their position classes are required, so that the grammar can properly differentiate 'zero' and 'uninflected'.

We plan to extend our syntactic inference algorithm to account for verbs with alternate or 'quirky' case frames. Another avenue that our error analysis shows to be particularly promising is handling complex clauses, as there are tools to model coordination (Drellishak and Bender, 2005) and subordination (Howell and Zamaraeva, 2018; Zamaraeva et al., 2019) in the Grammar Matrix framework.

8 Conclusion

We have demonstrated the value of integrating morphological and syntactic inference for automatic grammar development. Although inferring these properties is most easily handled separately, we show that combining information about morphotactic and inflectional patterns with syntactic properties such as transitivity improves coverage. While this study looked at case and transitivity, the benefits of creating lexical classes that encode syntactic information alongside morphological information should generalize. This methodology can be extended to other linguistic phenomena at the morphosyntactic interface, such as agreement, and the coverage of grammars can be further extended by expanding the lexical classes and clause types that can be inferred from the syntax. In the future, we would like to perform further in-depth studies in collaboration with documentary linguists, for example to see whether our system can help refine the analysis of morphological classes in the lexicon of the language in question and whether a grammar fragment automatically produced this way can be easily extended to broader coverage.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. BCS-1561833. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

We would like to thank Alex Burrell for providing code from a previous project that we built on in order to infer transitivity.

We thank previous AGGREGATION research assistants for converting the corpus into Xigt and enriching it with INTENT.

References

Emily M. Bender, Dan Flickinger, and Stephan Oepen. 2002. The Grammar Matrix: An open-source starter-kit for the rapid development of cross-linguistically consistent broad-coverage precision grammars. In John Carroll, Nelleke Oostdijk, and Richard Sutcliffe, editors, Proceedings of the Workshop on Grammar Engineering and Evaluation at the 19th International Conference on Computational Linguistics, pages 8–14, Taipei, Taiwan.

Emily M. Bender, Scott Drellishak, Antske Fokkens, Laurie Poulson, and Safiyyah Saleem. 2010. Grammar customization. Research on Language & Computation, 8:1–50. ISSN 1570-7075.

Emily M. Bender, Dan Flickinger, and Stephan Oepen. 2011. Grammar engineering and linguistic hypothesis testing: Computational support for complexity in syntactic analysis. In Emily M. Bender and Jennifer E. Arnold, editors, Language from a Cognitive Perspective: Grammar, Usage and Processing, pages 5–29. CSLI Publications, Stanford, CA.

Emily M. Bender, Sumukh Ghodke, Timothy Baldwin, and Rebecca Dridan. 2012a. From database to treebank: Enhancing hypertext grammars with grammar engineering and treebank search. In Sebastian Nordhoff and Karl-Ludwig G. Poggeman, editors, Electronic Grammaticography, pages 179–206. University of Hawaii Press, Honolulu.

Emily M. Bender, Robert Schikowski, and Balthasar Bickel. 2012b. Deriving a lexicon for a precision grammar from language documentation resources: A case study of Chintang. Proceedings of COLING 2012, pages 247–262.

Emily M. Bender, Michael Wayne Goodman, Joshua Crowgey, and Fei Xia. 2013. Towards creating precision grammars from interlinear glossed text: Inferring large-scale typological properties. In Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 74–83, Sofia, Bulgaria, August 2013. Association for Computational Linguistics. http://www.aclweb.org/anthology/W13-2710.

Emily M. Bender, Joshua Crowgey, Michael Wayne Goodman, and Fei Xia. 2014. Learning grammar specifications from IGT: A case study of Chintang. In Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 43–53, Baltimore, Maryland, USA, June 2014. Association for Computational Linguistics. http://www.aclweb.org/anthology/W14-2206.

Balthasar Bickel, Manoj Rai, Netra Paudyal, Goma Banjade, Toya Nath Bhatta, Martin Gaenszle, Elena Lieven, Iccha Purna Rai, Novel K. Rai, and Sabine Stoll. 2010. The syntax of three-argument verbs in Chintang and Belhare (Southeastern Kiranti). Studies in ditransitive constructions: a comparative handbook, pages 382–408.

Balthasar Bickel, Martin Gaenszle, Novel Kishore Rai, Vishnu Singh Rai, Elena Lieven, Sabine Stoll, G. Banjade, T. N. Bhatta, N. Paudyal, J. Pettigrew, I. P. Rai, and M. Rai. 2013. Tale of a poor guy. https://corpus1.mpi.nl/qfs1/media-archive/dobes_data/ChintangPuma/Chintang/Narratives/Annotations/phengniba_tale.tbt. Accessed: 22 October 2018.

Ann Copestake. 2002. Implementing typed feature structure grammars. CSLI Publications, Stanford.

Ann Copestake, Dan Flickinger, Carl Pollard, and Ivan A. Sag. 2005. Minimal recursion semantics: An introduction. Research on Language and Computation, 3(2-3):281–332.

Scott Drellishak and Emily Bender. 2005. A coordination module for a crosslinguistic grammar resource. In International Conference on Head-Driven Phrase Structure Grammar, volume 12, pages 108–128. http://web.stanford.edu/group/cslipublications/cslipublications/HPSG/2005/drellishak-bender.pdf.

Dan Flickinger, Emily M. Bender, and Stephan Oepen. 2014. ERG semantic documentation. http://www.delph-in.net/esd. Accessed on 2018-10-22.

Daniel P. Flickinger and Stephan Oepen. 1998. Towards systematic grammar profiling. Test suite technology ten years after. In Beiträge zur 4. Fachtagung der Sektion Computerlinguistik der DGfS, Heidelberg.

Antske Sibelle Fokkens. 2014. Enhancing Empirical Research for Linguistically Motivated Precision Grammars. PhD thesis, Department of Computational Linguistics, Universität des Saarlandes.


Ryan Georgi. 2016. From Aari to Zulu: Massively Multilingual Creation of Language Tools Using Interlinear Glossed Text. PhD thesis, University of Washington.

Michael Wayne Goodman, Joshua Crowgey, Fei Xia, and Emily M. Bender. 2015. Xigt: Extensible interlinear gloss text for natural language processing. Language Resources and Evaluation, 49(2):455–485.

Lars Hellan and Dorothee Beermann. 2014. Inducing grammars from IGT. In Z. Vetulani and J. Mariani, editors, Human Language Technology Challenges for Computer Science and Linguistics, volume 8287 of LTC 2011, Lecture Notes in Computer Science. Springer.

Kristen Howell and Olga Zamaraeva. 2018. Clausal modifiers in the Grammar Matrix. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2939–2952.

Kristen Howell, Emily M. Bender, Michael Lockwood, Fei Xia, and Olga Zamaraeva. 2017. Inferring case systems from IGT: Impacts and detection of variable glossing practices. ComputEL-2, pages 67–75.

Stefan Müller. 1999. Deutsche Syntax deklarativ: Head-Driven Phrase Structure Grammar für das Deutsche. Number 394 in Linguistische Arbeiten. Max Niemeyer Verlag, Tübingen.

Stefan Müller. 2015. The CoreGram project: Theoretical linguistics, theory development and verification. Journal of Language Modelling, 3(1):21–86. https://hpsg.hu-berlin.de/~stefan/Pub/coregram.html.

Stefan Müller. 2016. Grammatical theory: From transformational grammar to constraint-based approaches. Language Science Press.

Robert Schikowski. 2012. Chintang morphology. Unpublished ms., University of Zürich.

Robert Schikowski, N. P. Paudyal, and Balthasar Bickel. 2015. Flexible valency in Chintang. Valency Classes: a Comparative Handbook.

Sabine Stoll and Balthasar Bickel. 2012. How to measure frequency? Different ways of counting ergatives in Chintang (Tibeto-Burman, Nepal) and their implications. In Frank Seifart, Geoffrey Haig, Nikolaus P. Himmelmann, Dagmar Jung, Anna Margetts, and Paul Trilsbeek, editors, Potentials of Language Documentation: Methods, Analyses and Utilization, pages 83–89. University of Hawai'i Press.

David Wax. 2014. Automated grammar engineering for verbal morphology. Master's thesis, University of Washington.

Fei Xia and William D. Lewis. 2007. Multilingual structural projection across interlinear text. In Proc. of the Conference on Human Language Technologies (HLT-NAACL 2007), pages 452–459, Rochester, New York.

Olga Zamaraeva. 2016. Inferring morphotactics from interlinear glossed text: combining clustering and precision grammars. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 141–150.

Olga Zamaraeva, František Kratochvíl, Emily M. Bender, Fei Xia, and Kristen Howell. 2017. Computational support for finding word classes: A case study of Abui. In Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 130–140.

Olga Zamaraeva, Kristen Howell, and Emily M. Bender. 2019. Modeling clausal complementation for a grammar engineering resource. In Proceedings of the Society for Computation in Linguistics, volume 2, Article 6. https://scholarworks.umass.edu/scil/vol2/iss1/6/.


Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages: Vol. 1 Papers, pages 39–45, Honolulu, Hawai‘i, USA, February 26–27, 2019.

Finding Sami Cognates with a Character-Based NMT Approach

Mika Hämäläinen
Department of Digital Humanities
University of Helsinki
[email protected]

Jack Rueter
Department of Digital Humanities
University of Helsinki
[email protected]

Abstract

We approach the problem of expanding the set of cognate relations with a sequence-to-sequence NMT model. The language pair of interest, Skolt Sami and North Sami, has too limited a set of parallel data for an NMT model as such. We solve this problem, on the one hand, by training the model with North Sami cognates with other Uralic languages and, on the other, by generating more synthetic training data with an SMT model. The cognates found using our method are made publicly available in the Online Dictionary of Uralic Languages.

1 Introduction

Sami languages have received a fair share of interest in the purely linguistic study of cognate relations. Although various schools of Finno-Ugric studies have postulated contrastive interpretations of where the Sami languages should be located within the language family, there is strong evidence demonstrating regular sound correspondence between Samic and Balto-Finnic, on the one hand, and Samic and Mordvin, on the other. The importance of this correspondence is accentuated by the fact that Samic might provide insight into second-syllable vowel quality, as not all Samic-Mordvin vocabulary is attested in Balto-Finnic (cf. Korhonen, 1981). The Sami languages themselves (there are seven written languages) also exhibit regular sound correspondence, even though cognates, at times, may be opaque to the layman. One token of cognate relation studies is the Álgu database (Kotus, 2006), which contains a set of inter-Sami cognates. Cognates have applicability in NLP research for low-resource languages, as they can, for instance, be used to induce predicate-argument structures from bilingual vector spaces (Peirsman and Padó, 2010).

The main motivation for this work is to extend the known cognate information available in the Online Dictionary of Uralic Languages (Hämäläinen and Rueter, 2018). This dictionary, at its current stage, only has the cognate relations recorded in the Álgu database.

Dealing with true cognates in a non-attested hypothetical proto-language presupposes adherence to a set of sound correlations posited by a given school of thought. Since Proto-Samic is one such language, we have taken the liberty of interpreting the term cognate in the context of this paper more broadly: not only words that share the same hypothesized origin in Proto-Samic are considered cognates (henceforth: true cognates), but also items that might be deemed loan words acquired from another language at separate points in the temporal-spatial dimensions. This more permissive definition makes the problem computationally easier to tackle, given the limitations imposed by the scarcity of linguistic resources.

Our approach does not presuppose semantic similarity in the meaning of the cognate candidates, but rather explores cognate possibilities based on grapheme changes. The key idea is that the system can learn what kinds of changes are possible and typical for North Sami cognates with other Uralic languages in general. Leveraging this more general knowledge, the model can then learn the cognate features between North Sami and Skolt Sami more specifically.

We assimilate this problem to that of normalizing historical spelling variants. On a higher level, historical variation within one language can be seen as discovering cognates of different temporal forms of the language. Therefore, we want to apply the work done in that vein for the first time in the context of cognate detection.


Using NMT (neural machine translation) on the character level has been shown by a recent study with historical English to be the single most accurate method for normalization (Hämäläinen et al., 2018).

In this paper, we use NMT in a similar character-level fashion for finding cognates. Furthermore, due to the limited availability of training data, we present an SMT (statistical machine translation) method for generating more data to boost the performance of the NMT model.

2 Related Work

Automatic identification of cognates has received a fair share of interest in the past from different methodological standpoints. In this section, we will go through some of these approaches.

Ciobanu and Dinu (2014) propose a method based on orthographic alignment, that is, a character-level alignment of cognate pairs. After the alignment, the mismatches around the aligned pairs are used as features for the machine learning algorithm.

Another take on cognate detection is that of Rama (2016). This approach employs Siamese convolutional networks to learn phoneme-level representations and language relatedness of words. The study is based on Swadesh lists and uses hand-written phonetic features and one-hot encoding for the phonetic representation.

Cognate detection has also been done by looking at features such as semantics, phonetics and regular sound correspondences (St. Arnaud et al., 2017). Their approach implements a general model and language-specific models using support vector machines (SVM).

Rama et al. (2017) present an unsupervised method for cognate identification. The method consists of extracting suitable cognate pairs with normalized Levenshtein distance, aligning the pairs and computing a point-wise mutual information score for the aligned segments. New sets of alignments are generated, and the process of aligning and scoring is repeated until there are no changes in the average similarity score.
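As a concrete illustration of the pair-extraction step, the sketch below computes the normalized Levenshtein distance used for pre-selecting candidate pairs; the threshold shown is our own illustrative assumption, not a value from Rama et al. (2017).

    # Minimal sketch: normalized Levenshtein distance for candidate
    # cognate pre-selection (the threshold is an illustrative assumption).

    def levenshtein(a: str, b: str) -> int:
        # Classic dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def normalized_levenshtein(a: str, b: str) -> float:
        # Normalize by the longer word so that scores are comparable
        # across word lengths (0.0 = identical, 1.0 = maximally far).
        return levenshtein(a, b) / max(len(a), len(b), 1)

    # Example: North Sami 'giehta' vs Skolt Sami 'kiõtt' ('hand').
    if normalized_levenshtein("giehta", "kiõtt") < 0.7:
        print("plausible candidate pair")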

3 Finding Cognates

In this section, we describe our proposed approach to finding cognates between North Sami and Skolt Sami. We present the dataset used for training and an SMT approach to generating more training data.

3.1 The Data

Our training data comes from Álgu (Kotus, 2006), an etymological database of the Sami languages. From this database, we use all the cognate relations recorded for North Sami to all the other Finno-Ugric languages in the database. This produces a parallel dataset of North Sami words and their cognates in other languages.

The North Sami to other languages parallel dataset consists of 32,905 parallel words, of which 2,633 items represent the correlations between North Sami and Skolt Sami.

We find cognates for nouns, adjectives, verbs and adverbs recorded in the Giellatekno dictionaries (Moshagen et al., 2013) for North Sami and Skolt Sami. These dictionaries serve as input for the trained NMT model and for filtering the output produced by the model.

3.2 The NMT Model

For the purpose of our research, we use OpenNMT (Klein et al., 2017) to train a character-based NMT model that takes a Skolt Sami word as its input and produces a potential North Sami cognate as its output. We use the default settings for OpenNMT.1

We train a sequence-to-sequence model with the list of known cognates in other languages as the source data and their North Sami counterparts as the target data. In this way, the system learns a good representation of the target language, North Sami, and can learn what kind of changes are possible between cognates in general. Thus, the model can learn additional information about cognates that would not be present in the North Sami-Skolt Sami parallel data.

In order to make the model adapt more to the North Sami-Skolt Sami pair in particular, we continue training the model with only the North Sami-Skolt Sami parallel data for an additional 10 epochs. The idea behind this is to bring the model closer to the language pair of interest in this research, while still maintaining the additional knowledge it has learned about cognates in general from the larger dataset.
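The preprocessing step is not shown in the paper, but character-level NMT conventionally represents each word as a space-separated sequence of characters. A minimal sketch of that conversion, assuming plain tab-separated (source, target) word lists; the file names are illustrative:

    # Sketch: turn word pairs into character-level parallel files for
    # OpenNMT-style training. File names and the tab-separated input
    # format are assumptions for illustration.

    def to_char_seq(word: str) -> str:
        # "kiõtt" -> "k i õ t t": each character becomes one "token".
        return " ".join(word)

    with open("cognates.tsv", encoding="utf-8") as f, \
         open("src.txt", "w", encoding="utf-8") as src, \
         open("tgt.txt", "w", encoding="utf-8") as tgt:
        for line in f:
            source_word, target_word = line.rstrip("\n").split("\t")
            src.write(to_char_seq(source_word) + "\n")
            tgt.write(to_char_seq(target_word) + "\n")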

3.3 Using SMT to Generate More Data

Research in machine translation has shown that generating more synthetic parallel data that is noisy in the source language end but not in the target end can improve the overall translations of an NMT model (Sennrich et al., 2015). In light of this finding, we try a similar idea in our cognate detection task as well.

1 Version from the project's master branch on 13 April 2018.

Due to the limited amount of North Sami-Skolt Sami training data available, we use SMT instead of NMT to train a model that produces plausible but slightly irregular Skolt Sami cognates for the list of North Sami words obtained from the Giellatekno dictionaries.

We use the Moses (Koehn et al., 2007) baseline2 to train a translation model in the opposite direction of the NMT model, i.e. translating from North Sami to Skolt Sami. We use the same parallel data as for the NMT model, meaning that on the source side we have North Sami and on the target side we have all the possible cognates in other languages. The parallel data is aligned with GIZA++ (Och and Ney, 2003).

Since we are training an SMT model, there are two ways we can make the noisy target side of all the other languages resemble Skolt Sami more closely. One is by using a language model. For this, we build a 10-gram language model with KenLM (Heafield et al., 2013) from the Skolt Sami words recorded in the Giellatekno dictionaries.

The other way of making the model more aware of Skolt Sami in particular is to tune the SMT model after the initial training. For the tuning, we use the Skolt Sami-North Sami parallel data exclusively, so that the SMT model will move more towards Skolt Sami when producing cognates.

We use the SMT model to translate all of the words extracted from the North Sami dictionary into Skolt Sami. This results in a parallel dataset of real, existing North Sami words and words that resemble Skolt Sami. We then use this data to continue the training of the previously described NMT model for 10 additional epochs.
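Since the language model, like the translation model, operates over characters, the KenLM training corpus is presumably one character-tokenised word per line. A minimal sketch of preparing such a corpus under that assumption (paths are illustrative; the resulting file can then be passed to KenLM's lmplz with order 10):

    # Sketch: write Skolt Sami dictionary words as character sequences,
    # one word per line, as training text for a character n-gram LM
    # (e.g. KenLM's "lmplz -o 10"). Paths are illustrative assumptions.

    with open("skolt_sami_lemmas.txt", encoding="utf-8") as f, \
         open("lm_corpus.txt", "w", encoding="utf-8") as out:
        for line in f:
            word = line.strip()
            if word:
                # "toll" -> "t o l l"
                out.write(" ".join(word) + "\n")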

3.4 Using the NMT Models

We use both of the NMT models, i.e. the one without the SMT-generated additional data and the one with it, separately, to assess the difference in their performance. We feed in the Skolt Sami words extracted from the dictionary and translate each word into the North Sami word it would correspond to if there were a cognate for that word in North Sami.

The approach produces many non-words, which we filter out using the North Sami dictionary. The resulting list of translated words that are actually found in the North Sami dictionary is considered to be the set of potential cognates found by the method.

2 As described in http://www.statmt.org/moses/?n=moses.baseline
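A minimal sketch of this dictionary filter, assuming the NMT input and output files and the North Sami lemma list are plain text (names are illustrative):

    # Sketch: keep only NMT outputs that are real North Sami words.
    # File names are illustrative assumptions.

    with open("north_sami_lemmas.txt", encoding="utf-8") as f:
        lexicon = {line.strip() for line in f if line.strip()}

    candidates = []
    with open("skolt_input.txt", encoding="utf-8") as src, \
         open("nmt_output.txt", encoding="utf-8") as hyp:
        for skolt, translated in zip(src, hyp):
            # Undo the character-level tokenisation of the NMT output.
            word = translated.strip().replace(" ", "")
            if word in lexicon:
                candidates.append((skolt.strip().replace(" ", ""), word))

    for skolt, north in candidates:
        print(f"{skolt}\t{north}")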

4 Results and Evaluation

In this section, we present the results of both of the NMT models, the one without SMT-generated data and the one with the generated data. The results shown in Table 1 indicate that the model with the additional SMT-generated data outperformed the other model. The evaluation is based on 200 randomly selected cognate pairs output by the models. These pairs have been checked by an expert linguist according to the principles outlined in Section 4.1.

            NMT     NMT + SMT
accuracy    67.5%   83%

Table 1: Percentage of correctly found cognates

Table 2 gives more insight into the number of cognates found and how they are represented in the original Álgu database. The results show that while the models have poor performance in finding the cognates in the training data, they work well in extending the cognates outside of the known cognate list.

                                            NMT    NMT + SMT
Same as in Álgu                             75     61
North Sami word in Álgu but no
  cognates with Skolt Sami                  211    226
North Sami word in Álgu with
  other Skolt Sami cognates                 646    577
North Sami word not in Álgu                 848    936
Cognates found in total                     1780   1800

Table 2: Distribution of cognates in relation to Álgu

As one of the purposes of our work is to help evaluate and develop etymological research, we will conduct a more qualitative analysis of the correctly and incorrectly identified cognates for the better working model, i.e. the one with SMT-generated data. This means that the developers should be aware not only of the comparative-linguistic rules etymologists use when assessing the regularity of cognate candidates, but of semantics as well.

In an introduction to Samic language history (Korhonen, 1981: 110–114), the proto-language is divided into 4 separate phases. The first phase involves vowel changes in the first and second syllables (ä–e » e–ê, u–e » ô–ê, e–ä » e–a, i–e » ê–ê, etc.) followed by vowel rotation in the first syllable dependent on the quality of the second-syllable vowel (e–e » e–ê but e–ä » ε–a). The second phase entails the loss of quantitative distinction for first-syllable vowels such that high vowels are normalized as short, and non-high first-syllable vowels are normally long (e–ê » e–ê, ε–a » ε–e, etc.). The third phase involves a slight vowel shift in the first syllable and vowel split in the second. And finally, the fourth phase introduces diphthongization of non-high vowels in the first syllable (e–ê » ie–ê, ε–e » eä–e, etc.).

In Table 3, below, we provide an approximation of a few hypothesized sound changes for three words that are attested in Balto-Finnic and Mordvin alike. *käte 'hand; arm' has true cognates in Finnish käsi, Northern Sami giehta, Skolt Sami kiõtt, Erzya ked' and Moksha käd'; *tule 'fire' is represented by Finnish tuli, Northern Sami dolla, Skolt Sami toll, and Mordvin tol; while *pesä 'nest' is attested in Finnish pesä, Northern Sami beassi, Skolt Sami pie′ss, Erzya pize, and Moksha piza. The Roman numerals in the table correspond to the four separate phases of Proto-Samic.

               I      II     III    IV
'hand; arm'    käte   ketê   ketê   kietê
'fire'         tule   tôlê   tôlê   tolê
'nest'         pesä   pεsa   pεse   peäse

Table 3: Illustration of some vowel correlations in 4 phases of Proto-Samic

In the evaluation, our attention was drawn to the adherence of 143 items to accepted sound correlations, while 57 candidates failed in this respect (cf. Korhonen, 1981; Lehtiranta, 2001; Aikio, 2009). Irregular sound correlation can be exemplified by the North Sami word bierdna 'bear' and its Skolt Sami counterpart peä′rnn (the ′ prime indicates palatalization in Skolt Sami orthography) 'bear cub'. The former appears to represent the word type found in 'hand': North Sami giehta and Skolt Sami kiõtt, whereas the latter represents the word type found in 'nest': North Sami beassi and Skolt Sami pie′ss, and 'swamp': North Sami jeaggi and Skolt Sami jeä′gg. Hence, on the basis of the North Sami word bierdna 'bear', one would posit a Skolt Sami form *piõrnn, whereas the Skolt Sami word peä′rnn 'bear cub' would presuppose a North Sami form *beardni. Both types have firm representation in both languages, so it would seem that these borrowings have entered the languages at separate points in the spatio-temporal dimensions.

4.1 Analysis of the Correct Cognates

Correct cognates were selected according to two simple principles of similarity. On the one hand, there was the principle of conceptual similarity in their referential denotations (i.e., this refers to future work in semantic relations). On the other hand, a feasible cognate pair candidate should demonstrate adherence or near adherence to accepted sound law theory. The question of adherence versus near adherence indicated here can be directed to concepts of sound law theory, where conceivable irregularities may further be attributed to points in the spatio-temporal dimensions (i.e., when and where a particular word was introduced into the lexica of the two languages involved in the investigation).

In the investigation of 200 random cognate pair candidates, 166 cognate pair candidates exhibited conceptual similarity, which in some instances surpassed what might have been discovered using a bilingual dictionary. Of the 166 acceptable cognate pairs, 131 candidate pairs demonstrated regular correlation to received sound law theory.

Adherence to concepts of sound law theory can be observed in the alignment of the North Sami words cuoika 'mosquito' and ada 'marrow' with their Skolt Sami counterparts cuõškk and õõd, respectively. Although these words may appear opaque to the layman, and this alignment might thus be deemed dubious at first, awareness of the cognate candidates in Erzya Mordvin, seske 'mosquito' and ud′em 'marrow', helps to alleviate initial misgivings.

As may be observed above, North Sami frequently has two-syllable words where Skolt Sami attests to single-syllable words. This relative length correlation between North Sami and Skolt Sami is described through measurement in prosodic feet (cf. Koponen and Rueter, 2016). While North Sami exhibits retention of the theoretical stem-final vowel in two-syllable words, Skolt Sami appears to have lost it. In fact, stem-final vowels in Skolt Sami are symptomatic of proper nouns and participles (derivations). In contrast, two-syllable words with consonant-final stems appear in both North Sami and Skolt Sami, which means we can expect a number of basic verbs attesting to two-syllable infinitives (North Sami bidjat 'put' and Skolt Sami piijjâd) and contract-stem nouns (North Sami guoppar and Skolt Sami kuõbbâr 'mushroom'). Upon inspection of longer words, it appears that Skolt Sami words are at least one syllable shorter than their cognates in North Sami, which can be attributed to variation in foot development.

Cognate word correlations between North Sami and Skolt Sami can be approached by counting syllables in the lemmas (dictionary forms). Through this approach, we can attribute some word lengths automatically to parts of speech, i.e. there is only one single-syllable verb in Skolt Sami, lee′d 'be', and it correlates to a single-syllable verb in North Sami, leat. Other verbs are therefore two or more syllables in length in both languages. While two-syllable verbs in North Sami correlate with two-syllable verbs in Skolt Sami, multiple-syllable verbs in North Sami usually correlate to Skolt Sami counterparts in an X<=>X-1 relationship (number of syllables in the respective language forms), where North Sami is one orthographical syllable longer.
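The paper does not specify how orthographical syllables are counted; a crude but common approximation is to count maximal vowel groups. The sketch below follows that assumption, with an illustrative and certainly incomplete vowel inventory for the two orthographies:

    # Sketch: approximate orthographical syllable counts by counting
    # maximal vowel groups. The vowel inventory below is an illustrative
    # assumption, not an exhaustive list for either orthography.

    import re

    VOWELS = "aeiouyáâäåõöæø"

    def syllable_count(word: str) -> int:
        # Each maximal run of vowel letters counts as one syllable.
        return len(re.findall(f"[{VOWELS}]+", word.lower()))

    # North Sami 'bidjat' (2) vs Skolt Sami 'piijjâd' (2);
    # North Sami 'giehta' (2) vs Skolt Sami 'kiõtt' (1).
    for ns, sks in [("bidjat", "piijjâd"), ("giehta", "kiõtt")]:
        print(ns, syllable_count(ns), "<=>", sks, syllable_count(sks))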

Short word pairs demonstrate a clear correlation between two-syllable base words in North Sami and single-syllable base words in Skolt Sami, which, with the exception of 4 words, were all nouns (54 all told). The high concentration of noun attestation for single-syllable cognate nouns in the two languages under investigation is counterbalanced by the representation of verbs in other word-length correlation groups.

Cognate pairs where both North Sami and Skolt Sami attested to two-syllable lemmas were overrepresented by verbs. There were 47 verbs, 22 nouns, 10 adjectives and 1 adverb. This result is symptomatic of lemma-based research. Surprisingly enough, however, the three-to-two syllable correlation between North Sami and Skolt Sami also showed a similar representation: verbs (13), nouns (2) and adverbs (1).

There was one attested correlation for a 5-syllable word, eŋgelasgiella 'English language' in North Sami, and its 3-syllable counterpart in Skolt Sami, eŋgglõskiõll. Since we are looking at a compound word with 3-to-2 and 2-to-1 correlations, we can assume that our model is recognizing individual adjacent segments within a larger unit.

Correct cognates do not necessarily require etymologically identical source forms or structure. The recognized cognate pairs represent both recent loan words or possible irregularities in sound law theory (35) and presumably older mutual lexicon (131) (see Section 4.2, below). They also attest to differing structure and length (i.e., this may also include derivation and compounding). While the majority of the cognate candidate pairs linked words sharing the same derivational level, 11 represented instances of additional derivation in either the North Sami or the Skolt Sami word, and 3 were recognized instances where one of the languages was represented by a compound word.

4.2 Analysis of the Incorrect Cognates

Incorrect cognates often offer vital input for cognate detection development. There are, of course, word pairs that diverge with regard to both accepted sound law theory and semantic cohesion. These pairs have not yet been applied to development. In contrast, word pairs that appear to adhere to sound law theory yet are not matched semantically might be regarded as false friends. These pairs can be potentially useful in further development.

Of the 34 semantically non-feasible candidates, 12 stood out as false friends. One such example pair is observed in the North Sami álgu 'beginning' and the Skolt Sami älgg 'piece of firewood'. These two words, it should be noted, can be associated with the Finnish cognates alku 'beginning' and halko 'piece of split firewood' [NB! there is a loss of the word-initial h], respectively. Since the theoretically expected vowel equivalent of the first-syllable a in Finnish is uo and ue in North Sami and Skolt Sami, respectively, we might assume that neither word comes from a mutual Samic-Finnic proto-language.

We do not know to what extent random selection has affected our results. Had the first North Sami noun álgu been replaced with its paradigmatic verb álgit 'begin', the Skolt Sami ä′lgged, also translated as 'begin', would have shown a direct correlation for á and ä in North Sami and Skolt Sami, respectively. The second noun, meaning 'piece of split firewood', is actually hálgu in North Sami, which simply demonstrates h retention and the possible recognition problems faced in the absence of semantic knowledge.

4.3 Summary of the Analyses

The cognate candidates were evaluated according to two criteria: one query checked for conceptual similarity (correct vs incorrect), and the other checked for regularity according to received sound law theory. While the majority (65%) of the word pairs evaluated were both conceptually similar and correlated to received sound law theory, an additional 17% of the candidates represented irregular sound correlation, as indicated by the figures 131 and 35 in Table 4, respectively.

            Similar   Dissimilar
Regular     131       12
Irregular   35        22

Table 4: Cognate candidate evaluation

The 11% of candidates (22 pairs) that scored negatively on both sound law regularity and conceptual similarity indicates that an improvement of at least 6 percentage points is required before the machine can be considered relevant (95%). The 6% attestation of false friend discovery (12 pairs), however, displays an already existing accuracy in our algorithm.

5 Conclusions and Future Work

In this paper, we have shown that using a character-based NMT model is a feasible way of expanding a list of cognates by training the model mostly on the cognate pairs for North Sami words in languages other than Skolt Sami. Furthermore, we have shown that an SMT model can be used to generate synthetic parallel data, pushing the model more towards Skolt Sami by introducing a Skolt Sami language model and tuning the model with Skolt Sami-North Sami parallel data.

In our evaluation, we have only considered the best cognate produced by the NMT model, with the idea of a one-to-one mapping. However, it is possible to make the NMT model output more than one possible translation. In the future, we can conduct more evaluation on a list of top candidates to see whether the model is able to find more than one cognate for a given word and whether the overall recall can be improved for the words where the top candidate has been rejected by the dictionary check as a non-word.

We have currently limited our research to cognates between Skolt Sami and North Sami, where the translation direction of the NMT model has been towards North Sami. An interesting future direction would be to change the translation direction. In addition to that, we are also interested in trying this method out on other languages recorded in the Álgu database.

We are also interested in conducting research that is more linguistic in nature, based on the cognate list produced in this paper. This will shed more light on the current linguistic knowledge of cognates in the Sami languages. The current results of the better working NMT model are released in the Online Dictionary for Uralic Languages.3

3 http://akusanat.com/

References

Ante Aikio. 2009. The Saami Loan Words in Finnish and Karelian, first edition. Faculty of Humanities of the University of Oulu, Oulu. Dissertation.

Alina Maria Ciobanu and Liviu P. Dinu. 2014. Automatic detection of cognates using orthographic alignment. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 99–105.

Mika Hämäläinen and Jack Rueter. 2018. Advances in Synchronized XML-MediaWiki Dictionary Development in the Context of Endangered Uralic Languages. In Proceedings of the Eighteenth EURALEX International Congress, pages 967–978.

Mika Hämäläinen, Tanja Säily, Jack Rueter, Jörg Tiedemann, and Eetu Mäkelä. 2018. Normalizing early English letters to present-day English spelling. In Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 87–96.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 690–696.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In Proc. ACL.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Eino Koponen and Jack Rueter. 2016. The first complete scientific Skolt Sami grammar in English. FUF 63, pages 254–266.

Mikko Korhonen. 1981. Johdatus lapin kielen historiaan, volume 370 of Suomalaisen Kirjallisuuden Seuran toimituksia. Suomalaisen Kirjallisuuden Seura, Helsinki.

Kotus. 2006. Álgu-tietokanta. Saamelaiskielten etymologinen tietokanta. http://kaino.kotus.fi/algu/.

Juhani Lehtiranta. 2001. Yhteissaamelainen sanasto, second edition, volume 200 of Suomalais-Ugrilaisen Seuran toimituksia. Suomalais-Ugrilainen Seura, Helsinki.

Sjur N. Moshagen, Tommi A. Pirinen, and Trond Trosterud. 2013. Building an open-source development infrastructure for language technology projects. In Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), NEALT Proceedings Series 16, pages 343–352. Linköping University Electronic Press.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Yves Peirsman and Sebastian Padó. 2010. Cross-lingual induction of selectional preferences with bilingual vector spaces. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 921–929. Association for Computational Linguistics.

Taraka Rama. 2016. Siamese convolutional networks for cognate identification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1018–1027.

Taraka Rama, Johannes Wahle, Pavel Sofroniev, and Gerhard Jäger. 2017. Fast and unsupervised methods for multilingual cognate clustering. arXiv preprint arXiv:1702.04938.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.

Adam St. Arnaud, David Beck, and Grzegorz Kondrak. 2017. Identifying cognate sets across dictionaries of related languages. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2519–2528.


Seeing more than whitespace — Tokenisation and disambiguation in a North Sámi grammar checker

Linda Wiechetek
UiT The Arctic University of Norway
[email protected]

Kevin Brubeck Unhammer
Trigram AS
[email protected]

Sjur Nørstebø Moshagen
UiT The Arctic University of Norway
[email protected]

Abstract

Communities of lesser resourced languages like North Sámi benefit from language tools such as spell checkers and grammar checkers to improve literacy. Accurate error feedback is dependent on well-tokenised input, but traditional tokenisation as shallow preprocessing is inadequate to solve the challenges of real-world language usage. We present an alternative where tokenisation remains ambiguous until linguistic context information is available. This lets us accurately detect sentence boundaries, multiwords and compound errors. We describe a North Sámi grammar checker with such a tokenisation system, and show the results of its evaluation.

1 Introduction

Bilingual users frequently face bigger challenges regarding literacy in the lesser used language than in the majority language due to reduced access to language arenas (Outakoski, 2013; Lindgren et al., 2016). However, literacy and in particular writing is important in today's society, both in social contexts and when using a computer or a mobile phone. Language tools such as spellcheckers and grammar checkers therefore play an important role in improving literacy and the quality of written text in a language community.

North Sámi is spoken in Norway, Sweden and Finland by approximately 25,700 speakers (Simons and Fennig, 2018), and written in a number of institutions like the daily Sámi newspaper (Ávvir1), a few Sámi journals, and the websites and social media of the Sámi radio and TV (e.g. YleSápmi2). In addition, the Sámi parliaments, the national governments, and a Sámi university college produce North Sámi text.

An open-source spellchecker for North Sámi has been freely distributed since 2007 (Gaup et al., 2006).3 However, a spellchecker is limited to looking only at one-word contexts. It can only detect non-words, i.e. words that cannot be found in the lexicon. A grammar checker, however, looks at contexts beyond single words, and can correct misspelled words that are in the lexicon but wrong in the given context. In addition, a grammar checker can detect grammatical and punctuation errors.

1 https://avvir.no/ (accessed 2018-10-08)
2 https://yle.fi/uutiset/osasto/sapmi/ (accessed 2018-10-08)

A common error in North Sámi and other compounding languages is to spell compound words as separate words instead of one. The norm typically requires them to be written as one word, with the non-final components being in nominative or genitive case if they are nouns. This reflects a difference in meaning between two words written separately and the same two words written as a compound. Being able to detect and correct such compounding errors is thus important for the language community.

This paper presents and evaluates a grammar checker framework that handles ambiguous tokenisation, and uses that to detect compound errors, as well as to improve sentence boundary detection after abbreviations and numeral expressions. The framework is completely open source, and completely rule-based. The evaluation is done manually, since gold standards for North Sámi tokenisation had not been developed prior to this work.

2 Background

The system we present is part of a full-scale grammar checker (Wiechetek, 2017, 2012). Before this work, there were no grammar checkers for the Sámi languages, although some grammar checker-like work has been done in the language learning platform Oahpa (Antonsen, 2012). However, there have been several full-scale grammar checkers for other Nordic languages, most of them implemented in the rule-based framework Constraint Grammar (CG). Lingsoft distributes grammar checkers for the Scandinavian languages,4 some of which are or have been integrated into MS Word; a stand-alone grammar checker like Grammatifix (Arppe, 2000) is also available from Lingsoft. Another widely used, mostly rule-based and free/open-source system is LanguageTool (Milkowski, 2010), though this does not yet support any Nordic languages. Other CG-based checkers are OrdRet (Bick, 2006) and DanProof (Bick, 2015) for Danish.

3 In addition to that, there are syntactic disambiguation grammars, machine translators, dictionaries and a tagged searchable online corpus.

2.1 Framework

The central tools used in our grammar checker are finite state transducers (FST's) and CG rules. CG is a rule-based formalism for writing disambiguation and syntactic annotation grammars (Karlsson, 1990; Karlsson et al., 1995). The vislcg3 implementation5 we use also allows for dependency annotation. CG relies on a bottom-up analysis of running text. Possible but unlikely analyses are discarded step by step with the help of morpho-syntactic context.

All components are compiled and built using the Giella infrastructure (Moshagen et al., 2013). This infrastructure helps linguists coordinate resource development using common tools and a common architecture. It also ensures a consistent build process across languages, and makes it possible to propagate new tools and technologies to all languages within the infrastructure. That is, the progress described in this paper is immediately available to all languages in the Giella infrastructure, barring the necessary linguistic work.

The North Sámi CG analysers take morphologically ambiguous input, which is the output from analysers compiled as FST's. The source of these analysers is written in the Xerox twolc6 and lexc (Beesley and Karttunen, 2003) formalisms, compiled and run with the free and open source package HFST (Lindén et al., 2011).

We also rely on a recent addition to HFST, hfst-pmatch (Hardwick et al., 2015) (inspired by Xerox pmatch (Karttunen, 2011)) with the runtime tool hfst-tokenise. Below we describe how this lets us analyse and tokenise in one step, using FST's to identify regular words, multiword expressions and potential compound errors.

4 http://www2.lingsoft.fi/doc/swegc/errtypes.html (accessed 2018-10-08)
5 http://visl.sdu.dk/constraint_grammar.html (accessed 2018-10-08), also Bick and Didriksen (2015)
6 Some languages in the Giella infrastructure describe their morphophonology using Xfst rewrite rules; both twolc and rewrite rules are supported by the Giella infrastructure.

It should be noted that the choice of rule-based technologies is not accidental. The complexity of the languages we work with, and the general sparsity of data, make purely data-driven methods inadequate. Additionally, rule-based work leads to linguistic insights that feed back into our general understanding of the grammar of the language. We chose a Constraint Grammar rule-based system since it is one we have long experience with, and it has proven itself to be competitive in both high- and low-resource scenarios. For example, DanProof (Bick, 2015, p. 60) scores more than twice that of Word2007 on the F1 measure (72.0% vs 30.1%) for Danish grammar checking. CG also compares favourably to modern deep learning approaches, e.g. DanProof's F0.5 (weighting precision twice as much as recall) score is 80.2%, versus the 72.0% reported by Grundkiewicz and Junczys-Dowmunt (2018).

In addition, most current approaches rely very much on large-scale manually annotated corpora,7 which do not exist for North Sámi. It makes sense to reuse large already existing corpora for training language tools. However, in the absence of these, it is more economical to write grammars of hand-written rules that annotate a corpus linguistically and/or do error detection/correction. As no other methods for developing error detection tools exist for North Sámi or similar languages in comparable situations (low-resourced in terms of annotated corpus, weak literacy, higher literacy in the majority languages), it is impossible for us to provide a comparison with other technologies.

2.2 Motivation

This section describes some of the challenges that led to the development of our new grammar checker modules.

A basic feature of a grammar checker is to correct spelling errors that would be missed by a spell checker, that is, orthographically correct words that are nevertheless wrong in the given context.

(1) Beroštupmi   gáktegoarrun|gursii
    interest     costume.sewing|course.ILL
    'An interest in a costume sewing course'

7 "Automatic grammatical error correction (GEC) progress is limited by corpora available for developing and evaluating systems." (Tetreault et al., 2017, p. 229)


In the North Sámi norm, (nominal) compounds are generally written as one word; it is an error to insert a space at the compound border. Ex. (1) marks the compound border with a pipe.

(2) *Beroštupmi gáktegoarrun gursii

If the components of a compound are separated by a space as in ex. (2) (cf. the correct spelling in ex. (1)), the grammar checker should detect a compound spacing error.

Compound errors cannot be found by means of a non-contextual spellchecker, since adjacent nominals are not automatically compound errors. They may also have a syntactic relation. Our lexicon contains both the information that gáktegoarrungursii would be a legal compound noun if written as one word, and the information needed to say that gáktegoarrun gursii may have a syntactic relation between the words, that is, that they are independent tokens each with their own analysis.8 We therefore assume ambiguous tokenisation. In order to decide which tokenisation is the correct one, we need context information.

In addition, there is the issue of combinatorial explosion. For example, the bigram guhkit áiggi 'longer time' may be a compound error in one context, giving an analysis as a single noun token. But it is also ambiguous with sixteen two-token readings, where the first part may be an adjective, adverb or verb. We want to include these as alternative readings.

A naïve solution to getting multiple, ambiguous tokenisations of a string like guhkit áiggi would be to insert an optional space at the compound border in the entry for dynamic compounds, with an error tag. But if we analyse by longest match, the error reading would be the only possible reading. We could make the error tag on the space optional, which would make the entry ambiguous between adjective+noun and compound, but we would still be missing the adverb/verb+noun alternatives, which do not have a compound border between them. To explicitly encode all correct alternatives to compound errors in the lexicon, we would need to enter readings for e.g. verb+noun bigrams simply because they happen to be ambiguous with an error reading of a nominal compound.

Manually adding every bigram in the lexicon that happens to be ambiguous with an error would be extremely tedious and error-prone. Adding it automatically through FST operations turns out to quickly exhaust memory and multiply the size of the FST. Our solution would need to avoid this issue.

8 The non-head noun sometimes has an epenthetic only when used as a compound left-part, information which is also encoded in the lexicon.

(3) omd.          sámeskuvllas
    for.example   Sámi.school.LOC
    'for example in the Sámi school'

(4) omd.          Álttás     sámeskuvllas
    for.example   Alta.LOC   Sámi.school.LOC
    'for example in Alta in the Sámi school'

In the fragment in ex. (3)–(4) above, the period after the abbreviation omd. 'for example' is ambiguous with a sentence boundary. In the first example, we could use the local information that the noun sámeskuvllas 'Sámi school (Loc.)' is lowercase to tell that it is not a sentence boundary. However, the second sentence has a capitalised proper noun right after omd., and the tokenisation is less straightforward. We also need to know that, if it is to be two tokens instead of one, the form splits before the period, that the tags belonging to "<omd>" go with that form, and that the tags belonging to "<.>" go with that one. That is, we need to keep the information of which substrings of the form go with which readings of the whole, ambiguously-tokenised string.

As this and the previous examples show, we need context information to resolve the ambiguity; this means we need to defer the resolution of ambiguous tokenisation until after we have some of the morphological/syntactic/semantic analysis available.

(5) Itgo        don   muitte
    not.SG2.Q   you   remember
    'Don't you remember'

(6) It        go   don   muitte
    not.SG2   Q    you   remember
    'Don't you remember'

Ex. (5) and (6) above are equivalent given the context – the space is just a matter of style, when used in this sense – but go appearing as a word on its own is locally ambiguous, since the question particle go may in other contexts be a conjunction (meaning 'when, that'). We want to treat Itgo 'don't you' as two tokens It+go; having equal analyses for the equal alternatives (after disambiguation) would simplify further processing. This can be encoded in the lexicon as one entry which we might be able to split with some postprocessing, but before the current work, our tools gave us no way to show what parts of the form corresponded to which tokens. A typical FST entry (here expanded for simplicity) might contain

ii+V+Sg2+TOKEN+go+Qst:itgo

Now we want to encode that the form splits between 'it' and 'go', that 'ii+V+Sg2' belongs to 'it', and that 'go+Qst' belongs to 'go'. But inserting a symbol into the form would mean that the form no longer analyses; we need to somehow mark the split-point.

Our system solves all of the above issues – we explain how below.

3 Method

Below we present our grammar checker pipeline, and our method to analyse and resolve ambiguous tokenisation. We first describe the system architecture of the North Sámi grammar checker, then our morphological analysis and tokenisation method, and finally our method of finding errors by disambiguating ambiguous tokenisation.

3.1 System architecture

The North Sámi grammar checker consists of different modules that can be used separately or in combination, cf. Figure 1.

The text is first tokenised and morphologically analysed by the descriptive morphological analyser tokeniser-gramcheck-gt-desc.pmhfst, which has access to the North Sámi lexicon with both error tags and lexical semantic tags. The following step, analyser-gt-whitespace.hfst, detects and tags whitespace errors. It also tags the first words of paragraphs and other whitespace-delimited boundaries, which can then be used by the boundary detection rules later on; this enables detecting e.g. headers based on their surrounding whitespace.

The valency annotation grammar valency.cg3 adds valency tags to potential governors. Then follows the module that disambiguates ambiguous tokenisation, mwe-dis.cg3, which can select or remove compound readings of multi-word expressions based on the morpho-syntactic context and valencies. It can also decide whether punctuation is a sentence boundary or not. The next module, divvun-cgspell, takes unknown words and runs them through our spell checker, whose suggestions include morphological analyses.

The next module is the CG grammar grc-disambiguator.cg3, which performs morpho-syntactic analysis and disambiguation, except for the speller suggestions, which are left untouched. The disambiguator is followed by a CG module, spellchecker.cg3, which aims to reduce the suggestions made by the spellchecker by means of the grammatical context. The context is now partly disambiguated, which makes it easier to decide which suggestions to keep, and which not.9

The last CG module is grammarchecker.cg3, which performs the actual error detection and correction, mostly for error types other than spelling or compound errors. The internal structure of grammarchecker.cg3 is more complex; local case error detection takes place after local error detection, governor-argument dependency analysis, and semantic role mapping, but before global error detection.

Finally, the correct morphological forms are generated from the tag combinations suggested by grammarchecker.cg3 by means of the normative morphological generator generator-gt-norm.hfstol, and suggested to the user along with a short feedback message describing the identified error.
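As a rough illustration of how these modules chain together, the following sketch pipes text through the tokeniser and a sequence of CG grammars as subprocesses. The command-line flags are assumptions for illustration (and the speller modules are omitted); the actual invocations are managed by the Giella infrastructure build.

    # Sketch: chaining pipeline modules as subprocesses. Commands and
    # flags are illustrative assumptions, not the exact Giella invocation.

    import subprocess

    def run_pipeline(text: str) -> str:
        stages = [
            ["hfst-tokenise", "tokeniser-gramcheck-gt-desc.pmhfst"],
            ["vislcg3", "-g", "valency.cg3"],          # valency annotation
            ["vislcg3", "-g", "mwe-dis.cg3"],          # tokenisation disambiguation
            ["cg-mwesplit"],                           # split chosen multi-token cohorts
            ["vislcg3", "-g", "grc-disambiguator.cg3"],
            ["vislcg3", "-g", "grammarchecker.cg3"],
        ]
        data = text.encode("utf-8")
        for cmd in stages:
            # Each stage reads the previous stage's output on stdin.
            data = subprocess.run(cmd, input=data,
                                  stdout=subprocess.PIPE,
                                  check=True).stdout
        return data.decode("utf-8")

    print(run_pipeline("Beroštupmi gáktegoarrun gursii"))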

3.2 Ambiguous tokenisation

A novel feature of our approach is the support for different kinds of ambiguous tokenisation in the analyser, and how we disambiguate ambiguous tokens using CG rules.

We do tokenisation as part of morphological analysis using the hfst-tokenise tool, which does a left-to-right longest-match analysis of the input, where matches are those given by a pmatch analyser. This kind of analyser lets us define tokenisation rules such as "a word from our lexicon may appear surrounded by whitespace or punctuation". The pmatch analyser imports a regular lexical transducer, and adds definitions for whitespace, punctuation and other tokenisation hints; hfst-tokenise uses the analyser to produce a stream of tokens with their morphological analysis in CG format.

As an example, hfst-tokenise will turn the input ii, de 'not, then' into three CG cohorts:

"<ii>"
    "ii" V IV Neg Ind Sg3 <W:0>
"<,>"
    "," CLB <W:0>
:
"<de>"
    "de" Adv <W:0>
    "de" Pcle <W:0>

9 In later work done after the submission, we tried using grc-disambiguator.cg3 again after applying spellchecker.cg3, this time allowing it to remove speller suggestions. Given that the context was now disambiguated, and problematic speller suggestion cases had been handled by spellchecker.cg3, it disambiguated the remaining speller suggestions quite well, and left us with just one or a few correct suggestions to present to the user.

Figure 1: System architecture of the North Sámi grammar checker

The space between the words is printed after the colon. The analyses come from our lexical transducer.
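Before turning to the tokenisation rules themselves, note that this CG stream format is easy to parse for downstream scripting: a line in angle-quoted form opens a cohort, and each following indented line is one reading. A minimal parser sketch (our own illustration, not part of the released pipeline):

    # Sketch: parse a CG cohort stream like the one above into
    # (wordform, [readings]) pairs. Illustration only.

    def parse_cg(stream: str):
        cohorts = []
        for line in stream.splitlines():
            if line.startswith('"<') and line.rstrip().endswith('>"'):
                # A new cohort line: "<wordform>"
                cohorts.append((line.strip()[2:-2], []))
            elif line.startswith(("\t", " ")) and cohorts:
                # An indented reading line belongs to the last cohort.
                cohorts[-1][1].append(line.strip())
        return cohorts

    example = '"<ii>"\n\t"ii" V IV Neg Ind Sg3 <W:0>\n"<de>"\n\t"de" Adv <W:0>\n\t"de" Pcle <W:0>\n'
    for form, readings in parse_cg(example):
        print(form, readings)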

Define morphology @bin"analyser.hfst" ;
Define punctword morphology &
                 [ Punct:[?*] ] ;
Define blank Whitespace | Punct ;
Define morphoword morphology
       LC([blank | #])
       RC([blank | #]) ;
regex [ morphoword | punctword ];

The above pmatch rules say that a word from the lexicon (analyser.hfst) has to be surrounded by a "blank", where a blank is either whitespace or punctuation. The LC/RC are the left and right context conditions. We also extract (intersect) the subset of the lexicon where the form is punctuation, and allow that to appear without any context conditions.

We insert re-tokenisation hints in the lexicon at places where we assume there is a possible tokenisation border, and our changes to hfst-tokenise let the analyser backtrack and look for other tokenisations of the same input string. That is, for a given longest-match tokenisation, we can force it to redo the tokenisation so that we get other multi-token readings with shorter segments alongside the longest match. This solves the issue of combinatorial explosion.

As a simple example, the ordinal analysis of 17. has a backtracking mark between the number and the period. If the lexicon contains the symbol pairs/arcs:

1:1 7:7 ε:@PMATCH_BACKTRACK@ ε:@PMATCH_INPUT_MARK@ .:A ε:Ord


then, since the form side of this analysis is 17., the input 17. will match, but since there was a backtrack symbol, we trigger a retokenisation. The input-mark symbol says where the form should be split.10 Thus we also get analyses of 17 and . as two separate tokens:

"<17.>"
    "17" A Ord Attr
    "." CLB "<.>"
        "17" Num "<17>"

To represent tokenisation ambiguity in the CG format, we use vislcg3 subreadings,11 where deeper (more indented) readings are those that appeared first in the stream, and any reading with a word-form tag ("<.>" above) should (if chosen by disambiguation) be turned into a cohort of its own. Now we may run a regular CG rule to pick the correct reading based on context, e.g. SELECT (".") IF (1 some-context-condition) ...; which would give us

"<17.>"
    "." CLB "<.>"
        "17" Num "<17>"

Then a purely mechanical reformatter named cg-mwesplit turns this into separate tokens, keeping the matching parts together:

"<17>"
    "17" Num
"<.>"
    "." CLB

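Mechanically, this reformatting step only needs the word-form tags embedded in the chosen reading. A simplified sketch of the idea (the real cg-mwesplit handles the full stream format):

    # Sketch of the cg-mwesplit idea: a chosen reading whose
    # subreadings carry word-form tags is split into one cohort per
    # word-form. Simplified illustration only.

    def mwesplit(readings):
        # readings: (analysis, wordform-tag) pairs, main reading first,
        # deepest subreading last.
        lines = []
        for analysis, wordform in reversed(readings):
            # The deepest subreading matched the earliest part of the
            # input string, so it becomes the first output cohort.
            lines.append(wordform)
            lines.append("\t" + analysis)
        return "\n".join(lines)

    print(mwesplit([('"." CLB', '"<.>"'), ('"17" Num', '"<17>"')]))
    # "<17>"
    #     "17" Num
    # "<.>"
    #     "." CLB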
We also handle possible compound errors with the above scheme. When compiling the lexical transducer, we let all compound boundaries optionally be realised as a space. Two successive nouns like illu sáhka 'happiness news (i.e. happy news)' will be given a compound analysis which includes an error tag. We also insert a backtracking symbol with the space, so that the tokenisation tool knows that the compound analysis is not necessarily the only one (but without having to explicitly list all possible alternative tokenisations). If the retokenisation finds that the nouns can be analysed and tokenised independently, then those tokens and analyses are also printed.

"<illu sáhka>"
    "illusáhka" N Sg Nom Err/SpaceCmp
    "sáhka" N Sg Nom "< sáhka>"
        "illu" N Sg Nom "<illu>"

Given such an ambiguous tokenisation, CG rules choose between the compound error and the two-token readings, using context information from the rest of the sentence. If the non-error reading was chosen, we get:

"<illu sáhka>"
    "sáhka" N Sg Nom "< sáhka>"
        "illu" N Sg Nom "<illu>"

which cg-mwesplit reformats to two cohorts:

"<illu>"
    "illu" N Sg Nom
"< sáhka>"
    "sáhka" N Sg Nom

10 This also means we cannot reshuffle the input/output side of the FST. In practice, we use a flag diacritic in the lexicon, which will keep its place during minimisation, and after the regular lexicon is compiled, we turn the flag into the ε:@PMATCH_INPUT_MARK@ symbol pair.
11 https://visl.sdu.dk/cg3/chunked/subreadings.html (accessed 2018-10-10)

3.3 Rule-based disambiguation of ambiguous tokenisation

As mentioned above, disambiguation of ambiguous tokenisation is done after morphological analysis. Consequently, this step has access to undisambiguated morphological (but not full syntactic) information. In addition, lexical semantic tags and valency tags are provided. The rules that resolve sentence boundary ambiguity are based on transitivity tags of abbreviations, lexical semantic tags, and morphological tags. Some of them are specific to one particular abbreviation.

Bi- or trigrams given ambiguous tokenisation can either be misspelled compounds (in North Sámi, typically two-part compounds are the norm) or two words with a syntactic relation. The assumption is that if a compound is lexicalised, two or more adjacent words may be analysed as a compound and receive an error tag (Err/SpaceCmp), using a CG rule such as the following:

SELECT SUB:* (Err/SpaceCmp) IF (NEGATE 0/* Err/MissingSpace OR Ess);

This rule selects the error reading unless any subreading of this reading (0/*) already has another error tag or is an essive case form.

This is the case unless any other previously applied rule has removed the error reading. Version r172405 of the tokenisation disambiguation grammar mwe-dis.cg3 has 40 REMOVE rules and 8 SELECT rules.

Compound errors are ruled out, for example, if the first word is in genitive case, as it can be the first part of a compound but also a premodifier. The simplified CG rule below removes the compound error reading if the first component is in the genitive, unless it receives a case error reading (nominative/accusative or nominative/genitive) or it is a lesser-used possessive reading and a non-human noun. The rule makes use of both morphological and semantic information.

REMOVE (Err/SpaceCmp) IF (0/1 Gen - Allegro - Err/Orth-nom-acc - Err/Orth-nom-gen - PX-NONHUM);

(7) Gaskavahku      eahkeda
    Wednesday.GEN   evening.ACC
    'Wednesday evening'

(8) áhpehis    nissonolbmuid
    pregnant   woman.ACC.PL
    'pregnant women'

In ex. (7), gaskavahku 'Wednesday' is in genitive case. The context needed to rule out a compound error is very local. In ex. (8), áhpehis 'pregnant', the first part of the potential compound, is an attributive adjective form. Here too, compound errors are categorically discarded.

(9) Paltto   lea   riegádan      jagi       1947
    Paltto   is    born.PRFPRC   year.ACC   1947
    'Paltto was born in 1947'

(10) galggai   buot   báikkiin       dárogiella      oahpahusgiellan
     should    all    place.LOC.PL   Norwegian.NOM   instructing.language
     'Norwegian had to be the instructing language in all places'

Other cases of compound error disambiguation, however, are more global. In ex. (9), riegádan jagi 'birth year (Acc.)' is a lexicalised compound. However, as it is preceded by a finite verb which is also a copula, i.e. lea 'is', the perfect participle form riegádan 'born' is part of a past tense construction ('was born'), and the compound error needs to be discarded.

In ex. (10), on the other hand, the relation between the first part of the bigram (dárogiella 'Norwegian') and the second part (oahpahusgiellan 'instructing language (Ess.)') is that of a subject to a subject predicate. The disambiguation grammar refers to a finite copula (galggai 'should') preceding the bigram.

4 Evaluation

In this section we evaluate the previously describedmodules of the North Sámi grammar checker.

Firstly, we evaluate the disambiguation of com-pound errors in terms of precision and recall. Thenwe compare our system for sentence segmentationwith an unsupervised system. Since a corpus withcorrectly annotated compound and sentence bound-ary tokenisation for North Sámi is not available, allevaluation and annotation is done from scratch. Weuse the SIKOR corpus (SIKOR2016)),12 a descrip-tive corpus which contains automatic annotationsfor linguistic research purposes, but no manuallychecked/verified tags. We selected a random cor-pus of administrative texts for two reasons. We hada suspicion that it would have many abbreviationsand cases of ambiguous tokenisation. Secondly,administrative texts stand for a large percentage ofthe total North Sámi text body, and the genre isthus important for a substantial group of potentialusers of our programs.

4.1 Compound error evaluation

For the quantitative evaluation of the disambiguation of potential compound errors we calculated both precision (the correct fraction of all marked errors) and recall (the correct fraction of all errors). The corpus used contains 340,896 space-separated strings, as reported by the Unix tool wc. The exact number of tokens will vary depending on tokenisation techniques, as described below.

The evaluation is based on lexicalised compounds as potential targets of ambiguous tokenisation. A previous approach allowed ambiguous tokenisation of dynamic compounds too, solely using syntactic rules to disambiguate. However, this led to many false positives (which would require more rules to avoid). Since our lexicon has over 110,000 lexicalised compounds (covering 90.5% of the compounds in the North Sámi SIKOR corpus), coverage is acceptable without the riskier dynamic compound support.13

Table 1 contains the quantitative results of the compound error evaluation. Of the 340,895 running bigrams in the text, there were a total of 4,437 potential compound errors, i.e. 1.30% of running bigrams are analysed as possible compounds by our lexicon. On manually checking, we found 458 of these to be true compound errors (0.13% of running bigrams, or 10.3% of the potential compound errors as marked by the lexicon). So the table indicates how well our Constraint Grammar disambiguates compound errors from words that are supposed to be written apart, and tells nothing of the work done by the lexicon in selecting possible compound errors (nor of possible compound errors missed by the lexicon).14

True positives     360
False positives    110
True negatives     3,869
False negatives    98
Precision          76.6%
Recall             78.6%

Table 1: Qualitative evaluation of CG compound error detection

12 SIKOR contains a range of genres; the part used for evaluation contains bureaucratic texts.
13 For less developed lexicons, the trade-off may be worth it.

Precision for compound error detection is well above the 67% threshold for any error type in a commercial grammar checker mentioned by Arppe (2000, p. 17), and the F0.5 score (weighting precision twice as much as recall) is 77.0%, above e.g. the 72.0% of Grundkiewicz and Junczys-Dowmunt (2018).15
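For reference, the F0.5 figure follows directly from the counts in Table 1 via the standard weighted F-measure; a quick check:

    # Quick check of the reported figures from Table 1's counts,
    # using the standard F-beta formula with beta = 0.5.

    tp, fp, fn = 360, 110, 98
    precision = tp / (tp + fp)    # 0.766
    recall = tp / (tp + fn)       # 0.786
    beta = 0.5
    f05 = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    print(f"P={precision:.1%} R={recall:.1%} F0.5={f05:.1%}")  # ~77.0%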

False positives occur, for example, in cases where there is an internal syntactic structure, such as in ex. (11), where both bálvalus 'service' and geavaheddjiide 'user (Ill. Pl.)' are participants in the sentence's argument structure. Since there is no finite verb, the syntactic relation could only be identified by defining the valency of bálvalus 'service'.

(11) Buoret   bálvalus         geavaheddjiide
     better   service.NOM.SG   user.ILL.PL
     'Better service to the users'

A number of the false negatives (cf. ex. (12)) are due to frequent expressions including lágan (i.e. iešgudetlagan 'different', dánlágan 'this kind of', etc.), which need to be resolved by means of an idiosyncratic rule. Dan and iešgudet are genitive or attributive pronoun forms and not typically part of a compound, so a syntactic rule alone does not resolve the problem.

14 We have also not calculated the number of actual compounds in the evaluation corpus, so the ratio of compound errors to correct compounds is unknown.
15 We would like to compare performance on this task with a state-of-the-art machine learning method, but have seen no references for this particular task to use as an unbiased baseline. However, the gold data set that was developed for evaluating our method is freely available to researchers who would like to experiment with improving on the results.

(12) *iešgudet   lágan   molssaeavttut
     different   kinds   alternative.PL
     'Different kinds of alternatives'

(13) *Láhka                rievdadusaid      birra
     law.NOM;lacquer.GEN   changing.ACC.PL   about
     'About the law alterations'

In ex. (13), there is a compound error. However, one of the central rules in the tokeniser disambiguation grammar removes the compound error reading if the first part of the potential compound is in the long genitive case form. In this case, though, láhka can be both the genitive form of láhkka 'lacquer' and the nominative form of láhka 'law'. This unpredictable lexical ambiguity had not been taken into account by the disambiguation rule, which is why there is a false negative. In the future it can be resolved by referring to the postposition birra 'about', which requires a preceding genitive.

4.2 Sentence boundary evaluation

A common method for splitting sentences in a complete pipeline (used e.g. by LanguageTool) is to tokenise first, then do sentence splitting, followed by other stages of linguistic analysis. Here a standalone tokeniser would be used, e.g. PUNKT (Kiss and Strunk, 2006), an unsupervised model that uses no linguistic analysis, or GATE,16 which uses regex-based rules. The Python package SpaCy17 on the other hand trains a supervised model that predicts sentence boundaries jointly with dependency structure. Stanford CoreNLP18 uses finite state automata to tokenise, then does sentence splitting.

In contrast, our method uses no statistical inference. We tokenise as the first step, but the tokenisation remains ambiguous until part of the linguistic analysis is complete.

Below, we make a comparison with PUNKT,19 which, although requiring no labelled training data, has been reported20 to perform quite well compared to other popular alternatives.

As with the above evaluation, we used bureaucratic parts of the SIKOR corpus. We trained the PUNKT implementation that comes with NLTK on 287,516 "words" (as counted by wc), and manually compared the differences between our system (named Divvun below) and PUNKT. We used a trivial sed script s/[.?:!] */&\n/g to create a "baseline" count of possible sentences, and ran the evaluation on the first 2500 potential sentences given by this script (as one big paragraph), counting the places where the systems either split something that should have been one sentence, or treated two sentences as one; see Table 2.

    System                       PUNKT    Divvun
    True pos.                     1932      1986
    False pos. (split mid-sent)     39        29
    True neg.                      474       484
    False neg. (joined sents)       55         1
    Precision                   98.02%    98.56%
    Recall                      97.23%    99.95%

Table 2: Sentence segmentation errors per system on 2500 possible sentences.

16 http://gate.ac.uk/ (accessed 2018-10-08)
17 https://spacy.io/ (accessed 2018-10-08)
18 http://www-nlp.stanford.edu/software/corenlp.shtml (accessed 2018-10-08)
19 https://www.nltk.org/_modules/nltk/tokenize/punkt.html (accessed 2018-10-08)
20 https://tech.grammarly.com/blog/how-to-split-sentences (accessed 2018-10-08)

Of the differences, we note that PUNKT often treats abbreviations like nr or kap. as sentence boundaries, even if followed by lower-case words or numbers (st. meld. 15 as three sentences). Our system sometimes makes this mistake too, but much more rarely. Also, PUNKT never treats colons as sentence boundaries. The colon in Sámi is used for case endings on names, e.g. Jönköping:s, but of course also as a clause or sentence boundary. Thus many of the PUNKT errors are simply cases of not marking a colon as a sentence boundary. On the other hand, our system has some errors where an unknown word led to marking the colon (or period) as a boundary. This could be fixed in our system with a simple CG rule.

There are also some odd cases of PUNKT not splitting on a period even with a following space and title-cased word, e.g. geavahanguovlluid. Rádjegeassin. While the baseline sed script creates the most sentence boundaries in our evaluation test set (2500), our system creates 2015 sentences, and PUNKT 1971.

Our system is able to distinguish sentence boundaries where the user forgot to include a space: e.g. buorrin.Vuoigatvuođat is correctly treated as a sentence boundary. This sort of situation is hard to distinguish in general without a large lexicon. Our system does make some easily fixable errors, e.g. kap.1 was treated as a sentence boundary due to a wrongly-written CG rule (as such, this evaluation has been helpful in uncovering silly mistakes). Being a rule-based system, it is easy to support new contexts when required.

5 Conclusion

We have introduced the North Sámi grammar checker, presenting its system architecture, and described its use and necessity for the North Sámi language community. Tokenisation is the first step in a grammar checker when approaching frequent spelling error types that cannot be resolved without grammatical context. We question the traditional concept of a token separated by a space, not only in terms of multiwords, but also in terms of potential compound errors. Our experiment showed that our system outperforms a state-of-the-art unsupervised sentence segmenter. Disambiguation of compound errors and other two-word combinations gives good results in terms of both precision and recall, i.e. both are above 76%. Our method of ambiguous tokenisation and ambiguity resolution by means of grammatical context allows us to improve tokenisation significantly compared to the standard approaches. The integration of the grammar checker framework in the Giella infrastructure ensures that this approach to tokenisation is directly available to all other languages using this infrastructure.

Acknowledgments

We would especially like to thank Thomas Omma for testing rules and checking examples within the above discussed modules, and our colleagues in Divvun and Giellatekno for their daily contributions to our language tools and the infrastructure.

References

Lene Antonsen. 2012. Improving feedback on L2 misspellings – an FST approach. In Proceedings of the SLTC 2012 workshop on NLP for CALL, Lund, 25th October 2012, pages 1–10. Linköping University Electronic Press, Linköpings universitet.

Antti Arppe. 2000. Developing a grammar checker for Swedish. In Proceedings of the 12th Nordic Conference of Computational Linguistics (NoDaLiDa 1999), pages 13–27, Department of Linguistics, Norwegian University of Science and Technology (NTNU), Trondheim, Norway.

Kenneth R. Beesley and Lauri Karttunen. 2003. Finite State Morphology. CSLI Studies in Computational Linguistics. CSLI Publications, Stanford.


Eckhard Bick. 2006. A constraint grammar based spellchecker for Danish with a special focus on dyslexics. In Mickael Suominen, Antti Arppe, Anu Airola, Orvokki Heinämäki, Matti Miestamo, Urho Määttä, Jussi Niemi, Kari K. Pitkänen, and Kaius Sinnemäki, editors, A Man of Measure: Festschrift in Honour of Fred Karlsson on his 60th Birthday, volume 19/2006 of Special Supplement to SKY Journal of Linguistics, pages 387–396. The Linguistic Association of Finland, Turku.

Eckhard Bick. 2015. DanProof: Pedagogical spell and grammar checking for Danish. In Proceedings of the 10th International Conference Recent Advances in Natural Language Processing (RANLP 2015), pages 55–62, Hissar, Bulgaria. INCOMA Ltd.

Eckhard Bick and Tino Didriksen. 2015. CG-3 – beyond classical Constraint Grammar. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NoDaLiDa 2015), pages 31–39. Linköping University Electronic Press, Linköpings universitet.

Børre Gaup, Sjur Moshagen, Thomas Omma, Maaren Palismaa, Tomi Pieski, and Trond Trosterud. 2006. From Xerox to Aspell: A first prototype of a North Sámi speller based on TWOL technology. In Finite-State Methods and Natural Language Processing, pages 306–307, Berlin, Heidelberg. Springer Berlin Heidelberg.

Roman Grundkiewicz and Marcin Junczys-Dowmunt. 2018. Near human-level performance in grammatical error correction with hybrid machine translation. arXiv preprint arXiv:1804.05945.

Sam Hardwick, Miikka Silfverberg, and Krister Lindén. 2015. Extracting semantic frames using hfst-pmatch. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NoDaLiDa 2015), pages 305–308.

Fred Karlsson. 1990. Constraint Grammar as a framework for parsing running text. In Proceedings of the 13th Conference on Computational Linguistics (COLING 1990), volume 3, pages 168–173, Helsinki, Finland. Association for Computational Linguistics.

Fred Karlsson, Atro Voutilainen, Juha Heikkilä, and Arto Anttila. 1995. Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter, Berlin.

Lauri Karttunen. 2011. Beyond morphology: Pattern matching with FST. In SFCM, volume 100 of Communications in Computer and Information Science, pages 1–13. Springer.

Tibor Kiss and Jan Strunk. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4):485–525.

Krister Lindén, Miikka Silfverberg, Erik Axelson, Sam Hardwick, and Tommi Pirinen. 2011. HFST – framework for compiling and applying morphologies. In Cerstin Mahlow and Michael Piotrowski, editors, Systems and Frameworks for Computational Morphology, volume 100 of Communications in Computer and Information Science, pages 67–85. Springer-Verlag, Berlin, Heidelberg.

Eva Lindgren, Kirk P. H. Sullivan, Hanna Outakoski, and Asbjørg Westum. 2016. Researching literacy development in the globalised North: studying trilingual children's English writing in Finnish, Norwegian and Swedish Sápmi. In David R. Cole and Christine Woodrow, editors, Super Dimensions in Globalisation and Education, Cultural Studies and Transdisciplinarity in Education, pages 55–68. Springer, Singapore.

Marcin Miłkowski. 2010. Developing an open-source, rule-based proofreading tool. Software: Practice and Experience, 40(7):543–566.

Sjur N. Moshagen, Tommi A. Pirinen, and Trond Trosterud. 2013. Building an open-source development infrastructure for language technology projects. In NODALIDA.

Hanna Outakoski. 2013. Davvisámegielat čálamáhtu konteaksta [The context of North Sámi literacy]. Sámi dieđalaš áigečála, 1/2015:29–59.

SIKOR. 2016. SIKOR UiT The Arctic University of Norway and the Norwegian Saami Parliament's Saami text collection. URL: http://gtweb.uit.no/korp (accessed 2016-12-08).

Gary F. Simons and Charles D. Fennig, editors. 2018. Ethnologue: Languages of the World, twenty-first edition. SIL International, Dallas, Texas.

Joel R. Tetreault, Keisuke Sakaguchi, and Courtney Napoles. 2017. JFLEG: A fluency corpus and benchmark for grammatical error correction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017), Valencia, Spain, April 3–7, 2017, Volume 2: Short Papers, pages 229–234.

Linda Wiechetek. 2012. Constraint Grammar based correction of grammatical errors for North Sámi. In Proceedings of the Workshop on Language Technology for Normalisation of Less-Resourced Languages (SALTMIL 8/AFLAT 2012), pages 35–40, Istanbul, Turkey. European Language Resources Association (ELRA).

Linda Wiechetek. 2017. When grammar can't be trusted – Valency and semantic categories in North Sámi syntactic analysis and error detection. PhD thesis, UiT The Arctic University of Norway.


Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages: Vol. 1 Papers, pages 56–63, Honolulu, Hawai‘i, USA, February 26–27, 2019.

Corpus of usage examples: What is it good for?

Timofey Arkhangelskiy
Universität Hamburg / Alexander von Humboldt Foundation

[email protected]

Abstract

Lexicography and corpus studies of grammar have a long history of fruitful interaction. For the most part, however, this has been a one-way relationship. Lexicographers have extensively used corpora to identify previously undetected word senses or to find natural usage examples; using lexicographic materials when conducting data-driven investigations of grammar, on the other hand, is hardly commonplace. In this paper, I present a Beserman Udmurt corpus made out of "artificial" dictionary examples. I argue that, although such a corpus cannot be used for certain kinds of corpus-based research, it is nevertheless a very useful tool for writing a reference grammar of a language. This is particularly important in the case of underresourced endangered varieties, which Beserman is, because of the scarcity of available corpus data. The paper describes the process of developing the Beserman usage example corpus, explores its differences compared to traditional text corpora, and discusses how those can be beneficial for grammar research.

1 Introduction

Following the widely acknowledged idea that language above the phonological level can be roughly split into lexicon and grammar, language documentation is divided into two interconnected subfields, lexicography and grammar studies. Corpora have been used since their advent in both these domains; one of the first studies of English based on the Brown corpus (Francis and Kucera, 1982) contained frequency analysis for both words and parts of speech. It was recognized early on in the history of corpora that they are an excellent source of usage examples for both dictionaries and reference grammars. This gave rise to a data-driven approach to language documentation, which prescribes using only "real" examples taken from corpora in descriptive work (Francis, 1993). Corpora have become standard providers of dictionary examples, which can even be searched for automatically (Kilgarriff et al., 2008). However, this approach can hardly be applied to underresourced endangered languages because it relies on large representative corpora, which are normally unavailable for such languages. Grammatical studies based on small spoken corpora with minimal help of additional elicitation are often possible and have been conducted, e.g. by Khanina (2017) for Enets or by Klumpp (2005) for Kamas. The same cannot be said about lexicography. While even small corpora are a valuable source of usage examples for dictionaries, the lexicographer has to elicit examples for non-frequent or obsolete entries or word senses. These examples usually stay in the dictionary and are not used for any non-lexicographic research.

I argue that such elicited usage examples can be turned into a corpus, which can actually prove to be helpful for a variety of grammar studies, especially in the absence of large "natural" corpora. An example of a feature that cannot be available in traditional corpora, but can appear in elicited examples, is negative linguistic material. The paper describes a corpus of usage examples I developed for the documentation of Beserman Udmurt. It presents the data and the methods I used to develop the corpus. After that, its frequency characteristics are compared to those of traditional spoken corpora. Finally, its benefits and disadvantages for certain kinds of grammar research are discussed.

2 The data

Beserman is classified by Kelmakov (1998) as one of the dialects of Udmurt (Uralic > Permic). It is spoken by approximately 2200 people who belong to the Beserman ethnic group, mostly in North-Western Udmurtia, Russia. Having no official orthography, it remains almost entirely spoken. This, together with the fact that transmission to children has virtually stopped, makes it an endangered variety. It differs significantly from Standard Udmurt in phonology, lexicon and grammar, which justifies the need for separate dictionaries and grammatical descriptions. The two existing grammatical descriptions, by Teplyashina (1970) and Lyukina (2008), deal primarily with phonology and morphemics, leaving out grammatical semantics and syntax.

Beserman is the object of an ongoing documentation project, whose goal is to produce a dictionary and a reference grammar, accompanied by a spoken corpus. The dictionary, which nears completion, contains approximately 5500 entries, most of which have elaborate descriptions of word senses and phraseology and are illustrated with examples. The annotated texts currently count about 103,000 words, which are split into two collections, a sound-aligned corpus (labeled further SpC) and a corpus that is not sound-aligned.1

3 Development of the corpus

The Beserman usage example corpus (labeled further ExC) currently contains about 82,000 words in 14,000 sentences. Each usage example is aligned with its translation into Russian2 and comments. The comments tier contains information about the number of speakers the example was verified with, and remarks on possible meanings or context. The main source of examples for the corpus was the dictionary; however, a small part (5% of the corpus) comes from grammatical questionnaires. The dictionary examples were collected using several different techniques. First, there are translations from Russian (often modified by the speakers in order to be more informative and better describe their culture). Second, many examples were generated by the speakers themselves, who were trained to do so by the linguists. Finally, some of the examples were produced by the linguists (none of whom is a native Beserman speaker) and subsequently checked and corrected by the speakers. The development of the corpus included filtering and converting the source data, performing automatic morphological annotation, and indexing in a corpus platform with a web interface.

1 Both the dictionary in its current version and the corpora are publicly accessible at http://beserman.ru and http://multimedia-corpus.beserman.ru/search.

2 Since the native speakers, including heritage speakers, are the primary target audience of the dictionary, and they are bilingual in Russian, the examples are translated into Russian. Translation of the dictionary into English has started recently, but no English translations of examples are available at the moment.

3.1 Data preparation

The dictionary is stored in a database created by the TLex dictionary editor, which allows export of the data to XML. The XML version of the dictionary was used as the source file for filtering and conversion.

The filtering included three steps. First, the Beserman dictionary contains examples taken from the spoken corpus. Such examples had to be filtered out to avoid duplicate representation of the same data in the two corpora. Identifying such examples can be challenging because they are not consistently marked in the dictionary database, and because some of them were slightly changed (e.g. punctuation was added or personal data was removed). To find usage examples that come from the spoken corpus, each of them was compared to the sentences of the corpus. Before the comparison, all sentences were transformed to lowercase, and all whitespace was removed. If no exact matches were found, the Levenshtein distance to the corpus sentences of comparable length was measured. The cut-off parameters were obtained empirically to achieve a balance between precision and recall. If a corpus sentence within a distance of max(2, len/8) was found, where len is the length of the example in question, the example was discarded. Additionally, the "comment" field was checked. The example was discarded if it contained an explicit indication that it had been taken from a text.
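In outline, this fuzzy-matching step might look as follows (a minimal sketch; the normalisation and the max(2, len/8) cut-off follow the description above, while the pre-filter for "comparable length" is an assumption):

    def levenshtein(a: str, b: str) -> int:
        # Standard dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def normalise(s: str) -> str:
        # Lowercase and remove all whitespace before comparison.
        return "".join(s.lower().split())

    def comes_from_corpus(example: str, corpus_sentences) -> bool:
        # Discard an example if a corpus sentence of comparable length
        # lies within the empirical cut-off max(2, len/8).
        ex = normalise(example)
        cutoff = max(2, len(ex) // 8)
        return any(levenshtein(ex, normalise(s)) <= cutoff
                   for s in corpus_sentences
                   if abs(len(normalise(s)) - len(ex)) <= cutoff)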

The second part of the filtering involved deduplication of examples. Deduplication was necessary because some usage examples appeared in multiple entries. In this case, their versions could have minor differences as well, e.g. because of typo corrections that were made in only one of them. Two sentences were considered identical if both their lowercase versions and their lowercase translations had a Damerau–Levenshtein distance not greater than len/10.

Finally, the dictionary is a work in progress. As such, it contains a number of examples that have not been proofread or have not passed sufficient verification, which means checking with three speakers, as a rule. The goal of the third step of filtering was to include in the corpus only reliable examples that do not require additional verification. This was done using a combination of comments and available metadata. The examples from the sections that have been entirely proofread for publishing were all included. Examples from other sections were only included if they had an indication in the comment that they had been checked with at least two speakers. This excluded a considerable number of examples from the verbal part of the dictionary, which is under construction now. The total share of examples that come from the verbal part is currently just above 9%, despite the fact that verbal examples in the dictionary actually outnumber examples from all other parts taken together. As a consequence, the total size of the corpus is likely to increase significantly after the verbal part has been finalized, probably reaching 150,000 words.

An important difference between a traditional corpus and a collection of usage examples is that the latter may contain negative examples, i.e. sentences which are considered ungrammatical by the speakers. About 9.5% of the sentences in the Beserman corpus of usage examples contain negative material. Following linguistic tradition, the sentences or ungrammatical parts of sentences are marked with asterisks. Other gradations of grammatical acceptability include *? (ungrammatical for most speakers) and ?? (marginal / ungrammatical for many speakers). Negative examples are identified by the converter based on the presence of such marks in their texts and by the metadata.

The resulting examples are stored as tab-delimited plain text files. Each file contains examples from one dictionary entry or one grammatical questionnaire; positive and negative examples are kept in separate files. Each example is stored on a separate line and contains the original text, its Russian translation, and comments. Additionally, all filenames are listed in a tab-delimited file together with their metadata.
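Reading such a file back is straightforward; a minimal sketch (the three-column layout follows the description above; anything beyond it, such as the handling of missing comments, is an assumption):

    import csv

    def read_examples(path):
        # Yield (text, russian_translation, comments) triples from one
        # tab-delimited example file, one example per line.
        with open(path, encoding="utf-8", newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                if not row:
                    continue
                text, translation = row[0], row[1]
                comments = row[2] if len(row) > 2 else ""
                yield text, translation, comments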

3.2 Morphological annotation

A workflow used in the Beserman documentation project prior to 2017 involved transcription of recordings for the spoken corpus in a simple text editor and manual annotation in SIL FLEX. That workflow was abandoned, mainly because manual annotation required too many resources, given the amount of recorded data that had to be processed. The new workflow comprises transcription, translation and alignment of recordings in ELAN, with subsequent automatic morphological annotation and automatic disambiguation. Such an approach has been demonstrated to be well suited for processing spoken corpora of comparable size by Gerstenberger et al. (2017).

Developing a rule-based morphological analyzer would be a time-consuming task in itself, which is why Waldenfels et al. (2014) and Gerstenberger et al. (2017) advocate transcribing in standard orthography and applying an analyzer for the standardized variety of the language. The Beserman case is different, though, because there already exists a digital dictionary. Using the dictionary XML as a source, I was able to produce a grammatical dictionary with descriptions of lemmata and inflection types. The dictionary was manually enhanced with additional information relevant for disambiguation, such as animacy for nouns or transitivity for verbs. Apart from that, several dozen frequent Russian borrowings absent from the dictionary were added to the list. Coupled with a formalized description of morphology, which I compiled manually, it became the basis for the automatic Beserman morphological analyzer. This analyzer is used for processing both new transcriptions in ELAN and usage examples.

After applying the analyzer to the texts, each token is annotated with a lemma, a part of speech, additional dictionary characteristics, and all inflectional morphological features, such as case and number. Annotated texts are stored in JSON. If several analyses are possible for a token, they are all stored. The resulting ambiguity is then reduced from 1.7 to 1.35 analyses per analyzed token by a set of 87 Constraint Grammar rules (Bick and Didriksen, 2015). The idea was to apply only those rules which demonstrate near-absolute precision (at least 98%), while leaving more complex ambiguity cases untouched. The resulting coverage of the analyzer is 96.3% on the corpus of usage examples. While this number might look unusually high, it is in fact quite expected, given that there should be a dictionary entry for any non-borrowed word that occurs in any dictionary example.
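Both the ambiguity rate and the coverage are easy to recompute from the annotated JSON; a minimal sketch, assuming each token is a dict whose "ana" key holds its list of analyses (the field name is illustrative, not necessarily the actual schema):

    def mean_ambiguity(tokens):
        # Average number of analyses per analysed token
        # (e.g. 1.7 before and 1.35 after CG disambiguation).
        analysed = [t for t in tokens if t.get("ana")]
        return sum(len(t["ana"]) for t in analysed) / len(analysed)

    def coverage(tokens):
        # Share of word tokens that received at least one analysis.
        return sum(1 for t in tokens if t.get("ana")) / len(tokens)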


3.3 Indexing

The annotated texts are uploaded to a server and indexed, after which they are available for search through a web interface using the open-source Tsakorpus corpus platform.3 The interface allows simple and multiword queries. The queries may include constraints on word, lemma, part of speech, morphological tags, glosses, or any combination of these. In word and lemma search, wildcards and regular expressions are available. The search can produce a list of sentences or a list of words that conform to the query, together with some statistical data. The Russian translations are also morphologically analyzed and searchable. The interface is available in English and Russian, and several transliteration options are available for displaying the texts.

4 Differences from spoken corpora

For comparison, I will use the sound-aligned part of the Beserman spoken corpus (SpC), which was morphologically annotated and indexed the same way the corpus of usage examples (ExC) was processed. At the moment, it contains about 38,000 words in monologues, natural dialogues and dialogues recorded during referential communication experiments. To account for the difference in size, all frequencies will be presented in items per million (ipm).

Two obvious factors make ExC quite different from SpC frequency-wise. First, the former contains comparable amounts of usage examples for both frequent and non-frequent words. As a consequence, it has a different frequency distribution of words and lemmata. Its lexical diversity, measured as overall type/token ratio, is visibly higher than that of SpC (0.224 vs. 0.179), despite the fact that it is more than twice as large.
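Both normalisations used in this section are one-liners; a sketch (tokenisation details glossed over, numbers in the comment purely illustrative):

    def type_token_ratio(tokens):
        # Distinct wordform types over running word tokens.
        return len(set(tokens)) / len(tokens)

    def ipm(count, corpus_size):
        # Normalise an absolute frequency to items per million.
        return count * 1_000_000 / corpus_size

    # e.g. ipm(70, 40_000) -> 1750.0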

Second, elicited examples constitute a genre very different from natural narratives and dialogues. For example, they contain a much smaller number of discourse particles or personal pronouns. Table 1 demonstrates how different the frequencies of certain lexical classes are in the two corpora.

Additionally, there are more different forms attested for a single lemma on average in ExC than in SpC (Table 2). Although the unequal size of the corpora being compared could play a role here, the difference can be explained at least in part by the fact that the lexicographers had a goal of providing each word with a handful of examples containing that word in different forms and in different syntactic positions.

3 https://bitbucket.org/tsakorpus/tsakonian_corpus_platform/src

    POS / Lexical class    SpC     ExC
    noun                   206K    400K
    verb                   213K    289K
    adjective               52K     68K
    pronoun                177K    123K
    discourse ptcl.         72K     22K

Table 1: Frequencies (in ipm) of certain parts of speech and lexical classes in SpC and ExC.

    POS         SpC    ExC
    noun        3.15   4.44
    verb        4.75   6.08
    adjective   2.16   2.56
    pronoun     2.72   2.92

Table 2: Average number of different forms per lemma for certain parts of speech in SpC and ExC.

However, the analysis of value distributions of individual grammatical categories within a given part of speech reveals that they are usually not drastically different in the two corpora. Let us take nouns as an example. Nominal categories in Udmurt are case, number and possessiveness. Table 3 shows the distribution of case forms in the two corpora (all 8 spatial cases were collated in the last row).4 Table 4 shows the distribution of number forms. Table 5 shows the distribution of possessive suffixes.5

4 The numbers in each column add up to slightly more than 100% because of the remaining ambiguity.

5 Nouns may occur without possessive marking; only nouns marked for possessiveness were included. Figures for 2pl were verified manually because of large-scale ambiguity between nom,2pl and pl,acc.

    case                     SpC      ExC
    nominative / unmarked    60%      65.7%
    accusative (marked)      4.7%     9.9%
    dative                   2%       1.2%
    genitive                 1.8%     2.2%
    2nd genitive (ablative)  0.74%    1.6%
    instrumental             7%       4.7%
    caritive                 0.025%   0.024%
    adverbial                0.25%    0.13%
    all spatial cases        24.6%    20%

Table 3: Share of different case forms for nouns in SpC and ExC.

    number   SpC     ExC
    sg       94.6%   92.7%
    pl       5.4%    7.3%

Table 4: Share of different number forms for nouns in SpC and ExC.

    possessor   SpC     ExC
    1sg         13.2%   28.2%
    1pl         1.7%    5.6%
    2sg         13.5%   7.7%
    2pl         0.2%    0.4%
    3sg         63%     54.8%
    3pl         8.4%    3.3%

Table 5: Share of different possessive forms for possessive-marked nouns in SpC and ExC.

Case and number distributions have only minor differences in SpC and ExC. Moreover, an analysis of combinations of case and number suffixes shows that the distributions of their combinations also look very much alike. Possessiveness presents a somewhat different picture. Although not entirely different from SpC, the distribution in ExC shows lower figures for 2sg, 3pl, and especially 3sg, and higher ones for the first person possessors. The lower numbers for 3sg and 2sg have a straightforward explanation. Apart from being used in the direct, possessive sense, these particular suffixes have a range of "discourse", non-possessive functions. This is also true for Standard Udmurt (Winkler, 2001) and other Uralic languages (Simonenko, 2014). The 3sg possessive marks, among other things, contrastive topics and semi-activated topics that have to be reactivated in the discourse. The 2sg possessive is also used with non-possessive meanings, although less often, only in dialogues and in other contexts than the 3sg. Its primary discourse function is marking a new referent that is located physically or metaphorically in the domain of the addressee. Example (1), taken from a dialogue, shows both suffixes in non-possessive functions:

(1) Vaj        so-ize=no           gozə-de!
    bring.IMP  that-P.3SG.ACC=ADD  rope-P.2SG.ACC
    'Bring me that rope as well!'

The 3sg possessive on 'that' marks contrast: another rope was discussed earlier in the conversation. The 2sg possessive on 'rope' indicates that this particular rope has not been mentioned in the previous discourse and should be identified by the addressee. The rope is located next to the addressee. That the direct possessive sense is ruled out here follows from the fact that the whole dialogue happens on the speaker's property, so the rope belongs to him, rather than to the addressee. Such "discourse" use of the possessives in elicited examples is quite rare because they normally require a wider context to appear. A possible explanation for the lower 3pl figure in ExC is that it is often used when the plural possessor was mentioned earlier and is recoverable from the context.

    tense           SpC     ExC
    present         43.8%   39.4%
    past (direct)   30%     43.2%
    future          26.2%   17.4%

Table 6: Share of different tense forms for finite verbs in SpC and ExC.

A quick look at the distribution of verbal tenses in Table 6 shows a picture similar to that of the possessives.6 The distributions are not wildly different, but the past tense is clearly more frequent in ExC. This is expected because a lot of usage examples, especially for obsolete words, contain information about cultural practices connected to the item being described that are no longer followed. In this respect, the tense distribution in ExC resembles the one in the narratives, which usually describe past events.

5 Fitness for grammar studies

The way a corpus of usage examples is different from traditional corpora makes it unfit for some kinds of linguistic research. It cannot be used in any study that looks into long-range discourse effects, or that requires a context or a conversation involving several participants. This includes, for example, studies of discourse particles, use of anaphoric pronouns, or information structure (topic/comment; given/new information, etc.). Similarly, it is useless for studies that rely on word or lemma frequency counts, or on type/token ratios, as those may differ significantly from traditional corpora.

6 Only finite forms were counted. The fourth tense, the second (evidential) past, was not included because most of its forms are ambiguous with a much more frequent nominalization, to which it is historically related.

All the downsides listed above are quite predictable; in fact, it was hardly necessary to build a corpus of usage examples to arrive at those conclusions. Whether such a corpus could be of any value in other kinds of linguistic research is a less trivial question. The observations in Section 4 suggest that the answer to that question is positive. While the Beserman corpus of usage examples differs from the spoken corpus in the places where it could be predicted to differ, it is remarkable how statistically similar the two corpora are in all other respects. The relative frequencies of grammatical forms are only different (although not extremely different) if these forms convey deictic or discourse-related meanings. In other cases, the forms have very similar distributions. This means that corpora of examples can in principle be used in linguistic research that involves comparing frequencies of certain grammatical forms or constructions, with the necessary precautions.

Only research that involves comparing frequency counts or distributions of linguistic phenomena has been discussed so far. A lot of grammar studies, however, only or primarily need information about the (un)grammaticality or productivity of a certain phenomenon. It turns out that a corpus of usage examples can actually be superior to a traditional corpus of a comparable size for such studies. The reason for that is its higher lexical diversity and more uniform frequency distribution of lemmata. This means that for any form or construction being studied, the researcher will on average see more different contexts involving it than in a traditional corpus.

Let us take the approximative case as an example. This case, which has the marker -lan in all Udmurt varieties where it exists, marks the Ground in the approximate direction of which the Figure is moving. Although it is claimed to exist in literary Udmurt by all existing grammars, corpus studies reveal that it only functions as an unproductive derivational suffix compatible with a handful of nominal, pronominal and adjectival stems. To learn whether it can be considered a productive case suffix in Beserman, its compatibility with a wider range of nouns should be established.

The approximative has similar relative frequencies in ExC and SpC: 1858 ipm and 1750 ipm, respectively. In both corpora, the approximative was not a primary focus of investigation. The number of different contexts it is encountered in, however, is much higher in ExC. In SpC, there are 16 different types that contain an approximative suffix, featuring 10 different stems. Only one of those stems (ulca 'street') does not attach the derivational approximative suffix in Standard Udmurt. This is definitely insufficient to establish its productiveness in Beserman. ExC, however, contains 32 different types that belong to 25 different stems. Such a difference can hardly be ascribed to the larger size of ExC, because the number of different types is expected to have slower-than-linear growth with respect to the corpus size, and the type/stem ratio is expected to go up rather than down. Out of these stems, at least 5 are incompatible with the approximative in Standard Udmurt, including reka 'river', korka 'house' and šundə 'sun'.7 Another 5 come from negative examples that highlight the incompatibility of the approximative with certain inflected postpositions (relational nouns). All this proves that it is most probably a productive suffix, while outlining the limits of its productivity.
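Counts of this kind can be pulled directly from the annotated corpus; a minimal sketch, assuming the analysed tokens are available as (wordform, lemma, case) triples and that the approximative carries a case tag such as "appr" (the tag label is an assumption):

    def type_and_stem_counts(tokens, case_tag="appr"):
        # Count distinct wordform types and distinct stems (lemmata)
        # among tokens bearing the given case tag.
        types, stems = set(), set()
        for wordform, lemma, case in tokens:
            if case == case_tag:
                types.add(wordform.lower())
                stems.add(lemma)
        return len(types), len(stems)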

Comparison of the figures for the recessive suffix, which also was not the focus of investigation in either corpus, yields similar results. The recessive case, with the marker -lasen, is the semantic opposite of the approximative and does not exist in the literary language even as a derivational suffix. In SpC, there are 7 different types that contain it. All of them have different lemmata, but again, only one of the types (kətlasen 'from the side of the belly') suggests that they might not constitute a closed set of denominal adverbs. By contrast, ExC has 26 different types, containing 26 different stems, 4 of which come from negative examples.

The skewed distribution of parts of speech can be beneficial too. For example, there is a construction in Udmurt that involves juxtaposition of an unmarked nominal stem to another noun, e.g. t'ir nəd 'axe handle'. It exists in other Uralic languages and has been analyzed as compounding by Fejes (2005). However, its productivity, the allowed possessive relations and the constraints on the dependent noun (animacy, referential status, etc.) vary significantly across Uralic languages and dialects. A detailed investigation of that construction would therefore require looking at a large number of instances featuring a variety of dependent nouns. A search for such a construction yields 77 different dependent nouns in SpC (total frequency 7525 ipm) and 449 different nouns in ExC (total frequency 25,132 ipm).8 In this case, the tremendous increase in numbers stems from the increased share of nouns in ExC, rather than from increased lexical diversity. Nouns are almost twice as likely to appear in ExC as in SpC (400K vs. 206K ipm, cf. Table 1), so the probability of meeting a sequence of two nouns would be almost 4 times higher (1.94² ≈ 3.8) if words were generated independently. The independence condition does not hold for words in a sentence, of course, but the observed factor of 3.34 is still large enough to make the corpus of examples a more attractive tool for studying the juxtaposition construction.

7 The 65,000-word FLEX part of the spoken corpus, not counted here, contains 21 different stems. However, this is only because it includes transcriptions of referential communication experiments specifically designed to make the participants use spatial cases and postpositions with a variety of nouns.

Research questions that arise when writing a reference grammar mostly belong to the type discussed above. A concise description of the morphological and syntactic properties of an affix or a construction mostly requires looking at a range of diverse examples to determine their productivity and constraints. The three case studies above show that this is exactly the situation where a corpus of usage examples can outperform a similarly sized spoken corpus.

Apart from the aforementioned considerations, which would probably be valid for a corpus based on any comprehensive dictionary, there are specific features of the Beserman usage example corpus that may become beneficial for research. Most importantly, it contains elicited examples from grammatical questionnaires, both positive and negative. Although their share is too small to reach any conclusions about their usefulness at the moment, this could be a first step towards the reuse of elicited data in subsequent research. Currently, data collected through questionnaires is virtually always used only once. At best, the examples are archived to support the analysis based on them and to ensure that the findings are reproducible; at worst, they are discarded and forgotten. Of course, each questionnaire is tailored to the particular research question of its author. However, it is probable that if the corpus is large enough, it will contain examples that could prove useful for research questions other than the one their author had in mind. The presence of negative material could partially reduce the general aversion to corpora that linguists working in the generative paradigm tend to have. The availability of such corpora for multiple languages will also facilitate typologically oriented research on syntax, which otherwise relies on manual work with reference grammars and other secondary sources.

8 To avoid as much ambiguity as possible without manual filtering, the search was narrowed down to unambiguously annotated tokens.

Finally, the Beserman corpus of usage examples, as well as any corpus based on a bilingual dictionary, is in essence a parallel corpus. This could be used both for linguistic purposes (see e.g. Volk et al. (2014) for a list of possibilities) and for developing or evaluating machine translation systems.

6 Conclusion

Having a comprehensive dictionary in a machine-readable form and a morphological analyzer allows one to create a corpus of usage examples rather quickly. Its size can be comparable to those of large spoken corpora, which tend to have a maximum of 150,000 tokens for unwritten endangered languages. Even if a spoken corpus is available, this is a significant addition. Sometimes, however, dictionary examples could be the only large source of linguistic data, e.g. in the case of extinct languages. It is therefore important to know how the data of usage examples corresponds to that of spoken texts, and for what kind of linguistic research they are suitable. An analysis of the Beserman corpus of usage examples reveals that it can actually be used as a reliable tool in a wide range of research on grammar. Obvious limitations prevent its use in studies that involve word and lemma counts, discourse or information structure. However, outside of these areas, usage examples are quite similar to natural texts, which justifies the use of such corpora. Increased lexical diversity and more uniform word distributions make corpora of usage examples even more useful for some kinds of research than traditional corpora of similar size. Finally, such corpora can additionally contain data from questionnaires and negative material, which could facilitate their reuse.

Acknowledgements

I am grateful to all my colleagues who have worked on the documentation of Beserman Udmurt, and especially to Maria Usacheva. This research would not have been possible without their prolonged lexicographic work.

References

Eckhard Bick and Tino Didriksen. 2015. CG-3 – beyond classical Constraint Grammar. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NoDaLiDa 2015), pages 31–39. Linköping University Electronic Press.

László Fejes. 2005. Összetett szavak finnugor nyelvekben [Compound words in Finno-Ugric languages]. PhD thesis, Eötvös Loránd Tudományegyetem, Budapest.

Gill Francis. 1993. A corpus-driven approach to grammar: Principles, methods and examples. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and Technology: In Honour of John Sinclair, pages 137–156. John Benjamins, Amsterdam & Philadelphia.

W. Nelson Francis and Henry Kucera. 1982. Frequency Analysis of English Usage: Lexicon and Grammar. Houghton Mifflin, Boston.

Ciprian Gerstenberger, Niko Partanen, and Michael Rießler. 2017. Instant annotations in ELAN corpora of spoken and written Komi, an endangered language of the Barents Sea region. In Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 57–66. Association for Computational Linguistics.

Valentin Kelmakov. 1998. Kratkij kurs udmurtskoj dialektologii [A brief course of Udmurt dialectology]. Izdatel'stvo Udmurtskogo Universiteta, Izhevsk.

Olesya Khanina. 2017. Digital resources for Enets: A descriptive linguist's view. Acta Linguistica Academica, 64(3):417–433.

Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, and Pavel Rychlý. 2008. GDEX: Automatically finding good dictionary examples in a corpus. In Proceedings of the XIII EURALEX International Congress, pages 425–432, Barcelona.

Gerson Klumpp. 2005. Aspect markers grammaticalized from verbs in Kamas. Acta Linguistica Hungarica, 52(4):397–409.

Nadezhda Lyukina. 2008. Osobennosti jazyka balezinskix i jukamenskix besermjan [The peculiarities of the language of the Balezino and Yukamenskoye Besermans]. PhD thesis, Udmurt State University, Izhevsk.

Aleksandra Simonenko. 2014. Microvariation in Finno-Ugric possessive markers. In Proceedings of the Forty-Third Annual Meeting of the North East Linguistic Society (NELS 43), volume 2, pages 127–140.

Tamara Teplyashina. 1970. Jazyk besermjan [The language of the Besermans]. Nauka, Moscow.

Martin Volk, Johannes Graën, and Elena Callegaro. 2014. Innovations in parallel corpus search tools. In LREC 2014 Proceedings, pages 3172–3178.

Ruprecht von Waldenfels, Michael Daniel, and Nina Dobrushina. 2014. Why standard orthography? Building the Ustya River Basin corpus, an online corpus of a Russian dialect. In Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue", volume 13, pages 720–728.

Eberhard Winkler. 2001. Udmurt. Number 212 in Languages of the World. Lincom Europa, Munich.


Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages: Vol. 1 Papers, pages 64–73, Honolulu, Hawai‘i, USA, February 26–27, 2019.

A Preliminary Plains Cree Speech Synthesizer

Atticus Harrigan
Timothy Mills
Antti Arppe

Alberta Language Technology Laboratory
University of Alberta
Edmonton, Alberta

Abstract

This paper discusses the development and evaluation of a speech synthesizer for Plains Cree, an Algonquian language of North America. Synthesis is achieved using Simple4All, and evaluation was performed using a modified Cluster Identification task, Semantically Unpredictable Sentences, and a basic dichotomized judgment task. The resulting synthesis was not well received; however, observations regarding the process of speech synthesis evaluation in North American indigenous communities were made: chiefly, that tolerance for variation is often much lower in these communities than for majority languages. The evaluator did not recognize grammatically consistent but semantically nonsense strings as licit language. As a result, monosyllabic clusters and semantically unpredictable sentences proved not to be the most appropriate evaluation tools. Alternative evaluation methods are discussed.

1 Introduction

While majority languages such as English provide ample data for the creation and training of speech recognition, corpus annotation, and general language models, under-resourced languages are in a unique position to benefit greatly from such technologies. Speech synthesizers, mobile keyboards, in-browser reading guides and smart dictionaries all provide invaluable tools to help aid language learners in gaining proficiency in languages where speaker numbers are falling. With the ubiquity of speech synthesis systems in public transit, emergency broadcast systems, and (most notably) mobile phones, under-resourced language communities often clamour for such technology. In addition to the positive social implications of having your language associated with technological innovation, speech synthesis systems provide a very real benefit in endangered and under-resourced language communities: while it is unfeasible for elders and remaining fluent speakers to detail every possible word and story, an ideal speech synthesizer allows a learner to hear, on demand, any word, phrase, sentence, or passage. Despite this obvious benefit, endangered language groups rarely have such technology available, especially in the context of North American Indigenous languages.

This paper details the development of an early synthesizer for Plains Cree, an Indigenous language of Canada, and an evaluation of the resulting system. Through synthesis via the Simple4All suite of programs, this paper documents the pros and cons of such a system, and investigates how a speech synthesizer can be evaluated in the context of North American Indigenous languages.

2 Plains Cree Phonology

Plains Cree is a polysynthetic language of the Algonquian family spoken mainly in western Canada. With a recorded speaker count of nearly 35,000,1 Plains Cree is classified as a stage 5 (or "developing") language according to the Expanded Intergenerational Transmission Disruption Scale (Ethnologue, 2016).

Much of the literature on the language has focused on its morphosyntax. The phonetics and phonology have been less well described. Indeed, only five of the 50 pages of Wolfart's (1996) sketch of the language deal with what can be categorized as phonetics or phonology, and only four pages of his earlier, more comprehensive grammar were dedicated to the topic (Wolfart, 1973). In the former, which is the more phonologically inclined, Wolfart identifies the phoneme inventory of Plains Cree as containing eight consonants, two semi-vowels/glides, and seven vowels (three short and four long).2

1 Although this is the currently cited number of speakers, it is likely an optimistic count. The true number is likely several thousand speakers lower, though reliable demographics are difficult to obtain.

Although Plains Cree vowels are distinguished by length, the long vowels are also qualitatively different from the short vowels, with the short vowels being less peripheral than their long counterparts (Harrigan and Tucker, 2015; Muehlbauer, 2012). Further, the issue of quality is exacerbated by the language's system of stress, which is poorly understood. Consonants show similar, though better understood, variation. Stops are generally voiceless word-initially, geminated between two short vowels, and show free variation between voiced and voiceless realization otherwise (Wolfart, 1996). The affricate /ts/ and fricative /s/ appear in seemingly free variation with [tʃ] and [ʃ], respectively (Wolfart, 1996). Plains Cree also possesses diphthongs, though only as allophonic realizations of vowels adjacent to glides. When either long or short /a/, /o/, or /e/ occurs before /j/, the diphthongs [aɪ], [oɪ], [eɪ] occur (respectively). When these come before /w/, the diphthongs [aʊ], [oʊ], and [eʊ] are realized (respectively). When short or long /i/ occurs before /j/, it is lengthened; when short /i/ comes before /w/, [o] is realized, while long /iː/ in the same position produces the diphthong [iʊ].

Possible syllables are described by Wolfart (1996, 431) as having an optional onset composed of a consonant (optionally followed by /w/ if the consonant is a stop, an affricate, or /m/), a vowel nucleus, and an optional coda composed of a single consonant or of /s/ or /h/ followed by a stop/affricate. According to Wolfart (1996, 431), the vast majority of Cree syllables have no coda.

Plains Cree has a standardized orthography referred to as the Standard Roman Orthography (SRO). The standard is best codified by Okimasis and Wolvengrey (2008) and makes use of a phonemic representation, ignoring allophonic variation such as the stop voicing described above. The standard also offers morphophonological information; for example, although the third-person singular apiw ('s/he sits') is pronounced /apo/, it is not written as such, so as to allow the reader to apply morphological rules such as inflecting for the imperative by removing the third-person morpheme, {-w}, and adding {-tân}. Were the third-person form written <apow> or <apo>, one might incorrectly expect the imperative form to be /apotaːn/, rather than the correct /apɪtaːn/. Vowels in the SRO are marked for length by using either a circumflex or a macron over the vowel, and <e> is always to be written as long. When diphthongs are allophonically present, the underlying phonemes are used, as in the example of [apo] being written as <apiw>. The orthography is mostly shallow: the eight consonants (/p, t, k, ts, s, h, m, n/) are represented by single graphemes (<p, t, k, c, s, h, m, n>); short vowels (/a, i, o/) are written without any diacritic (<a, i, o>), while long vowels (/aː, iː, oː, eː/) are written with a circumflex or macron (<â, î, ô, ê>).

2 There is only one /e/ phoneme in the language and it is generally considered long for historical reasons. Initially, Plains Cree did have a short /e/, but this eventually converged with long /i/ (Bloomfield, 1946, 1).
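Given how shallow the orthography is, a first-pass grapheme-to-phoneme mapping can be read straight off this description; a minimal sketch (it deliberately ignores the allophony and morphophonology discussed above, and treats both circumflex and macron spellings as long vowels):

    # First-pass SRO grapheme-to-phoneme table: <c> is /ts/, <y> is /j/,
    # circumflex and macron vowels are long.
    SRO2PHON = {
        "â": "aː", "ā": "aː", "î": "iː", "ī": "iː",
        "ô": "oː", "ō": "oː", "ê": "eː", "ē": "eː",
        "a": "a", "i": "i", "o": "o",
        "p": "p", "t": "t", "k": "k", "c": "ts",
        "s": "s", "h": "h", "m": "m", "n": "n",
        "w": "w", "y": "j",
    }

    def sro_to_phonemes(word: str) -> list:
        # Phonemic, not phonetic: allophony such as apiw -> [apo]
        # is intentionally not modelled here.
        return [SRO2PHON[ch] for ch in word.lower() if ch in SRO2PHON]

    print(sro_to_phonemes("apiw"))  # ['a', 'p', 'i', 'w']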

While this standard has been codified in Okimasis and Wolvengrey (2008), it is not universally (or even largely) adopted. While major publications such as Masuskapoe (2010), Minde (1997), or those put out by provincial/federal departments are written in the SRO, many publications and communications (especially informal ones) use nonstandard conventions. In the Maskwacîs Dictionary of Cree Words (1997), for example, <ch> is used in place of <c>, <h> is used (in addition to its regular phonemic representation) essentially to impart that a vowel is different from the expected English pronunciation, and vowel length is not identified. The Alberta Elders' Cree Dictionary (LeClaire et al., 1998) does mark vowel length, though <e> is always written without length marking.

These orthographic variations, as well as the paucity of data, provide challenges for language technology.

3 Speech Synthesis

Broadly speaking, speech synthesis comes in two main forms: parametric and concatenative. Parametric synthesis aims to recreate the particular rules and restrictions that exist to manipulate a sound wave as in speech (Jurafsky and Martin, 2009, 249). Concatenative synthesis, rather than focusing on recreating the parameters that produce speech, concerns itself with stitching together pre-existing segments to create a desired utterance (Jurafsky and Martin, 2009, 250).

The contemporary focus of speech synthesis is so-called text-to-speech (TTS) (Jurafsky and Martin, 2009, 249), wherein input text is transformed into sound. In this process, text is phonemically represented and then synthesized into a waveform (Jurafsky and Martin, 2009, 250). Phonemic representation is accomplished through various means: dictionaries of pronunciation offer a transcription of common words and sometimes even names, though these are almost guaranteed not to contain every single word a synthesizer could expect to encounter (Jurafsky and Martin, 2009). As a result, other methods such as "letter-to-sound" rules (which aim to provide a phonemic representation for a grapheme given a context) have also been employed (Black et al., 1998). Other units of representation have also been considered, such as those smaller than the phoneme (Prahallad et al., 2006). In general, most contemporary TTS systems generate a phonemic representation through machine learning (Jurafsky and Martin, 2009, 260). Black et al. (1998), for example, instituted a system wherein a computer is fed a set of graphemes and a set of allowable pronunciations for each of those graphemes, along with the probabilities of a particular grapheme-to-phoneme pairing. More recent techniques such as those described by Mamiya et al. (2013) instead take a subset of speech and text data that are manually aligned and, using machine learning algorithms, learn the most likely phonemic representation of each grapheme given the context. In any case, grapheme-to-phoneme alignment results in a system that is able to produce sound for a given text sequence. A speech synthesis system will also provide intonational information to best reflect naturalistic speech (Jurafsky and Martin, 2009, 264).

While there have been speech synthesis efforts for minority languages (Duddington et al., n.d.), little focus has been paid to North American languages. Although there appears to be a Mohawk voice available for the eSpeak open-source software project (Duddington et al., n.d.), the only published account of an Indigenous language synthesizer seems to be a 1997 technical report detailing a basic fully-concatenative synthesizer for the Navajo language (Whitman et al., 1997). In this instance, the authors compiled a list of all possible diphones (two-phoneme pairs) and had a Navajo speaker read these in a list (Whitman et al., 1997, 4). These diphones were then manually segmented, concatenated, and adjusted for tone (as Navajo is a tonal language) (Whitman et al., 1997). According to the authors, although the system was small and lacked much of the data one might prefer when building a speech synthesizer, the concatenative method they used managed to produce an intelligible synthesizer (Whitman et al., 1997, 13). Other than this effort, it appears that speech synthesis for North American languages has been largely non-existent. The reason for this is likely the lack of resources in these languages. Languages of North America may lack even a grammar, though many will have a variety of recordings of important stories or conversations (Arppe et al., 2016). Few languages of North America have a standard and well-established written tradition. As a result, speech synthesis development is necessarily difficult for these languages. Plains Cree, as one of the most widely spoken indigenous languages of North America with roughly 20,000 speakers (Harrigan et al., 2017), provides relatively large amounts of standardized text for a North American language (Arppe et al., forthc.). The sources range from biblical texts (Mason and Mason, 2000), to narratives (Vandall and Douquette, 1987), and even interviews between fluent speakers (Masuskapoe, 2010). The texts used for TTS synthesis in this paper are described in Section 4.

4 Materials

4.1 Training Data

This study uses two varieties of materials: training data and a TTS toolchain. The training data comes in the form of the biblical texts of Psalms from Canadian Bible Society (2005). These texts, narrated by Dolores Sand, are accompanied by transcriptions in the SRO (Canadian Bible Society, 2005). The audio files are uncompressed stereo audio with a 16-bit depth and a 44,100 Hz sampling frequency. Not all Psalms were available as audio recordings, though 52 files totalling 2 hours, 24 minutes and 50 seconds in length were available as training data. Because the toolchain discussed below requires files with only a single channel as input, the Psalm recordings were converted into mono-channel files using the SoX utility (Bagwell, 1998–2013). Finally, the computation was completed on a virtual server with an Intel 2.67 GHz Xeon X5650 processor and 16 GB of RAM.
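The channel conversion is a one-line SoX invocation per file; the following is a minimal sketch of batching it in Python, assuming hypothetical psalms/stereo and psalms/mono directories (the paper does not specify the actual file layout):

import pathlib
import subprocess

# Hypothetical layout: stereo sources in psalms/stereo, mono output
# in psalms/mono; adjust to the real corpus organisation.
SRC = pathlib.Path("psalms/stereo")
DST = pathlib.Path("psalms/mono")
DST.mkdir(parents=True, exist_ok=True)

for wav in sorted(SRC.glob("*.wav")):
    # "sox in.wav -c 1 out.wav" mixes the two channels down to one.
    subprocess.run(["sox", str(wav), "-c", "1", str(DST / wav.name)],
                   check=True)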


4.2 Tool Chain

The Simple4All project (Simple4All, 2011–2014) aims to develop lightly supervised or unsupervised speech recognition and synthesis. Two of the major outputs of the project are ALISA, a lightly supervised alignment system (Simple4All, 2014), and Ossian, a front-end synthesizer (Simple4All, 2013–2014). ALISA aligns based on ten minutes of pre-training data which has been manually segmented (Stan et al., 2013). The system learns from these data and then attempts to align the rest of the training data and text transcript provided to it (Stan et al., 2013) (as detailed later, ALISA alignment was not particularly successful, and so hand alignment was conducted). Alignment results in a set of utterance sound files and text files with the respective orthographic representation. These files are then fed directly into Ossian, which uses these data to train itself and produce a synthesized voice. Ossian itself is a 'collection of Python code for building text-to-speech (TTS) systems, with an emphasis on easing research into building TTS systems with minimal expert supervision' (Simple4All, 2013–2014).
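As a sketch of the handoff between the two tools, the aligner's output can be sanity-checked before training by confirming that every utterance audio file has a matching transcript; the directory names below are hypothetical:

import pathlib

# Hypothetical ALISA-style output: per-utterance audio in aligned/wav,
# matching transcripts in aligned/txt.
wavs = {p.stem for p in pathlib.Path("aligned/wav").glob("*.wav")}
txts = {p.stem for p in pathlib.Path("aligned/txt").glob("*.txt")}

# Ossian needs both halves of each utterance; report any strays.
for stem in sorted(wavs ^ txts):
    print("unpaired utterance:", stem)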

5 Evaluation

Two iterations of the synthesizer were created. The first iteration was built on alignment from ALISA. In order to evaluate the synthesized voice, a combination of common metrics was used. Both functional (i.e., intelligibility of the system (Hinterleitner, 2017, 24)) and judgment (i.e., how pleasant the system is to use) tests were implemented. Ideally, at least a dozen participants would be used, but due to the limited number of people who speak and are literate in the language, and being mindful of the need to not bias speakers for future non-pilot level tests, one speaker was deemed appropriate for each iteration of the synthesis.³

For functional level analysis, a modification of the Cluster Identification Task (Jekosch, 1992) was used. In this modified task, basic V, CV, and CVC syllables (not words) were randomly presented (at least twice, but as often as the participant requested), after which the participant was asked to write down what they heard.

³The second iteration actually contained two evaluations by the same participant, due to the initial evaluation tasks containing non-standard spellings in the Semantically Unpredictable Sentence and Judgment tasks. Only these tasks were re-administered (with new stimuli verified for orthographic consistency). The first evaluation of the second synthesizer was substantially similar to the second evaluation and so will not be detailed in this paper.

Although complex onsets and codas exist in Plains Cree, they are certainly less frequent, as discussed above. Although each possible syllable would ideally be presented at least once, the number of evaluations would total nearly 2000 items, a task too onerous for a single session and participant. In order to acclimate the participant, three test scenarios using English syllables from a native English speaker (/skwi/, /cle/, and /ram/) were run.
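A minimal sketch of how such stimuli can be sampled is given below; the phoneme inventories are simplified and assumed for illustration, and do not reproduce the study's actual stimulus list:

import random

# Assumed, simplified SRO-style inventories; long vowels are written
# doubled here purely as an illustrative convention.
ONSETS = list("ptcksmnwyh")
VOWELS = ["a", "i", "o", "e", "aa", "ii", "oo"]
CODAS = list("ptcksmnh")

def make_stimulus(shape):
    # Build one V, CV, or CVC syllable stimulus.
    if shape == "V":
        return random.choice(VOWELS)
    if shape == "CV":
        return random.choice(ONSETS) + random.choice(VOWELS)
    return random.choice(ONSETS) + random.choice(VOWELS) + random.choice(CODAS)

random.seed(0)
stimuli = [make_stimulus(random.choice(["V", "CV", "CVC"]))
           for _ in range(70)]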

In addition to the Cluster Identification Task, the participant was asked to take part in a modified Semantically Unpredictable Sentence task (Benoît et al., 1996). In this task the participant was asked to listen to sentences that, while morphosyntactically correct, were semantically unlikely, such as e-mowat sehkew, 'You eat the car', where e-mowat licenses any grammatically animate noun like sehkew, 'car', but is much more likely to refer to food than a vehicle.

After listening to the stimuli at least twice, the participant was asked to write down what they heard. This test produces a situation wherein speakers are less able to rely on context for discrimination (Hinterleitner, 2017) while making use of real words rather than just monosyllables. A total of 5 semantically regular sentences were also presented so as to investigate how well our system works. As we would expect a greater level of comprehension for these sentences, any other result would indicate very poor performance.

To assess the pleasantness of synthesis, a set of scales with opposing metrics was created. Scales whose end points fall beyond the metric markings were used (see Figures 1 and 2 for an example). This was done so as to avoid the tendency for participants to not rate at the ends of scales and the tendency for individuals to have difficulty in distinguishing a stimulus that is either terrible or very good (Hinterleitner, 2017). The scales comprised the following pairs: Natural vs. Unnatural, Pleasant vs. Unpleasant, Quick vs. Slow, and overall Good vs. Bad. Judgments were elicited for two utterances: one an excerpt from the training data, and one a synthesis of an utterance pulled from a corpus (Arppe et al., forthc.; Ahenakew, 2000; Bear et al., 1992; Ka-Nıpitehtew, 1998; Masuskapoe, 2010; Minde, 1997; Vandall and Douquette, 1987; Whitecalf, 1993).


The non-synthesized utterance was used as a point of comparison. The identities of these stimuli were not made known to the participant. Because the participant was a former language instructor, general comments regarding the usefulness of the synthesizer were invited.

6 Results

Although the methodologies described above were carefully considered for the task of evaluation, major modifications had to be made once the evaluation was actually undertaken. The first iteration of the synthesizer proved difficult to evaluate, with the participant barely able to complete most of the tasks. According to the participant, syntheses were unintelligible, unpleasant, too quick, and overall bad. Listening to the syntheses through headphones was so jarring that evaluation was done through basic laptop speakers. Judgment assessment of longer utterances was extremely difficult for the participant, and stimuli were deemed to be too long. The cause of this poor synthesis is likely the fact that ALISA managed to align only 24 minutes of the nearly 2.5 hours of training data.

To respond to this issue, the entirety of the training data was aligned by hand. The second iteration of the synthesizer was substantially more natural, intelligible, and pleasant. As the first participant's schedule was quite busy, a second participant was recruited. This participant was a male, middle-aged former Cree-language teacher who is proficient in the SRO. The following section details the evaluation of the second iteration of the speech synthesizer. All stimuli were presented in a randomized order.

6.1 Modified Cluster Identification Task

So as not to exhaust the participant, 70 clusters were presented. Of the tested clusters, 21 were identified correctly in their entirety. Another 16 were classified as minor errors (accepted variation in the orthography), such as 6 instances of <e> being written as <ı> (likely the result of the two phonemes overlapping in vowel space (Harrigan and Tucker, 2015), as well as perhaps the influence of English orthography) and 7 long vowels being written as short vowels followed by an <h> (a common non-standard way of indicating that a vowel is long). These two types make up the majority (13/16) of minor errors.

Together, minor errors and correctly identified clusters make up the majority of responses (53%). There were 11 items where the onset was misheard, with the majority of these (6) being <c> misheard as <s>. Given that <c> represents /ts/, this is not wholly surprising. The remaining error types were smaller in their token counts: 3 clusters had the vowel identified correctly, but the participant missed the onset entirely; 4 vowels were identified with the correct quality but the wrong length; 6 clusters showed the wrong quality but the correct length; and 4 clusters were heard with an incorrect vowel of incorrect length. See Table 1 for a full list of stimuli and results. A highlighted row indicates a stimulus where the participant's transcription deviated from the input. Boldface letters indicate where the deviance occurred.

6.2 Semantically Unpredictable Sentences

Semantically Unpredictable Sentences showed similar results. All sentences where the participant's transcription varied from the SRO input were semantically unpredictable. The differences in transcription were nearly always restricted to differences in vowel length. In the one case where vowel quality differed (kikı-sıkinik minos), the error was an instance of input <ı> being written by the participant as <e>. As mentioned previously, this variation is unsurprising due to overlapping vowel spaces. Table 2 summarizes the evaluation of the SUS task. Those sentences preceded by a hash-mark are semantically unpredictable, while those without are semantically predictable. As above, a highlighted row indicates a sentence where the participant's transcription deviated from the input, and boldface letters indicate where the deviance occurred.

6.3 Judgment Tasks

Impressionistic judgment of the synthesizer shows that the system performed worse than an actual native-speaker production. Figure 1 represents the evaluation of the synthesizer, and Figure 2 that of the natural utterance. Unlike the first iteration of the synthesizer (which was almost entirely rated as negatively as possible on all scales), the second synthesizer was rated as somewhat unnatural, unpleasant and bad, but not drastically so. The synthesizer was rated as only slightly slower than the middle ground between quick and slow.


Stimulus Heard | Stimulus Heard

kit kit | yit it
tit kit | wiit wiit
coot ot | moot moht
sat sat | wot wot
woot rot | niit neet
maat maat | siit seet
miit miit | soot sat
wit wit | hot hot
taat taat | sit sit
hat hat | seet seet
keet kit | naat naat
kot pat | mot not
kiit keet | pot pat
pat at | yeet yit
teet NA | sot sot
cot sot | cit sit
wat wat | paat paat
noot not | heet hiht
kaat kaat | toot toht
not not | ceet siht
haat hat | hiit neht
yaat yaat | pit pit
yot eewt | kat kat
ciit seet | piit peet
koot pat | mat mat
tiit teet | yat yeet
peet peet | tot tot
hoot hoht | cat set
neet neet | hit ahiht
weet weet | nit nit
nat nat | poot poht
tat tat | caat saht
yiit eet | waat waht
yoot eewt | meet miht
mit mit | saat seht

Table 1: Cluster Identification Results

The naturally produced stimuli showed almost the opposite pattern, being modestly natural, pleasant, good, and slow.

Interestingly, the naturally produced utterance was not judged to be maximally natural, pleasant, or good. This is likely due to the fact that the utterance was not quick or conversational speech, but rather a performed recording of biblical text.

7 Discussion

Although a synthesizer was developed, alignment through ALISA was unsuccessful, with just 24 minutes of roughly 2.5 hours correctly aligned (19%).

Figure 1: Synthesized Voice Judgment Task Evaluation

Figure 2: Naturally Produced Voice Judgment Task Evaluation

According to the developer, Adriana Stan, despite attempts to create an alignment system for a wide variety of languages, polysynthetic languages such as Plains Cree were not considered or tested; as a result, ALISA's parameters disprefer very long words, instead assuming an error in processing (p.c. Adriana Stan, Jan 31, 2018). Although ALISA allowed for some adjustment in this respect, changing the average acceptable word length (either by lowering or raising this threshold) did not produce any significant increase in alignment.


Input | Participant Transcription | English Gloss

# e-pahpisit sehkew | e-pahpisit sehkew | 'The car laughs.'
# sıwinikan pimohtew | sıwinikan pimohtew | 'Sugar walks.'
# e-postiskawacik kinosewak | e-postiskawacik kinosewak | 'You put the fishes on.'
awasis wapahtam masinahikan | awasis wapahtam masinahikan | 'A child sees a book'
e-wiyatikosit iyiniw | e-wiyatikosit iyiniw | 'The Indigenous person is happy'
napew e-mıcisot | napew e-mıcisot | 'The man is eating'
kinepik nipaw | kinepik nipaw | 'A snake is sleeping'
# atim kiwı-saskamohitin | atim kiwı-saskamohıtin | 'I am going to put a dog in your mouth.'
# kikı-sıkinik minos | kikı-sekinik minos | 'The cat poured you.'
iskwew pıkiskatisiw | iskwew pıkiskatisiw | 'A woman is sad.'

Table 2: Evaluated Sentences

Further, ALISA seemed relatively successful for other agglutinating languages such as Finnish and Turkish with similarly long words; this suggests that the issue with alignment may be more complex than just word length.

The evaluation of hand-aligned data with the Ossian synthesizer showed promising results. The second iteration of the synthesis was relatively well received by the second participant, who remarked that the speech synthesizer, while not perfect, was serviceable and represented an exciting opportunity for language learners. Despite this, multiple issues arose throughout the evaluation process. Most significant was the issue of orthographic proficiency. Because the tasks selected for this evaluation relied on written responses, only participants with strong literacy in the SRO could be considered. This is especially problematic for Indigenous languages of Canada, as many of these varieties are historically oral languages. Few fluent speakers of Plains Cree actually possess the level of literacy needed for the evaluation tasks. This severely restricts who can participate in evaluation of the synthesizer. A side effect of this restriction was the scarce availability of a native speaker to review the stimuli (as one would prefer not to have reviewers act as participants in the study), leading to the several orthographic inconsistencies in the first evaluation of both iterations of the speech synthesizer. For the second iteration of the synthesizer, evaluation was re-administered for those items with orthographic inconsistencies. One solution to this issue is to eschew the need for writing. Instead of writing down what they hear, participants could be asked to provide word-by-word translations (insofar as possible) and to compare these with the intended meaning of the stimuli.

Though this would not address the participants' ability to recognize particular graphemes, the lack of a need for literacy would allow for a significantly larger number of participants than would be available when requiring literacy in the SRO. Alternatively, participants could be asked to repeat what they have heard, though this would likely require a complete reworking of the stimuli.

In building a synthesizer for Plains Cree, the full Simple4All toolchain proved unreliable. In addition to the issues faced in ALISA alignment for Plains Cree, installation of the software posed multiple challenges, with little documentation for support. Future endeavours should consider newer systems such as the Merlin project (Wu et al., 2016), which uses Deep Neural Networks (DNNs), a form of machine learning that seems to provide better results than the Hidden Markov Models (HMMs) used in the Simple4All project. Although no comparison of DNNs vs. other machine learning methods has been reported for synthesis of North American languages, DNN-based systems such as Google's WaveNet report significantly better results than other techniques such as HMMs (van den Oord et al., 2016). If continuing to use Simple4All, the results of this evaluation suggest that one should not use ALISA for text alignment. In addition, researchers should consider the use of prebuilt aligners for languages with similar phoneme inventories. In the case of Plains Cree, given that its phoneme inventory is a subset of English's, one could consider the use of forced aligners built for English with a modified pronunciation dictionary specific to Plains Cree. This should be feasible in theory, though it remains to be seen whether it is useful in practice.


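To make the modified-dictionary idea concrete, the sketch below derives ARPABET-style pronunciations from SRO spellings for use with an English-trained forced aligner; the grapheme-to-phone mapping (and the doubled-vowel convention for long vowels) is an illustrative guess, not a vetted phonological analysis:

# Assumed, illustrative SRO-to-ARPABET correspondences only.
G2P = {
    "aa": "AA", "ii": "IY", "oo": "UW", "ee": "EY",
    "a": "AH", "i": "IH", "o": "UH", "e": "EH",
    "p": "P", "t": "T", "c": "CH", "k": "K", "s": "S",
    "m": "M", "n": "N", "w": "W", "y": "Y", "h": "HH",
}

def to_arpabet(word):
    # Greedy longest-match conversion of one SRO word.
    phones, i = [], 0
    while i < len(word):
        if word[i:i+2] in G2P:
            phones.append(G2P[word[i:i+2]])
            i += 2
        elif word[i] in G2P:
            phones.append(G2P[word[i]])
            i += 1
        else:
            i += 1  # skip hyphens and other punctuation
    return " ".join(phones)

# One dictionary line per word, e.g. "napew  N AH P EH W".
print("napew ", to_arpabet("napew"))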

In regards to evaluation, the SUS task might best be abandoned altogether in assessing future Plains Cree synthesizers. Results showed very little difference between semantically unpredictable sentences and semantically predictable ones. Both participants noted that the 'unpredictable' sentences were, in fact, ungrammatical: the line between unlikely and allowable is very thin. It would be interesting to repeat this task with other native speakers, as this may be a participant-specific attribute (though this study's participants had experience as second-language instructors and were likely far more familiar with listening to, identifying and interpreting odd, infelicitous, and/or mispronounced utterances than the average speaker). If this is a tendency that holds across native speakers of Plains Cree, it may be worth investigating the factors influencing these attitudes (such as a tendency for speakers of the language to be either very novice or very fluent, perhaps leading to a lack of familiarity with or tolerance of variation as seen in the SUS task). Assuming this attitude holds for the general population, it would be best to choose somewhat semantically predictable, but less frequent, stimuli (e.g., I ate the zebra, where zebra is more predictable than car, but less predictable than rabbit) or to avoid the SUS task entirely. Of course, this means the confound of semantic predictability endures, though this might be addressed by presenting words in isolation (accepting that this does not allow for sentence-level prosody to be assessed).

Based on the feedback from the participant, reducing the number of syllables presented would be beneficial, though the details of how to do so remain unclear, especially considering that this study already ignored a large portion of possible syllables. Random sampling could be used, selecting just one instance of each sound in each syllable position (e.g., ensuring that at least one syllable starting with a stop and ending with a fricative is tested). Spreading the set over many participants, such that every syllable is evaluated the same number of times but not by each participant, is perhaps a better solution, as it allows every syllable to be evaluated; in this case, one would have to analyze results via some sort of mixed-effects model (where speaker acts as a random effect; a sketch follows at the end of this section) to account for the variation between speakers. However, this method requires many participants,

an inherent restriction when working with minority languages, especially those of North America. Finally, the first participant indicated that a few of the monosyllables, while unattested in dictionaries, were actually vulgarities. As this task was meant to assess only syllable intelligibility, separately from word-level intelligibility, and due to their offensive nature, it is important that future studies remove these words or at least warn participants of their possible presence.
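As a sketch of the mixed-effects analysis suggested above (the file and column names are hypothetical; a logistic mixed model would arguably suit 0/1 responses better, but MixedLM shows the structure):

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format results: one row per (participant, syllable)
# response, with `correct` coded 0/1.
df = pd.read_csv("cluster_results.csv")

# A random intercept per participant absorbs between-speaker variation.
model = smf.mixedlm("correct ~ onset_type + vowel_length", df,
                    groups=df["participant"])
print(model.fit().summary())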

8 Conclusion

This paper presents one of the first parametric syntheses of an Indigenous language of Canada, using the Simple4All packages ALISA and Ossian. Based on roughly 2.5 hours of speech, this method of speech synthesis makes use of lightly supervised forced alignment to ease the workload required of the researcher. Although Simple4All has been used with a variety of languages (Simple4All, 2011–2014), the forced alignment was largely unsuccessful with the Plains Cree data. No conclusive reason could be found to account for this, though it may be that word length played a factor. In contrast, the second synthesizer, based on hand-aligned training data, presents promising results, with many of the stimuli being understood in their entirety. Although this second synthesizer was clearly identified as non-natural speech, its output was intelligible and relatively well received by the participant. Where the participant's transcription of stimuli deviated from the input, deviations generally concerned vowel length. The results of this paper also indicate that careful consideration must be given to evaluation frameworks, since techniques that have become established and applied successfully for majority languages may not be suitable for Indigenous languages, at least for those in the Canadian context.

Acknowledgements

The authors of this paper would like to thank Dolores Sand, the speaker on whom the synthesis was based, for her willingness to have her data used in this project. We would also like to thank Martti Vainio, Antti Suni, and Adriana Stan for not only their work in developing Simple4All, but their generosity in helping to troubleshoot throughout the synthesis. This research was supported by


the Social Sciences and Humanities Research Council of Canada (Grants 890-2013-0047 and 611-2016-0207), the University of Alberta Kule Institute for Advanced Study Research Cluster Grant, and a Social Sciences and Humanities Research Council of Canada Doctoral Fellowship. We would also like to thank Arok Wolvengrey and Jean Okimasis for their help in reviewing stimuli, Arden Ogg for helping to facilitate our relationship with Dolores Sand, and community members of Maskwacıs, Alberta for their input throughout the process of developing the synthesizer. Finally, we would like to thank our anonymous participants for their expertise, patience, and participation.

References

Alice Ahenakew. 2000. ah-ayıtaw isi e-kı-kiskeyihtahkik maskihkiy / They Knew Both Sides of Medicine: Cree Tales of Curing and Cursing Told by Alice Ahenakew, edited by H.C. Wolfart and F. Ahenakew. Publications of the Algonquian Text Society. Winnipeg: University of Manitoba Press.

Antti Arppe, Jordan Lachler, Trond Trosterud, Lene Antonsen, and Sjur N. Moshagen. 2016. Basic language resource kits for endangered languages: A case study of Plains Cree. CCURL, pages 1–8.

Antti Arppe, Katherine Schmirler, Atticus G. Harrigan, and Arok Wolvengrey. forthc. A morphosyntactically tagged corpus for Plains Cree. In Papers of the 49th Algonquian Conference.

Chris Bagwell. 1998–2013. SoX – Sound eXchange, the Swiss Army knife of audio manipulation. http://sox.sourceforge.net/sox.html.

Glecia Bear, Minnie Fraser, Irene Calliou, Mary Wells, Alpha Lafond, and Rosa Longneck. 1992. kohkominawak otacimowiniwawa / Our Grandmothers' Lives: As Told In Their Own Words, edited by F. Ahenakew and H.C. Wolfart, volume 3. University of Regina Press.

Christian Benoît, Martine Grice, and Valérie Hazan. 1996. The SUS test: A method for the assessment of text-to-speech synthesis intelligibility using semantically unpredictable sentences. Speech Communication, 18:381–392.

Alan W. Black, Kevin A. Lenzo, and Vincent Pagel. 1998. Issues in building general letter to sound rules. In The Third ESCA/COCOSDA Workshop on Speech Synthesis, Blue Mountains, Australia, November 26–29, 1998, pages 77–80.

Leonard Bloomfield. 1946. Algonquian. In Linguistic Structures of Native America, volume 6, pages 85–129. Viking Fund Publications in Anthropology, New York.

Canadian Bible Society. 2005. kihci-masinahikan nikamowina: atiht ka-nawasonamihk. Toronto: Canadian Bible Society.

Jonathan Duddington, Martin Avison, Reece Dunn, and Valdis Vitolins. n.d. eSpeak speech synthesizer.

Ethnologue. 2016. Plains Cree.

Atticus Harrigan and Benjamin Tucker. 2015. Vowel spaces and reduction in Plains Cree. Journal of the Canadian Acoustics Association, 43(3).

Atticus G. Harrigan, Katherine Schmirler, Antti Arppe, Lene Antonsen, Trond Trosterud, and Arok Wolvengrey. 2017. Learning from the computational modelling of Plains Cree verbs. Morphology, 27(4):565–598.

Florian Hinterleitner. 2017. Quality of Synthetic Speech: Perceptual Dimensions, Influencing Factors, and Instrumental Assessment, 1st edition. Springer Publishing Company, Incorporated.

Ute Jekosch. 1992. The cluster-identification test. In ICSLP-1992.

Dan Jurafsky and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Prentice Hall, Upper Saddle River, N.J.

Jim Ka-Nıpitehtew. 1998. Counselling Speeches of Jim Ka-Nıpitehtew, edited by F. Ahenakew and H.C. Wolfart. University of Manitoba Press.

Nancy LeClaire, George Cardinal, Emily Hunter, and Earle G. Waugh. 1998. Alberta Elders' Cree Dictionary: Alperta ohci kehtehayak nehiyaw otwestamakewasinahikan, 1st edition. Edmonton: University of Alberta Press.

Yoshitaka Mamiya, Junichi Yamagishi, Oliver Watts, Robert A.J. Clark, Simon King, and Adriana Stan. 2013. Lightly supervised GMM VAD to use audiobook for speech synthesiser. In Proc. ICASSP.

Maskwachees Cultural College. 1997. Maskwacıs Dictionary of Cree Words / Nehiyaw Pıkiskwewinisa. Maskwachees Cultural College.

William Mason and Sophia Mason. 2000. Oski Testament. Toronto: Canadian Bible Society.

Cecilia Masuskapoe. 2010. piko kıkway e-nakacihtat: kekek otacimowina e-nehiyawasteki, edited by H.C. Wolfart and F. Ahenakew. Algonquian and Iroquoian Linguistics.

Emma Minde. 1997. kwayask e-kı-pe-kiskinowapahtihicik / Their Example Showed Me the Way, edited by F. Ahenakew and H.C. Wolfart. Edmonton: University of Alberta Press.

Jeffrey Muehlbauer. 2012. Vowel spaces in Plains Cree. Journal of the International Phonetic Association, 42(1):91–105.

Jean Okimasis and Arok Wolvengrey. 2008. How to Spell it in Cree. Regina: Miywasin Ink.

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alexander Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv.

Kishore Prahallad, Alan W. Black, and Mosur Ravishankar. 2006. Sub-phonetic modeling for capturing pronunciation variations for conversational speech synthesis. In 2006 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2006, Toulouse, France, May 14–19, 2006, pages 853–856.

Simple4All. 2011–2014. Simple4All: developing automatic speech synthesis technology. http://simple4all.org.

Simple4All. 2013–2014. Ossian speech synthesis toolkit. http://homepages.inf.ed.ac.uk/owatts/ossian/html/index.html.

Simple4All. 2014. ALISA: An automatic lightly supervised speech segmentation and alignment tool. http://simple4all.org/product/alisa/.

A. Stan, P. Bell, J. Yamagishi, and S. King. 2013. Lightly supervised discriminative training of grapheme models for improved sentence-level alignment of speech and text data. In INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, pages 1525–1529.

Peter Vandall and Joe Douquette. 1987. waskahikaniwiyiniw-acimowina / Stories of the House People, edited by F. Ahenakew. University of Manitoba Press.

Sarah Whitecalf. 1993. kinehiyawiwininaw nehiyawewin / The Cree Language Is Our Identity: The La Ronge Lectures of Sarah Whitecalf, edited and translated by H.C. Wolfart and F. Ahenakew. Publications of the Algonquian Text Society. Winnipeg: University of Manitoba Press.

Robert Whitman, Chilin Shih, and Richard Sproat. 1997. A Navajo language text-to-speech synthesizer. Technical report, AT&T Bell Laboratories.

H. Christoph Wolfart. 1973. Plains Cree: A Grammatical Study. Transactions of the American Philosophical Society, new ser., v. 63, pt. 5. Philadelphia: American Philosophical Society.

H. Christoph Wolfart. 1996. Sketch of Cree, an Algonquian language. In Handbook of North American Indians. Volume 17: Languages, pages 390–439. Smithsonian Institution, Washington.

Zhizheng Wu, Oliver Watts, and Simon King. 2016. Merlin: An open source neural network speech synthesis system. In 9th ISCA Speech Synthesis Workshop (2016), pages 218–223.


Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages: Vol. 1 Papers, pages 74–80, Honolulu, Hawai‘i, USA, February 26–27, 2019.

A biscriptual morphological transducer for Crimean Tatar

Francis M. Tyers
Department of Linguistics
Indiana University
[email protected]

Jonathan North Washington
Linguistics Department
Swarthmore College
[email protected]

Darya Kavitskaya
Slavic Languages and Literatures
U.C. Berkeley
[email protected]

Memduh Gökırmak
ÚFAL
Univerzita Karlova in Prague
[email protected]

Nick Howell
Faculty of Mathematics
Higher School of Economics
[email protected]

Remziye Berberova
Department of Crimean Tatar†
Crimean Tavrida University
[email protected]

Abstract

This paper describes a weighted finite-state morphological transducer for Crimean Tatar able to analyse and generate in both Latin and Cyrillic orthographies. This transducer was developed by a team including a community member and language expert, a field linguist who works with the community, a Turkologist with computational linguistics expertise, and an experienced computational linguist with Turkic expertise.
Dealing with two orthographic systems in the same transducer is challenging, as they employ different strategies to deal with the spelling of loan words and to encode the full range of the language's phonemes and their interaction. We develop the core transducer using the Latin orthography and then design a separate transliteration transducer to map the surface forms to Cyrillic. To help control the non-determinism in the orthographic mapping, we use weights to prioritise forms seen in the corpus. We perform an evaluation of all components of the system, finding an accuracy above 90% for morphological analysis and near 90% for orthographic conversion. This comprises the state of the art for Crimean Tatar morphological modelling, and, to our knowledge, is the first biscriptual single morphological transducer for any language.

1 Introduction

This paper presents the development and evaluation of a free/open-source finite-state morphological transducer for Crimean Tatar that is able to analyse and generate both Latin and Cyrillic orthographies.¹ As an example, this transducer can map between the analysis köy<n><px3sp><loc> 'in the village of' and both of the possible orthographic forms, in Latin köyünde and in Cyrillic коюнде.

†Emerita
¹The transducer is available in a publicly accessible repository at http://github.com/apertium/apertium-crh.

This paper is structured as follows: Section 2 presents background on Crimean Tatar and its orthographic systems; Section 3 gives an overview of the current state of computational analysis of Crimean Tatar morphology; Section 4 presents certain challenging aspects of Crimean Tatar morphology and their implementation; and Section 5 evaluates the transducer. Finally, Section 6 summarises possible directions for future work, and Section 7 provides concluding remarks.

2 Crimean Tatar

2.1 Context

Crimean Tatar ([qɯɾɯmtɑtɑɾʧɑ], ISO 639-3: crh) is an understudied Turkic language of the Northwestern Turkic (Kypchak) subgroup (Johanson and Csató, 1998).² The language also shows a considerable influence from the Southwestern Turkic (Oghuz) subgroup, acquired via contact with Turkish, as well as more recent influence from the Southeastern Turkic subgroup, due to the nearly 5-decade-long resettlement of the entire Crimean Tatar population of Crimea to Central Asia (predominantly to Uzbekistan) by the Soviet government in 1944. It shares some level of mutual intelligibility with other languages in these subgroups, but is an independent language. The geographical location of Crimean Tatar in reference to other NW (Kypchak) and SW (Oghuz) varieties is shown in Figure 1.

Currently, about 228,000 speakers of Crimean Tatar have returned to Crimea, and another 313,000 live in diaspora (Simons and Fennig, 2018). Almost all speakers of Crimean Tatar are bilingual or multilingual in Russian and the language of the place of their exile, such as Uzbek, Kazakh, or Tajik.

²Crimean Tatar and [Kazan] Tatar (tat) happen to share a name, but are only related in that they are both Northwestern Turkic languages—though members of different sub-subgroups.


Figure 1: Location of the Crimean Tatar speaking area (crh) within the Black Sea region, relative to other Kypchak (Urum – uum, Karachay-Balkar – krc, Nogay – nog, Kumyk – kum and Kazakh – kaz) and Oghuz languages (Gagauz – gag, Turkish – tur, and Azerbaijani – azb and azj).

Crimean Tatar is spoken mostly by the older population, but usage may be rising due to increasing efforts in the community to teach the language to the younger generation, despite impediments caused by the generally unfavourable sociolinguistic situation.

2.2 Orthographic systems

As previously mentioned, Crimean Tatar can be written with two orthographic systems, one based on the Latin alphabet and one based on the Cyrillic alphabet. Both orthographies are widely used and have varying degrees of official support. They have different ways of treating phenomena such as front rounded vowels, the uvular/velar contrast in obstruents, and loanwords from Russian.

The Latin orthography contains 31 characters, and uses no digraphs except for foreign sounds in borrowings. Each phoneme is implemented using a distinct character; for example ‹o› for the non-high back rounded vowel and ‹ö› for its front counterpart. The diacritics tilde (‹ñ› for [ŋ]), cedilla (e.g., ‹ş› for [ʃ]) and diaeresis (e.g., ‹ö› for [ø]) are used as components of characters for sounds that are not covered in a straightforward way by the basic Latin alphabet. In the Latin orthography, Russian words can be treated as adapted to the phonology of Crimean Tatar, such as bücet 'budget'—as opposed to *byudjet, a more faithful rendering in the Latin orthography of the pronunciation of the Russian бюджет. However, most loanwords are pronounced as in Russian, yet are rendered more faithfully to their Russian spelling than to their Russian pronunciation; for example, konki for Russian коньки [kɐnʲˈkʲi] 'skates' is pronounced as in Russian and not *[konki] as its Latin-orthography spelling would suggest.

The Cyrillic orthography contains the 33 characters of the Russian alphabet and four digraphs for sounds not found in Russian: ‹дж› [ʤ], ‹гъ› [ʁ], ‹къ› [q], and ‹нъ› [ŋ].

The additional sounds [ø] and [y] are implemented either by the use of the "soft sign" ‹ь› in conjunction with the letters for [o] ‹о› and [u] ‹у›, or by using the corresponding "yoticised" vowel letters ‹ё› and ‹ю›, respectively; the particular system is clarified below. Russian words are spelled as in Russian, including the use of hard and soft signs and Russian phonological patterns. An example is коньки 'skates', in which the ‹ь› indicates that the [n] is palatalised. Table 1 shows the basic mapping between the two orthographic systems.

b c ç d f g ğ h j k l
б дж ч д ф г гъ х ж к л

m n ñ p q r s t v y z
м н нъ п къ р с т в й з

a â ı o u e i ö ü
а/я я ы о/ё у/ю э/е и о/ё у/ю

Table 1: The basic correspondences between the characters of the two Crimean Tatar orthographies.

While the Latin orthography represents the front vowels [ø] and [y] simply as ‹ö› and ‹ü›, in the Cyrillic orthography they are represented in one of the following ways.

With the letters ‹о› and ‹у›:

• With a front-vowel character (‹и›, ‹е›) or one of the "yoticised" vowels (‹ё›, ‹ю›) following the subsequent consonant. This strategy is usually limited to the first syllable of a [multisyllable] word when in certain consonant contexts, and is more prevalent in open syllables. Examples include учюнджи [yʧ-ynʤi] 'third', кунюнде [kyn-yn-de] 'on the day of', болип [bøl-ip] 'having divided', and муче [myʧe] 'body part'.

• Either with the "soft sign" ‹ь› following the subsequent consonant, or without it if a "soft consonant" (i.e., the velar pair to a uvular consonant), like [k] or [ɡ], follows the vowel. This strategy is generally used only in the first syllable of a word when there is a following coda consonant. Examples include учь [yʧ] 'three',


куньлери [kyn-ler-i] 'days of', больмеди [bøl-me-di] 'did not divide', кок [køk] 'sky', and букти [byk-ti] 'he/she/it bent'.

With the “yoticised” vowel letters ‹ё› and ‹ю›:

• When not in the first syllable of a word. Examples include юзю [jyz-y] 'his/her face', тёксюн [tøk-syn] 'let it spill', and мумкюн [mymkyn] 'possible'. Note that [ø] almost never occurs outside of the first syllable of a word.

• When in the first syllable of a word and preceded by a consonant. Examples include тюшти [tyʃ-ti] 'fell', дёрт [dørt] 'four', and чёпке [ʧøp-ke] 'to the rubbish'.

The difference between when the letters ‹о› or ‹у› are used as opposed to ‹ё› or ‹ю› in first syllables of words seems to depend in part on the values of the surrounding consonants. However, a certain level of idiosyncrasy exists, as seen in pairs like козь [køz] 'eye' and сёз [søz] 'word', кёр [kør] 'blind' and корь [kør] 'see', or юз [jyz] 'hundred' and юзь [jyz] 'face'.

The "yoticised" vowel letters ‹ё› and ‹ю› also represent the vowels [o] and [u] when following [j], as well as [ø] and [y] when following [j]—sound combinations that can occur word-initially or after another vowel (usually rounded and of corresponding backness, though note that [ø] and [o] are extremely uncommon outside of the first syllable of a word). Hence there is in principle the potential for systematic ambiguity between rounded back vowels preceded by [j] (‹ё› [jo] and ‹ю› [ju]) and rounded front vowels preceded by [j] (‹ё› [ø] and ‹ю› [y]). In practice it is difficult to identify examples of this, but pairs like ют [jut] 'swallow' and юз [jyz] 'hundred' demonstrate the concept.

Furthermore, [j] is represented in the Cyrillic orthography either by й (e.g., къой [qoj] 'put'), or with a yoticised vowel letter (e.g., къоюл [qoj-ul] 'be put'); i.e., these letters are involved in a many-to-many mapping with the phonology.

3 Prior work

Altıntaş and Çiçekli (2001) present a finite-state morphological analyser for Crimean Tatar. Their morphological analyser has a total of 5,200 stems, and the morphotactics³ are based on a morphological analyser of Turkish. They explicitly cover only the native part of the vocabulary, excluding loan words, and use an ASCII representation for the orthography. Their analyser is not freely available for testing, so unfortunately we could not compare its performance to that of ours.

³The morphotactics of a language is the way in which morphemes can be combined to create words.


4 Methodology

To implement the transducer we used the Helsinki Finite-State Toolkit, HFST (Lindén et al., 2011). This toolkit implements the lexc and twol formalisms and also natively supports weighted FSTs. The former implements the morphology, or mappings between analysis and morphological form, such as köy<n><px3sp><loc> : köy>{s}{I}{n}>{D}{A}, while the latter is used to ensure the correct mapping between morphological form and orthographic (or "phonological") form, such as köy>{s}{I}{n}>{D}{A} : köyünde. When compose-intersected, the transducers generated from these modules result in a single transducer mapping the two ends with no intermediate form, e.g., köy<n><px3sp><loc> : köyünde.

The choice to model the morphophonology using twol as opposed to using a formalism that implements sequential rewrite rules may be seen as controversial. Two-level phonological rules are equivalent in expressive power to sequential rewrite rules (Karttunen, 1993); however, from the point of view of linguistics, they present some differences in terms of how phonology is conceptualised. Two-level rules are viewed as constraints over a set of all possible surface forms generated by expanding the underlying forms using the alphabet, and operate in parallel. Sequential rewrite rules, on the other hand, are viewed as a sequence of operations for converting an underlying form to a surface form. As such, sequential rules result in intermediate forms, whereas the only levels of representation relevant to two-level rules are the morphological (underlying) form and the phonological (surface) form. While it may not be relevant from an engineering point of view, we find more cognitive plausibility in the two-level approach. Furthermore, the computational phonologist on our team finds the two-level model much less cumbersome to work with for modelling an entire language's phonology than sequential rewrite rules. Readers are encouraged to review Karttunen (1993) for a more thorough comparison of the techniques.


4.1 Lexicon

The lexicon was compiled semi-automatically. We added words to the lexicon by frequency, based on frequency lists from Crimean Tatar Bible⁴ and Wikipedia⁵ corpora (see Section 5.1 for sizes). Table 2 gives the number of lexical items for each of the major parts of speech. The proper noun lexicon includes a list of toponyms extracted from the Crimean Tatar Wikipedia.

Part of speech | Number of stems
Noun | 6,271
Proper noun | 4,123
Adjective | 1,438
Verb | 1,007
Adverb | 87
Numeral | 40
Pronoun | 31
Postposition | 21
Conjunction | 20
Determiner | 16
Total: | 13,054

Table 2: Number of stems in each of the main categories.

4.2 Tagsets

The native tagset of the analyser is based on the tagsets used by other Turkic-language transducers⁶ in the Apertium platform.⁷ In addition we provide a mapping from this tagset to one compatible with Universal Dependencies (Nivre et al., 2016) based on 125 rules and a set overlap algorithm.⁸ The rules are composed of triples of (lemma, part of speech, features) and are applied deterministically, longest-overlap first, over the source-language analyses.

4.3 Morphotactics

The morphotactics of the transducer are adapted from those of the Kazakh transducer described by Washington et al. (2014). The nominal morphotactics are almost identical between Kazakh and Crimean Tatar. The verbal morphotactics are rather different, and here we followed Kavitskaya (2010).

⁴Compiled by IBT Russia/CIS, https://ibt.org.ru
⁵Content dump from the Crimean Tatar Wikipedia, https://crh.wikipedia.org, dated 2018-12-01.
⁶See for example Washington et al. (2016) for a description.
⁷Available online at http://www.apertium.org.
⁸Available in the repository as texts/crh-feats.tsv.

4.4 Transliterator

The transliterator is implemented as a separate substring-to-substring lexc grammar and twol ruleset.

The lexc grammar defines a transducer which converts from the Latin orthography to a string where placeholders are given for the hard sign, the soft sign, and some digraphs which are single characters in the Cyrillic orthography (e.g., ts = ц). The output may be ambiguous: for example, the input string şç produces both şç (analysis, preceding transliteration) and a special symbol щ (also analysis) standing for Cyrillic щ (surface form). This is necessary because şç may map to either ‹щ› (borşç = борщ [borɕ] 'borsch') or ‹шч› (işçi = ишчи [iʃʧi] 'worker'). A toy illustration of this ambiguity follows Figure 2.

The twol ruleset defines a transducer which then maps the Latin string produced to strings in Cyrillic via the alphabet, and applies a set of constraints to restrict the possible combinations. All remaining theoretically valid mappings are kept. An example of one of these constraints is shown in Figure 2.

"e as э"
e:э <=> .#. _ ;
        [ e: | a: | i: | ü: ] _ ;

Figure 2: An example of a twol constraint used in the mapping of Latin strings to Cyrillic strings. This constraint forces Latin ‹e› [e] to be realised as Cyrillic ‹э›, instead of Cyrillic ‹е› (which would in turn stand for [je]), at the beginning of a word and after certain vowels.
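The kind of one-to-many mapping the transliterator must entertain can be illustrated with a toy candidate generator; the substring map below is a tiny, hand-picked subset chosen for illustration, not the grammar's actual coverage:

from itertools import product

# Toy Latin-to-Cyrillic substring map; "şç" is genuinely ambiguous.
MAP = {
    "şç": ["щ", "шч"],
    "ş": ["ш"], "ç": ["ч"], "i": ["и"], "b": ["б"],
    "o": ["о"], "r": ["р"],
}

def candidates(word):
    # Longest-match segmentation, then the cross-product of the
    # alternatives for each segment.
    segs, i = [], 0
    while i < len(word):
        if word[i:i+2] in MAP:
            segs.append(MAP[word[i:i+2]])
            i += 2
        else:
            segs.append(MAP.get(word[i], [word[i]]))
            i += 1
    return ["".join(p) for p in product(*segs)]

print(candidates("borşç"))  # ['борщ', 'боршч']
print(candidates("işçi"))   # ['ищи', 'ишчи']

In the real system, the twol constraints and the corpus-based weights described below select among such candidates.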

The resulting transducer takes surface forms in Latin script, and outputs surface forms in Cyrillic script. In order to get an analyser which analyses Cyrillic, we then composed the original [Latin] transducer with the transliteration transducer.

In order to be able to choose the orthographically correct variant for generation in the case of ambiguity in the conversion, we tried two corpus-based methods.

The first method we tried was simply to weight surface forms we saw in the corpus with a negative weight; since our transducer interprets lower weights as better, forms which were previously seen would always be given preference over those generated by the model on the fly. This was done by making a negative-weight identity transducer of the surface forms, composing it with the transliteration transducer, and taking the union with the transliteration transducer alone. The second method was to estimate probabilities for the generated transliterations using character n-gram frequencies from the corpus.



We used a particularly simplistic estimation technique: fix n ≥ 1. For each k ≤ n, we collect the k-grams of the corpus (with repetition); if Nk is the total number of k-grams collected, the probability assigned to a k-gram x is #x/Nk, and the weight is the negative logarithm of this probability. The unaugmented transliteration transducer is then composed with this new weighted transducer before being composed with the morphological transducer.

Transducers produced by the two methods are unioned; the result is then composed with the morphological transducer. Generated transliterations thus have many paths which they can follow to acceptance, assigning various weights. If the transliteration is a surface form observed in the training corpus, it is assigned a negative weight (given absolute preference). Unobserved transliterations are assigned positive weights based on possible segmentations: k-character segments are assigned weights from the k-gram counters, and the weight of a segmentation is the sum of the weights of its segments.
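A minimal sketch of this weighting scheme (function and variable names are ours):

import math
from collections import Counter

def ngram_weights(text, n):
    # For each k <= n: p(x) = count(x) / N_k over all k-grams of the
    # corpus (with repetition); the weight is -log p(x).
    weights = {}
    for k in range(1, n + 1):
        counts = Counter(text[i:i+k] for i in range(len(text) - k + 1))
        total = sum(counts.values())
        for gram, c in counts.items():
            weights[gram] = -math.log(c / total)
    return weights

def segmentation_weight(segments, weights):
    # Weight of a candidate segmentation = sum of its segments' weights.
    return sum(weights[s] for s in segments)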

4.5 Error model

While working on morphological models for languages with a less-than-stable written norm, or where there is little support for proofing tools or keyboards, it is desirable to be generous with the forms that are accepted while being conservative with the forms that are generated. Orthographic variation is inevitable, and if we want to create a high-coverage resource, then we should also take this variation into account. For example, in the corpus the locative of Belarus (the country) is written Belarusde 4 times, Belarusta twice and Belaruste once. The normative spelling, to fit with the pronunciation [belarus], should be Belarusta; however, we would also like to be able to provide an analysis for the other variants. Based on an informal examination of 1,000 random tokens in the news and encyclopaedic corpora, we estimate that at least 0.8% of tokens in these corpora constitute non-normative orthographic forms of this type.

Our approach is again to use twol. First, the rules for vowel harmony and other phonological phenomena were removed from the twol transducer that implements the normative orthography, leaving only unconstrained symbol mappings. This was then composed with the lexicon to produce a transducer which has all of the possible phonological variants (much like a fully expanded version of the lexc transducer).

This was then subtracted from the normative transducer, and a tag <err_orth> was appended to the analysis side of all remaining forms to indicate an orthographic error. The result is a transducer which accepts any phonological variant that is not normative and gives an analysis with an extra tag. This was unioned with the normative transducer to produce the final analyser. This approach allows us to analyse prescriptively incorrect variants like Belaruste as Belarus<np><top><loc><err_orth>.
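Schematically, over finite string sets rather than transducers, the construction looks as follows (the suffix list is illustrative only):

# Unconstrained variants minus normative forms = non-normative forms,
# which then receive the <err_orth> tag on the analysis side.
LOCATIVE_SUFFIXES = ["ta", "te", "da", "de"]  # harmony left unconstrained
all_variants = {"Belarus" + s for s in LOCATIVE_SUFFIXES}
normative = {"Belarusta"}

err_forms = all_variants - normative
analyses = {f: "Belarus<np><top><loc><err_orth>" for f in err_forms}
# e.g. analyses["Belaruste"] == "Belarus<np><top><loc><err_orth>"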

5 Evaluation

We have evaluated the morphological transducer in several ways. We computed the naïve coverage and the mean ambiguity of the analyser on freely available corpora (Section 5.1), as well as its accuracy (precision and recall) against a gold standard (Section 5.2). Additionally, we evaluated the accuracy of the transliteration transducer (Section 5.3).

5.1 Analyser coverage

We determined the naïve coverage and mean ambiguity of the morphological analyser. Naïve coverage is the percentage of surface forms in a given corpus that receive at least one morphological analysis. Forms counted by this measure may have other analyses which are not delivered by the transducer. The mean ambiguity measure was calculated as the average number of analyses returned per token in the corpus. These measures for three corpora, spanning both orthographies, are presented in Table 3;⁹ a short sketch of how they can be computed follows the table.

Corp. | Orthog. | Tokens | Cov. | Ambig.
Bible | Cyr | 217,611 | 90.9% | 1.86
Wiki | Lat | 214,099 | 92.1% | 1.86
News | Lat | 1,713,201 | 93.7% | 2.12

Table 3: Naïve coverage and mean ambiguity of the analyser on three corpora.

The transducer provides analyses for over 90% of tokens in each corpus, with each token receiving an average of around two analyses.
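A minimal sketch of the two measures, assuming `analyse` maps a surface token to its list of analyses (whether mean ambiguity is taken over all tokens or only over analysed ones is not specified in the paper; here it is over analysed tokens):

def coverage_and_ambiguity(tokens, analyse):
    # Naive coverage: share of tokens with at least one analysis.
    # Mean ambiguity: average number of analyses per analysed token.
    analysed = [analyse(t) for t in tokens]
    covered = [a for a in analysed if a]
    coverage = len(covered) / len(tokens)
    mean_ambiguity = sum(len(a) for a in covered) / len(covered)
    return coverage, mean_ambiguity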

5.2 Transducer accuracy

Precision and recall are measures of the average accuracy of analyses provided by a morphological transducer.

⁹The Bible and Wikipedia corpora are those described in Section 4.1, and the News corpus is content for the years 2014–2015 extracted from http://ktat.krymr.com/.


Precision represents the proportion of the analyses provided by the transducer for a form that are correct. Recall is the percentage of the analyses deemed correct for a form that are provided by the transducer. To calculate precision and recall, it was necessary to create a hand-verified list of surface forms and their analyses.

We extracted around 2,000 unique surface forms at random from the Wikipedia corpus, and checked that they were valid words in the language and correctly spelled. When a word was incorrectly spelled or deemed not to be a form used in the language, it was discarded.¹⁰

This list of surface forms was then analysed with the most recent version of the analyser, and around 500 of these analyses were manually checked. Where an analysis was erroneous, it was removed; where an analysis was missing, it was added. This process gave us a 'gold standard' morphologically analysed word list of 448 forms.¹¹

We then took the same list of surface forms and ran them through the morphological analyser once more. Precision was calculated as the number of analyses which were found in both the output from the morphological analyser and the gold standard, divided by the total number of analyses output by the morphological analyser. Recall was calculated as the total number of analyses found in both the output from the morphological analyser and the gold standard, divided by the number of analyses found in the morphological analyser plus the number of analyses found in the gold standard but not in the morphological analyser.
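In standard set-based terms this amounts to the following sketch (the paper's prose definition of recall differs slightly in wording); `system` and `gold` are assumed to map each form to its set of analysis strings:

def precision_recall(system, gold):
    # True positives: analyses shared by system output and gold standard.
    tp = sum(len(system[f] & gold[f]) for f in gold)
    system_total = sum(len(system[f]) for f in gold)
    gold_total = sum(len(gold[f]) for f in gold)
    return tp / system_total, tp / gold_total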

way, precision was 94.98% and recall was 81.32%.Most of the issues with recall were due to miss-

Regarding the precision, common issues included incorrect categorisation in the lexicon, and dubious forms, such as the imperative of kerek- 'need', which is in the analyser and is a hypothetically possible form, but appears not to be possible in practice.

5.3 Transliterator accuracy

We evaluated the transliteration component on the headwords from a bilingual Crimean Tatar–Russian dictionary that has been published in both Cyrillic and Latin orthographies.¹³

¹⁰Available in the repository as dev/annotate.txt.
¹¹Available in the repository as dev/annotate.all.txt.
¹²However, note that the recall number may be somewhat inflated, as thinking of missing analyses for already analysed words is particularly difficult.

We created a list of Cyrillic–Latin correspondences by aligning the headwords automatically, based on an automated word-by-word comparison of the definitions in Russian, for a total of 14,905 unique entries.

141 entries (~1%) had comments which did not match word-for-word; while it is possible that these could be corrected by hand, we discarded them. We then fed the Latin entries to the full transliteration transducer and evaluated against the corresponding Cyrillic entries.

Table 4 shows the performance of the transliterator for the different methods. In this case, precision is the percentage of predictions which are correct, recall is the percentage of words for which a correct transliteration is predicted, and the F-score is the harmonic mean of the two: F⁻¹ = mean(Prec⁻¹, Rec⁻¹).

Method | States | Precision | Recall | F-score
– | 114 | 53.0 | 98.4 | 68.9
1-gram | 2,030 | 93.4 | 93.5 | 93.5
2-gram | 17,382 | 94.1 | 94.2 | 94.1
3-gram | 99,761 | 94.0 | 94.1 | 94.1
4-gram | 290,201 | 94.4 | 94.6 | 94.5
5-gram | 577,926 | 95.1 | 95.2 | 95.2
6-gram | 924,719 | 95.5 | 95.6 | 95.5
7-gram | 1,282,917 | 95.4 | 95.6 | 95.0

Table 4: Performance of the transliterator using different methods. States gives a measure of the size of the generated FST.

Without n-grams, there is no attempt to filter proposed transliterations; that is, this "null" method generates all possible transliterations according to the combined phonological-morphological transducer. It demonstrates the theoretical limit of recall. Precision dramatically increases with the introduction of n-grams, as expected, and increases further with higher-order n-grams, levelling off at just over 95%. Recall drops from the maximum of 98.4% (the theoretical maximum the n-gram system can hope to attain); as the quality of the statistical filter increases, so does recall, until it levels off at 95.6%.

The problems with the transliteration model consist almost entirely of issues related to the presence of hard and soft signs in Cyrillic spellings (accounting for 492 of 1007, or 48.9%, of errors), incorrect vowels, mostly related to yoticisation (accounting for 469, or 46.6%, of errors), and issues correctly predicting “ц” versus “тс” (accounting for 40, or 4%, of errors).

¹³Available from http://medeniye.org/node/984.


These errors typically arise in loanwords, where the correct Cyrillic spelling is often impossible to predict from the Latin orthography. Accuracy regarding these issues could likely be improved by having a larger and more representative corpus of Crimean Tatar in Cyrillic with which to train the n-gram models, or by attempting to model the loanword system.

6 Future work

The performance of the n-gram model could be improved by modelling the predictability of the orthography in n-grams and by using a sliding window to filter out unlikely concatenations of common n-grams.

ducer forms part of a machine translation sys-tem from Crimean Tatar to Turkish being devel-oped in the Apertium platform. There is alsothe prospect of applying it to dependency pars-ing for Crimean Tatar, and there have been somepreliminary experiments in this direction (Ageevaand Tyers, 2016). We would also like to applythe approach for dealing with multiple scripts toother Turkic languages, such as Uzbek, Kazakh, orKarakalpak, where more than one widely-used nor-mative orthography is in use. An additional advan-tage of our approach is that when orthographic sys-tems are replaced, as is currently occurring in Kaza-khstan for Kazakh, there is no need to completelyrewrite an existing mature transducer; instead, asupplemental transliteration transducer can be con-structed.

7 Concluding remarks

The primary contributions of this paper are a wide-coverage morphological description of Crimean Tatar able to analyse and generate both Cyrillic and Latin orthographies, and a general approach to building biscriptual transducers.

Acknowledgements

This work was made possible in part by Google's Summer of Code program, the Swarthmore College Research Fund, and the involvement of the Apertium open source community. We also very much appreciate the thoughtful, constructive comments from several anonymous reviewers.

References

Ekaterina Ageeva and Francis M. Tyers. 2016. Combined morphological and syntactic disambiguation for cross-lingual dependency parsing. In Proceedings of TurkLang 2016, Bishkek, Kyrgyzstan.

Kemal Altıntaş and İlyas Çiçekli. 2001. A morphological analyser for Crimean Tatar. In Turkish Symposium on Artificial Intelligence and Neural Networks (TAINN2001).

Lars Johanson and Éva Á. Csató. 1998. The Turkic Languages. Routledge.

Lauri Karttunen. 1993. The Last Phonological Rule: Reflections on Constraints and Derivations, chapter Finite-state constraints. University of Chicago Press.

Darya Kavitskaya. 2010. Crimean Tatar. LINCOM Europa.

Krister Lindén, Miikka Silfverberg, Erik Axelson, Sam Hardwick, and Tommi Pirinen. 2011. HFST—Framework for Compiling and Applying Morphologies, volume 100 of Communications in Computer and Information Science. Springer.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Chris Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Dan Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Language Resources and Evaluation Conference (LREC'16).

Gary F. Simons and Charles D. Fennig, editors. 2018. Ethnologue: Languages of the World, twenty-first edition. SIL International, Dallas, Texas. Crimean Tatar: https://www.ethnologue.com/language/crh.

Jonathan North Washington, Aziyana Bayyr-ool, Aelita Salchak, and Francis M. Tyers. 2016. Development of a finite-state model for morphological processing of Tuvan. Родной Язык, 1(4):156–187.

Jonathan North Washington, Ilnar Salimzyanov, and Francis M. Tyers. 2014. Finite-state morphological transducers for three Kypchak languages. In Proceedings of the 9th Conference on Language Resources and Evaluation, LREC 2014.


Improving Low-Resource Morphological Learning with Intermediate Forms from Finite State Transducers

Sarah Moeller, Ghazaleh Kazeminejad, Andrew Cowell and Mans Hulden
Department of Linguistics
University of Colorado
[email protected]

Abstract

Neural encoder-decoder models are usually applied to morphology learning as an end-to-end process without considering the underlying phonological representations that linguists posit as abstract forms before morphophonological rules are applied. Finite State Transducers for morphology, on the other hand, are developed to contain these underlying forms as an intermediate representation. This paper shows that training a bidirectional two-step encoder-decoder model of Arapaho verbs to learn two separate mappings, one between tags and abstract morphemes and one between morphemes and surface allomorphs, improves results when training data is limited to 10,000 to 30,000 examples of inflected word forms.

1 Introduction

A morphological analyzer is a prerequisite for many NLP tasks. A successful morphological analyzer supports applications such as speech recognition and machine translation that could provide speakers of low-resource languages access to online dictionaries or tools similar to Siri or Google Translate, and might support and accelerate language revitalization efforts. This is even more crucial for morphologically complex languages such as Arapaho, an Algonquian language indigenous to the western USA. In polysynthetic languages such as Arapaho, inflected verbal forms are often semantically equivalent to whole sentences in morphologically simpler languages. A standard linguistic model of morphophonology holds that multiple morphemes are concatenated together and then phonological rules are applied to produce the inflected forms. The operation of phonological rules can reshape the string of fixed morphemes considerably, making it difficult for learners, whether human or machine, to recreate correct forms (generation) from the morpheme sequence or to analyze the reshaped inflected forms into their individual morphemes (parsing).

In this paper we describe an experiment in training a neural encoder-decoder model to replicate the bidirectional behavior of an existing finite-state morphological analyzer for the Arapaho verb (Kazeminejad et al., 2017). When a language is low-resource, natural language processing needs strategies that achieve usable results with less data. We attempt to replicate a low-resource context by using a limited number of training examples. We evaluate the feasibility of learning abstract intermediate forms to achieve better results on various training set sizes. While common wisdom regarding neural models has it that, given enough data (Graves and Jaitly, 2014), end-to-end training is usually preferable to pipelined models, an argument can be made that morphology is an exception to this: learning two regular mappings separately may be easier than learning a single complex one. In Liu et al. (2018), addressing a related task, noticeably better results were reached for German, Finnish, and Russian when a neural system was first tasked to learn morphosyntactic tags than when it was tasked to produce an inflected form directly from uninflected forms and context. These three languages are morphologically complex or unpredictable, whereas only marginally better results were achieved for the less complex languages.

2 Arapaho Verbs

Arapaho is a member of the Algonquian (and larger Algic) language family; it is an agglutinating, polysynthetic language with free word order (Cowell and Moss Sr, 2008). The language has a very complex verbal inflection system, with a number of typologically uncommon elements. A given verb stem is used either with animate or inanimate subjects for intransitive verbs (tei'eihi- ‘be strong.animate’ vs. tei'oo- ‘be strong.inanimate’), and with animate or inanimate objects for transitive verbs (noohow- ‘see s.o.’ vs. noohoot- ‘see s.t.’). For each of these categories, the pronominal affixes/inflections vary in form. For example, 2SG with an intransitive, animate subject is /-n/, while for a transitive, inanimate object it is /-ow/ (nih-tei'eihi-n ‘you were strong’ vs. nih-noohoot-ow ‘you saw it’).

All stem types can occur in four different verbal orders, whose function is primarily modal. These verbal orders each use different pronominal affixes/inflections as well. Thus, with four different verb stem types and four different verbal orders, there are a total of 16 different potential inflectional paradigms for any verbal root, though there is some overlap in the paradigms, and not all stem forms are possible for all roots.

Arapaho also has a proximate/obviative system, which designates pragmatically more- and less-prominent participants. “Direction-of-action” markers included in inflections do not correspond to true pronominal affixes. Thus nih-noohow-oot ‘more important 3SG saw less important 3SG/PL’ vs. nih-noohob-eit ‘less important 3SG/PL saw more important 3SG’. The elements -oo- and -ei- specify direction of action, not specific persons or numbers of participants.

Arapaho has both progressive and regressive vowel harmony, operating on /i/ and /e/ respectively. This results in alternations both in the inflections themselves and in the final elements of stems, such as noohow-un ‘see him!’ vs. niiteheib-in ‘help him!’, or nih-ni'eeneb-e3en ‘I liked you’ vs. nih-ni'eenow-oot ‘he liked her’.

The Arapaho verb, then, is subject to complicated morphophonological processes. For example, the underlying form of the word ‘we see you’ concatenates the transitive verb stem with animate object (TA) noohow ‘see’ and the ‘1PL.EXCL.SUBJ.2SG.OBJ’ suffix -een. This underlying form undergoes significant transformation after morphophonological rewrite rules are applied. An initial change (IC) epenthesizes -en- before the first vowel in the verb stem because it is a long vowel and because the verb is affirmative present. Then vowel harmony is at work, changing n-en-oohow-een to n-on-oohow-een. Finally a consonant mutation rule changes w to b, producing the surface form nonoohobeen (cf. Figure 1).
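The effect of the ordered rewrite rules can be imitated with a cascade of regular-expression substitutions. The sketch below is our illustration and hard-codes just the three rules needed for this single derivation; the real analyzer states them as general rules compiled into composed transducers over the whole lexicon.

import re

# Ordered rewrite rules for the running example 'we see you'.
RULES = [
    (r"^n(oohow)", r"nen\1"),    # initial change (IC): epenthesize -en-
    (r"^nen(oohow)", r"non\1"),  # vowel harmony: e assimilates before oo
    (r"w(een)$", r"b\1"),        # consonant mutation: w -> b before -een
]

def derive(intermediate_form: str) -> str:
    """Apply the rewrite rules in order, as an FST composition would."""
    form = intermediate_form
    for pattern, replacement in RULES:
        form = re.sub(pattern, replacement, form)
    return form

assert derive("noohoween") == "nonoohobeen"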

3 Finite State Model

One of the clear successes in computational modeling of linguistic patterns has been finite-state transducer (FST) models for morphological analysis and generation (Koskenniemi, 1983; Beesley and Karttunen, 2003; Hulden, 2009; Linden et al., 2009). An FST is bidirectional, able both to parse inflected word forms and to generate all possible word forms for a given stem (Beesley and Karttunen, 2003). Given enough linguistic expertise and time investment, FSTs provide the capability to analyze any well-formed word in a language.

The Arapaho FST model used in this paper was constructed with the foma finite-state toolkit (Hulden, 2009). It used 18,559 verb stems taken from around 91,000 lines of natural discourse in a large transcribed and annotated spoken corpus of Arapaho, parts of which are publicly available in the Endangered Languages Archive (ELAR).¹ All possible basic inflections occur in the corpus. The FST produces over 450,000 inflected forms from the stems.

The FST is constructed in two parts, the first being a specification of the lexicon and morphotactics using the finite-state lexicon compiler (lexc), a high-level declarative language for effective lexicon creation, where concatenative morphological rules and morphological irregularities are addressed (Karttunen, 1993). The first part produces intermediate, abstract “underlying” forms. These forms concatenate the appropriate morphemes from the lexicon in the correct order (e.g. noohoween in Figure 1), but are not well-formed words in the language.

The second part of the FST implements the morphophonological and phonological rules of the language using “rewrite rules”. These rules apply the appropriate phonological changes to the intermediate forms in specified contexts. Thus, in generation, the inflected word is not merely a bundle of morphemes, but the completely correct word form in accord with the morphophonological and phonological rules of the language. By composing, in a particular order (specified in the grammar of the language), the FSTs resulting from these rewrite rules with the parsed forms, the result is a single FST able to both generate and parse, as shown in Figure 1.

¹ https://elar.soas.ac.uk/Collection/MPI189644


underlying representation with tags (parse):
    [VERB][TA][ANIMATE-OBJECT][AFFIRMATIVE][PRESENT][IC]noohow[1PL-EXCL-SUBJ][2SG-OBJ]
        ↓ lexc transducer
intermediate representation:
    noohow-een
        ↓ morphophonological transducer
surface representation:
    nonoohob-een

Figure 1: An example of a parsed form with verb stem and morphosyntactic tags (top) and inflected surface form (bottom) for the Arapaho FST. The intermediate underlying phonological form (middle) is accessible to the FST before/after applying morphophonological alternations.

4 Training the LSTM

Although an extensive finite-state morphological analyzer is an extremely useful resource, neural models are much better able to analyze unseen forms than finite-state machines are. However, neural models are hampered in low-resource contexts by their data greediness. In order to see whether this limitation could be addressed, we simulated training a neural model in low-resource contexts using output from the Arapaho FST. Since the currently strongest performing models for morphological inflection (Cotterell et al., 2017; Kann and Schutze, 2016; Makarov et al., 2017) use an LSTM-based sequence-to-sequence (seq2seq) model (Sutskever et al., 2014), we follow this design in our work. We implement the seq2seq model with OpenNMT's (Klein et al., 2018) default parameters of 2 layers for both the encoder and decoder, a hidden size of 500 for the recurrent unit, and a maximum batch size of 64.

Training corpora of various sizes are created by randomly selecting examples of inflected word forms and their corresponding intermediate and parsed forms from the bidirectional output of the Arapaho FST. This results in triplets like the one in Figure 1. The triplets are arranged into three pairs: inflected “surface” forms (SF) & intermediate forms (IF), IF & parsed forms (PF), and SF & PF. Re-using the pairs for both parsing and generation gives six data sets. For simplicity's sake, since the primary aim is to compare the two strategies' performance and not to measure accuracy, forms with ambiguous inflected forms, parses, or intermediate forms were filtered out. Other experiments (Moeller et al., 2018) indicate that pre-processing the data to account for ambiguous forms would not greatly affect accuracy.

We treat the intermediate strategy of parsing as a translation task of input character sequences from the fully-inflected surface forms to an output of character sequences of the intermediate forms, and then from the intermediate forms to a sequence of morphosyntactic tags plus the character sequences of the verbal root. Generation follows the same model in the opposite direction.

Figure 2: An example from the training/test sets. In parsing, surface forms (SF) predict intermediate forms (IF). The output trains another encoder-decoder to predict parsed forms (PF). Generation follows the same steps but proceeds from the PF instead.

The selected data is divided roughly in half. The first half serves as training and development data and the second half as testing data in the first step of the intermediate training strategy (SF⇔IF or PF⇔IF). In order to compare the two training strategies, the output of this intermediate step trains and tests the second step of the intermediate strategy. The original second half also serves to train and test the direct strategy (SF-PF or PF-SF). Symbol prediction degrades at the end of a sequence. Best results are achieved when each character/tag sequence is doubled on its line for training and testing (Moeller et al., 2018) and trimmed for evaluation. So, for example, nonoohobeen becomes nonoohobeennonoohobeen during training and testing, but predicted symbols that exceed the length of the original string are deleted and the first half of the predicted string is evaluated against the original string.
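Concretely, the doubling-and-trimming scheme might look as follows (a sketch; the function names are ours):

def double(tokens):
    """Repeat a character/tag sequence on its line, which stabilizes
    symbol prediction near the end of the sequence."""
    return tokens + tokens

def trim(predicted, original_length):
    """Drop predicted symbols beyond the doubled length and keep only the
    first half for evaluation against the original sequence."""
    return predicted[:original_length]

surface = list("nonoohobeen")
doubled = double(surface)               # used for training and testing
assert trim(doubled, len(surface)) == surface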

5 Experiment and Results

We compare two strategies to train a neural model to generate inflected verbs from morphosyntactic tags with a verb stem, or to parse inflected verb forms. First, we train the neural model to learn correct output forms directly from the parsed or inflected input. Second, we add an intermediate step where the model first learns the mapping to intermediate forms and, from there, the mapping to the correct parsed or inflected form. We measure the final accuracy score and the average Levenshtein distance and compare the performance of the two strategies in generation and in parsing. Accuracy is measured as the fraction of correctly generated/parsed forms in the output compared to complete gold inflected or parsed forms.
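Both metrics are standard; the following minimal sketch shows one way the evaluation loop over (predicted, gold) pairs might be implemented (our code, not the authors'):

def levenshtein(a: str, b: str) -> int:
    """Edit distance by the classic dynamic program, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def evaluate(pairs):
    """Return (accuracy, mean Levenshtein distance) over prediction pairs."""
    exact = sum(pred == gold for pred, gold in pairs)
    mean_dist = sum(levenshtein(p, g) for p, g in pairs) / len(pairs)
    return exact / len(pairs), mean_dist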

5.1 Generation

[Plot omitted: accuracy (0–100) against training set size (15,000–35,000 examples) for the direct (PF-SF) and intermediate (PF-IF-SF) strategies.]

Figure 3: Generation - accuracy scores

[Plot omitted: average Levenshtein distance (0–10) against training set size (15,000–35,000 examples) for the PF-SF and PF-IF-SF strategies.]

Figure 4: Generation - average Levenshtein distances

We trained a bidirectional LSTM encoder-decoder with attention (Bahdanau et al., 2015) to generate Arapaho verbs using five training sets with approximately 14.5, 18, 27, 31.5, and 36 thousand examples. The direct strategy trains on the morphosyntactic tags and verb stem. Each tag occurs in the same order as its corresponding morpheme appears in the intermediate form. Only “direction-of-action” tags/morphemes come after the stem.

The accuracy scores in Figure 3 and the Levenshtein distance measures in Figure 4 show that the intermediate strategy performs better than the direct strategy in low-resource settings. Starting at about 14,500 training examples, where the direct strategy produces barely any inflected forms (SF) correctly, the intermediate strategy achieves nearly 69% accuracy. As the training size approaches 36,000, the advantage of the intermediate step is lost. Indeed, the intermediate strategy begins to perform worse while the direct strategy continues to improve. The intermediate strategy seems to peak at 30,000 training examples.

5.2 Parsing

[Plot omitted: accuracy (20–80) against training set size (10,000–40,000 examples) for the direct (SF-PF) and intermediate (SF-IF-PF) strategies.]

Figure 5: Parsing - accuracy scores

[Plot omitted: average Levenshtein distance (0–15) against training set size (10,000–40,000 examples) for the SF-PF and SF-IF-PF strategies.]

Figure 6: Parsing - average Levenshtein distances

The parsing trend is less clear than that of morphological generation. We compare seven training sets of approximately 10, 14.5, 18, 27, 31.5, 36, and 45 thousand examples. As in morphological generation, at the lowest data settings the intermediate learning strategy is preferable to the direct strategy, though with a less dramatic performance difference. The accuracy scores in Figure 5 show that with 14,000 training examples, the intermediate strategy performs only about 10 points higher. The advantage of the intermediate strategy is less noticeable in parsing, nor does it decrease as quickly. With 36,000 examples barely one point separates the two strategies, and the intermediate strategy performs slightly better. The intermediate strategy's performance does not begin to decline until 45,000 examples. The average Levenshtein distances in Figure 6, however, show that the direct strategy improves more consistently, though it is still only slightly better as training size increases.

5.3 Discussion

An end-to-end neural model demands quite a bit of data in order to learn patterns. It appears that, for languages with complicated morphophonological alternations, if an intermediate model is trained on a simple concatenation of morphemes, these disadvantages may be counterbalanced. The morpheme substrings in the intermediate forms correspond predictably to morphosyntactic tags in the parsed form. Subsequent alternations are less predictable. This may explain the intermediate strategy's difference in performance between parsing and generation. A pipelined approach with intermediate training is generally not preferable to end-to-end training. The intermediate step inevitably introduces errors into the training of the second neural model. The intermediate strategy's performance degradation beyond 35 or 40 thousand examples might indicate that the errors become too dominant.

Comparing our results to the recent CoNLL-SIGMORPHON shared tasks (Cotterell et al., 2016, 2017, 2018), it is surprising that the Arapaho direct generation results at 10,000 examples are so low. However, polysynthetic languages are rare in the shared task (only one, Navajo, was available in 2016 and 2017), making it difficult to compare results on such complicated and varied morphologies. In addition, our data included phenomena which could be considered derivational, such as verbal stems signaling animacy and modality (cf. Sect. 2). Also, since the data was selected randomly from the full FST output, the neural model may simply not have seen enough repeated stems in the low settings. Our results are not very good at the lowest settings but, in future, more in-depth pre-processing and filtering of the data could improve overall performance.

The varying results from morphological parsing shown in Figures 5 and 6 demonstrate the preliminary nature of this study. The trend between the two strategies seems indicative, but several more comparisons should be conducted on similar languages. We hope to conduct a similar study on other low-resource languages for which an FST exists in order to determine whether the trend reappears.

6 Conclusion

A sweet spot exists between 10,000 and 30,000 randomly selected training examples of Arapaho verbs where better results are achieved in morphological generation by first training an encoder-decoder to produce the intermediate forms from an FST than by learning the inflected or parsed form directly. For generation, the intermediate strategy achieves the strongest results around 30,000 examples. The results of morphological parsing vary, with the intermediate strategy outperforming the direct strategy at very low settings but achieving similar results with 18,000 and 36,000 training examples. Overall, the intermediate strategy appears to produce reliably better results in low-resource settings than the direct strategy.

Acknowledgements

NVIDIA Corp. donated the Titan Xp GPU used in this research.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.

Kenneth R. Beesley and Lauri Karttunen. 2003. Finite State Morphology. CSLI Publications, Stanford, CA.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sebastian Mielke, Garrett Nicolai, Miikka Silfverberg, David Yarowsky, Jason Eisner, and Mans Hulden. 2018. The CoNLL–SIGMORPHON 2018 shared task: Universal morphological reinflection. In Proceedings of the CoNLL SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 1–27, Brussels. Association for Computational Linguistics.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2017. CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 1–30. Association for Computational Linguistics.

Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task—morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22, Berlin, Germany. Association for Computational Linguistics.

Andrew Cowell and Alonzo Moss Sr. 2008. The Arapaho Language. University Press of Colorado.

Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764–1772.

Mans Hulden. 2009. Foma: a finite-state compiler and library. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 29–32. Association for Computational Linguistics.

Katharina Kann and Hinrich Schütze. 2016. Single-model encoder-decoder with explicit morphological representation for reinflection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 555–560, Berlin, Germany. Association for Computational Linguistics.

Lauri Karttunen. 1993. Finite-state lexicon compiler. Xerox Corporation, Palo Alto Research Center.

Ghazaleh Kazeminejad, Andrew Cowell, and Mans Hulden. 2017. Creating lexical resources for polysynthetic languages—the case of Arapaho. In Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 10–18, Honolulu. Association for Computational Linguistics.

Guillaume Klein, Yoon Kim, Yuntian Deng, Vincent Nguyen, Jean Senellart, and Alexander Rush. 2018. OpenNMT: Neural machine translation toolkit. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pages 177–184. Association for Machine Translation in the Americas.

Kimmo Koskenniemi. 1983. Two-level morphology: A general computational model for word-form recognition and production. Publication 11, University of Helsinki, Department of General Linguistics, Helsinki.

Krister Lindén, Miikka Silfverberg, and Tommi Pirinen. 2009. HFST tools for morphology—an efficient open-source package for construction of morphological analyzers. In International Workshop on Systems and Frameworks for Computational Morphology, pages 28–47. Springer.

Ling Liu, Ilamvazhuthy Subbiah, Adam Wiemerslage, Jonathan Lilley, and Sarah Moeller. 2018. Morphological reinflection in context: CU Boulder's submission to CoNLL-SIGMORPHON 2018 shared task. In Proceedings of the CoNLL SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection, pages 86–92, Brussels. Association for Computational Linguistics.

Peter Makarov, Tatiana Ruzsics, and Simon Clematide. 2017. Align and copy: UZH at SIGMORPHON 2017 shared task for morphological reinflection. In Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, pages 49–57, Vancouver. Association for Computational Linguistics.

Sarah Moeller, Ghazaleh Kazeminejad, Andrew Cowell, and Mans Hulden. 2018. A neural morphological analyzer for Arapaho verbs learned from a finite state transducer. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages, pages 12–20. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112.


Bootstrapping a Neural Morphological Analyzer for St. Lawrence Island Yupik from a Finite-State Transducer

Lane Schwartz
University of Illinois at Urbana-Champaign
[email protected]

Emily Chen
University of Illinois at Urbana-Champaign
[email protected]

Benjamin Hunt
George Mason University
[email protected]

Sylvia L.R. Schreiner
George Mason University
[email protected]

Abstract

Morphological analysis is a critical enabling technology for polysynthetic languages. We present a neural morphological analyzer for case-inflected nouns in St. Lawrence Island Yupik, an endangered polysynthetic language in the Inuit-Yupik language family, treating morphological analysis as a recurrent neural sequence-to-sequence task. By utilizing an existing finite-state morphological analyzer to create training data, we improve analysis coverage on attested Yupik word types from approximately 75% for the existing finite-state analyzer to 100% for the neural analyzer. At the same time, we achieve a substantially higher level of accuracy on a held-out testing set, from 78.9% accuracy for the finite-state analyzer to 92.2% accuracy for our neural analyzer.

1 Introduction

St. Lawrence Island Yupik, henceforth Yupik, is an endangered polysynthetic language spoken on St. Lawrence Island, Alaska and the Chukotka Peninsula of Russia. Members of the Yupik community on St. Lawrence Island have expressed interest in language revitalization and conservation.

Recent work by Chen and Schwartz (2018) resulted in a finite-state morphological analyzer for Yupik implemented in foma (Hulden, 2009). That analyzer implements the grammatical and morphophonological rules documented in A Practical Grammar of the St. Lawrence Island / Siberian Yupik Eskimo Language (Jacobson, 2001).

In this work, we test the coverage of the finite-state analyzer against a corpus of digitized Yupik texts and find that the analyzer fails to return any analysis for approximately 25% of word types (see §2 and Table 1). We present a higher-coverage neural morphological analyzer for case-inflected Yupik nouns that involve no derivational morphology, using the previously-developed finite-state analyzer to generate large amounts of labeled training data (§3). We evaluate the performance of the finite-state and neural analyzers, and find that the neural analyzer results in higher coverage and higher accuracy (§5), even when the finite-state analyzer is augmented with a guessing module to hypothesize analyses for out-of-vocabulary words (§4). We thus find that a robust high-accuracy morphological analyzer can be successfully bootstrapped from an existing lower-coverage finite-state morphological analyzer (§6), a result which has implications for the development of language technologies for Yupik and other morphologically-rich languages.

2 Evaluation of the FST Analyzer

The finite-state morphological analyzer of Chen and Schwartz (2018) implements the grammatical and morphophonological rules documented in A Practical Grammar of the St. Lawrence Island / Siberian Yupik Eskimo Language (Jacobson, 2001) using the foma finite-state toolkit (Hulden, 2009).

In order to evaluate the percentage of attested Yupik word forms for which the finite-state analyzer produces any analysis, we began by digitizing several hundred Yupik sentences presented in Jacobson (2001) as examples to be translated by the reader. We next assembled, digitized, and manually validated seven texts that each consist of a collection of Yupik stories along with corresponding English translations. The texts include four anthologies of Yupik stories, legends, and folk tales, along with three leveled elementary primers prepared by the Bering Strait School District in the 1990s (Apassingok et al., 1993, 1994, 1995). Of the four anthologies, three comprise a trilogy known as The Lore of St. Lawrence Island


Text   % Coverage (Tokens)   % Coverage (Types)   Corpus size (in words)
Ref    98.24                 97.87                795
SLI1   79.10                 70.62                6859
SLI2   77.14                 68.87                11,926
SLI3   76.98                 68.32                12,982
Ungi   84.08                 73.45                15,766
Lvl1   76.64                 70.86                4357
Lvl2   75.42                 72.62                5358
Lvl3   77.71                 75.19                5731

Table 1: For each Yupik text, the percentage of tokens and types for which the Yupik finite-state analyzer of Chen and Schwartz (2018) returns an analysis, along with the total number of tokens per text. Ref refers to Yupik examples taken from the Jacobson (2001) reference grammar, SLI1–SLI3 refer to the Lore of St. Lawrence Island, volumes 1–3 (Apassingok et al., 1985, 1987, 1989), Ungi is an abbreviation for Ungipaghaghlanga (Koonooka, 2003), and Lvl1–Lvl3 refer to the elementary Yupik primers (Apassingok et al., 1993, 1994, 1995).

(Apassingok et al., 1985, 1987, 1989), while the last is a stand-alone text, Ungipaghaghlanga (Koonooka, 2003; Menovshchikov, 1988). Together, these texts represent the largest known collection of written Yupik.

After digitizing each text, we analyzed each Yupik word in that text using the finite-state morphological analyzer. We then calculated the percentage of tokens from each text for which the finite-state analyzer produced at least one analysis. We call this number coverage, and report this result for each text in Table 1. The mean coverage over the entire set of texts was 77.56%. The neural morphological analyzer described in the subsequent sections was developed in large part to provide morphological analyses for the remaining 22.44% of heretofore unanalyzed Yupik tokens.
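Coverage in this sense is simple to compute. The sketch below is ours; analyze stands in for a hypothetical wrapper around the foma transducer that returns a (possibly empty) list of analyses.

from collections import Counter

def coverage(tokens, analyze):
    """Percent of tokens (and of types) receiving at least one analysis."""
    type_counts = Counter(tokens)
    analyzed_types = {t for t in type_counts if analyze(t)}
    token_cov = 100 * sum(type_counts[t] for t in analyzed_types) / len(tokens)
    type_cov = 100 * len(analyzed_types) / len(type_counts)
    return token_cov, type_cov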

3 Yupik Morphological Analysis as Machine Translation

The task of morphological analysis can be regarded as a machine translation (sequence-to-sequence) problem, where an input sequence that consists of characters or graphemes in the source language (surface form) is mapped to an output sequence that consists of characters or graphemes and inflectional tags (underlying form). For example, in English, the sequence of characters representing the surface word form foxes can be transformed into a sequence of characters representing the root word fox and the inflectional tag indicating plurality:

f o x e s
    ↓
f o x [PL]

In contrast to English, Yupik is highly productive with respect to derivational and inflectional morphology.¹ See §3.1.1 for noun inflection tags.

(1) kaviighet
    kaviigh- -∼sf-w:(e)t
    fox-     -ABS.UNPD.PL
    ‘foxes’

But in much the same way, (1) can be rewritten as a translation process from an input sequence of graphemes that represent the surface form to an output sequence of graphemes and inflectional tags that represent the underlying form:

k a v i i gh e t
    ↓
k a v i i gh [ABS] [UNPD] [PL]
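Written out as space-separated parallel lines, such pairs form the training files an NMT toolkit consumes. A minimal sketch (the helper and file names are ours):

def write_parallel(pairs, src_path="train.src", trg_path="train.trg"):
    """Write (surface tokens, analysis tokens) pairs as aligned lines."""
    with open(src_path, "w", encoding="utf-8") as src, \
         open(trg_path, "w", encoding="utf-8") as trg:
        for surface, analysis in pairs:
            src.write(" ".join(surface) + "\n")
            trg.write(" ".join(analysis) + "\n")

write_parallel([
    (["k", "a", "v", "i", "i", "gh", "e", "t"],
     ["k", "a", "v", "i", "i", "gh", "[ABS]", "[UNPD]", "[PL]"]),
])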

3.1 Generating Data from an FST

Very little Yupik data has previously been manually annotated in the form of interlinear glosses. On the other hand, the finite-state morphological analyzer of Chen and Schwartz (2018) is capable of generating Yupik surface forms from provided underlying forms, and vice versa. In the following sections, bracketed items [..] introduce the inflectional tags that are used in the underlying forms of the finite-state analyzer.

3.1.1 Basic Yupik Nouns

Yupik nouns inflect for one of seven grammatical cases:

1. ablative-modalis [ABL MOD]
2. absolutive [ABS]
3. equalis [EQU]
4. localis [LOC]
5. terminalis [TER]
6. vialis [VIA]
7. relative [REL]

¹ Each derivational and inflectional suffix is associated with a series of morphophonological rules. Each rule is represented by a unique symbol such as – or ∼sf as introduced in Jacobson (2001). This convention is used in the Yupik grammar, in the Yupik-English dictionary (Badten et al., 2008), and by Chen and Schwartz (2018). We therefore follow this convention, employing these symbols in our glosses here. See Table 2 for more details.


Symbol   Description
∼        Drops e in penultimate (semi-final) position or e in root-final position and hops it. e-Hopping is the process by which vowels i, a, or u in the first syllable of the root are lengthened as a result of dropping semi-final or final e, so termed because it is as if the e has “hopped” into the first syllable and assimilated. e-Hopping will not occur if doing so results in a three-consonant cluster within the word or a two-consonant cluster at the beginning (Jacobson, 2001).
∼f       Drops final e and hops it
∼sf      Drops semi-final e and hops it
-w       Drops weak final consonants, that is, gh that is not marked with an *. Strong gh is denoted gh*
:        Drops uvulars that appear between single vowels
–        Drops final consonants
––       Drops final consonants and preceding vowel
@        Indicates some degree of modification to root-final te, the degree of which is dependent on the suffix
+        Indicates no morphophonology occurs during affixation. This symbol is implicitly assumed if no other symbols are present.

Table 2: List of documented morphophonological rules in Yupik and their lexicalized symbols. For more details see Jacobson (2001) and Badten et al. (2008).

Nouns then either remain unpossessed [UNPD], inflecting for number (singular [SG], plural [PL], or dual [DU]), or inflect as possessed nouns. Possessed nouns are marked for number, and for the person and number of the possessor. For example, [1SGPOSS][SGPOSD] marks a possessed singular noun with a first person singular possessor.

The Badten et al. (2008) Yupik-English dictionary lists 3873 noun roots, and the Jacobson (2001) reference grammar lists 273 nominal inflectional suffixes. We deterministically generated data by exhaustively pairing every Yupik noun root with every inflectional suffix (ignoring any semantic infelicity that may result).

Root × Case × Poss × Posd = Total
3873 × 7 × 1 × 3 = 81,333
3873 × 7 × 12 × 3 = 975,996
Total: 1,057,329

Table 3: Extracted training data. The first row counts the total number of unpossessed nouns, which are marked for number: [SG], [PL], [DU]. The second row counts the total number of possessed nouns, which are marked for number and also for 12 differing types of possessors, which themselves are marked for person, [1-4], and number, [SG], [PL], [DU].

As shown in Table 3, the underlying forms that result from these pairings map to just over 1 million inflected Yupik word forms or surface forms. Note that these word forms represent morphologically licit forms, but not all are attested. Even so, we did not exclude unattested forms nor weight them according to plausibility, since we lack sufficient documentation to distinguish the valid forms from the semantically illicit ones. This parallel dataset of inflected Yupik nouns and their underlying forms represents our corpus.
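The pairing is a plain cross product. The sketch below uses toy stand-ins; the real lists hold 3873 roots and 273 inflection tag strings, and generate is a hypothetical wrapper around the FST's generation direction.

from itertools import product

# Toy stand-ins; the full lists give 3873 x 273 = 1,057,329 pairings.
noun_roots = ["kaviigh", "qikmigh"]
inflections = ["[ABS][UNPD][PL]", "[LOC][UNPD][SG]"]

def build_corpus(roots, tags, generate):
    """Yield (underlying, surface) pairs for every root-inflection pair."""
    for root, tag in product(roots, tags):
        underlying = root + "[N]" + tag
        for surface in generate(underlying):
            yield underlying, surface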

3.1.2 Identified Flaw in Existing Yupik FST

While generating data using the finite-state analyzer, we observed a minor bug. Specifically, the finite-state analyzer fails to account for the allomorphy that is triggered on some noun roots when they are inflected for a subset of the 3rd person possessor forms (root-final -e surfaces as -a).

(2) neqangit
    neqe- -∼:(ng)it
    food- -ABS.PL.3PLPOSS
    ‘their foods’

As shown in (2), the correct surface form for neqe[ABS][PL][3PLPOSS] is neqangit, but the analyzer incorrectly generates neqngit instead. Due to time constraints and the relatively small estimated impact, we did not modify the analyzer to correct this bug. Though this impacts the training data, having previously evaluated the analyzer, we do not believe this error to be egregious enough to compromise the generated dataset.

3.1.3 Yupik Verbs and Derivational Suffixes

While our training set for this work is the set of inflected Yupik nouns described in §3.1.1, it is important to note that this process could in principle be used to generate a much larger training set.


Sequence type     Tokenize by character                 Tokenize by grapheme
surface form      q i k m i q                           q i k mm i q
underlying form   q i k m i gh [N] [ABS] [UNPD] [SG]    q i k mm i gh [N] [ABS] [UNPD] [SG]

Table 4: Contrasts the two tokenization methods introduced in §3.2.1 (tokenization by character and tokenization by orthographically transparent Yupik grapheme) on the surface form qikmiq (dog) and its underlying form qikmigh[N][ABS][UNPD][SG].

#   Possible Nouns   Possible Verbs   Both
0   1.06 × 10⁶       8.80 × 10⁶       9.86 × 10⁶
1   1.37 × 10⁸       2.04 × 10⁹       2.18 × 10⁹
2   2.22 × 10¹⁰      4.19 × 10¹¹      4.41 × 10¹¹
3   4.03 × 10¹²      8.31 × 10¹³      8.72 × 10¹³
4   7.66 × 10¹⁴      1.63 × 10¹⁶      1.71 × 10¹⁶
5   1.48 × 10¹⁷      3.18 × 10¹⁸      3.33 × 10¹⁸
6   2.88 × 10¹⁹      6.21 × 10²⁰      6.50 × 10²⁰
7   5.61 × 10²¹      1.21 × 10²³      1.27 × 10²³

Table 5: Number of morphotactically possible Yupik word forms formed using 0–7 derivational suffixes.

The Badten et al. (2008) Yupik-English dictionary and the Jacobson (2001) Yupik grammar also list 3762 verb roots along with 180 intransitive and 2160 transitive verbal inflectional morphemes. If one were to naively assume that every Yupik verb can be either transitive or intransitive,² another 8.8 million training examples consisting of Yupik verbs could be generated.

Yupik also exhibits extensive derivational morphology. The dictionary lists 89 derivational suffixes that can each attach to a noun root and yield another noun, 58 derivational suffixes that can each attach to a noun root and yield a verb, 172 derivational suffixes that can each attach to a verb root and yield another verb, and 42 derivational suffixes that can each attach to a verb root and yield a noun. Yupik words containing up to seven derivational morphemes have been attested in the literature (de Reuse, 1994). By considering all possible Yupik nouns and verbs with up to seven derivational morphemes, well over 1.2 × 10²³ inflected Yupik word forms could be generated, as shown in Table 5. As before, many of these forms, while morphologically valid, would be syntactically or semantically illicit.
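The figures in Table 5 follow from iterating the suffix counts as a two-state noun/verb transition system. The sketch below is our reconstruction from the counts quoted in this section; running it reproduces the table's orders of magnitude (1.06 × 10⁶ inflected nouns with no derivation, 1.27 × 10²³ forms overall at seven suffixes).

# Roots and suffix counts quoted above.
nouns, verbs = 3873, 3762               # underived roots
N2N, N2V, V2V, V2N = 89, 58, 172, 42    # derivational suffixes by type
NOUN_INFL, VERB_INFL = 273, 180 + 2160  # nominal vs. verbal inflections

for k in range(8):                      # 0-7 derivational suffixes
    print(f"{k}: {nouns * NOUN_INFL:.2e} nouns, "
          f"{verbs * VERB_INFL:.2e} verbs, "
          f"{nouns * NOUN_INFL + verbs * VERB_INFL:.2e} both")
    # Adding one more derivational suffix to every current stem:
    nouns, verbs = nouns * N2N + verbs * V2N, nouns * N2V + verbs * V2V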

3.2 Neural Machine Translation

In this work, we made use of Marian (Junczys-Dowmunt et al., 2018), an open-source neural machine translation framework that supports bidirectional recurrent encoder-decoder models with attention (Schuster and Paliwal, 1997; Bahdanau et al., 2014). In our experiments using Marian, we trained neural networks capable of translating from input sequences of characters or graphemes (representing Yupik words) to output sequences of characters or graphemes plus inflectional tags (representing an underlying Yupik form).

² The Yupik-English dictionary does not annotate verb roots with valence information.

3.2.1 Data

We began by preprocessing the data described in §3.1.1 by tokenizing each Yupik surface form, either by characters or by orthographically transparent, redoubled graphemes. An example can be seen in the transformation of the word kaviighet at the beginning of §3, where each character and inflectional tag were separated by a space. When tokenizing by redoubled graphemes, orthographically non-transparent graphemes were first replaced following the approach described in Schwartz and Chen (2017), which ensures there is only one way to tokenize a Yupik word form. This approach undoes an orthographic convention that shortens the spelling of words by exploiting the following facts:

• Graphemic doubling conveys voicelessness (g represents the voiced velar fricative while gg represents the voiceless velar fricative)

• Consecutive consonants in Yupik typically agree in voicing, with the exception of voiceless consonants that follow nasals

Yupik orthography undoubles a voiceless grapheme if it co-occurs with a second voiceless grapheme, according to the following three Undoubling Rules (Jacobson, 2001):

1. A fricative is undoubled next to a stop or one of the voiceless fricatives where doubling is not used to show voicelessness (f, s, wh, h).

2. A nasal is undoubled after a stop or one of the voiceless fricatives where doubling is not used to show voicelessness.


3. A fricative or nasal is undoubled when it comes after a fricative where doubling is used to show voicelessness, except that if the second fricative is ll then the first fricative is undoubled instead.

These two tokenization methods subsequently produced two parallel corpora, whose training pairs differed as seen in Table 4. Nevertheless, the input data always corresponded to Yupik surface forms, and the output data corresponded to underlying forms. The parallel corpora were then randomly partitioned into a training set, a validation set, and a test set in a 0.8/0.1/0.1 ratio.
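A sketch of grapheme tokenization and the random split follows; this is our code, with a deliberately abbreviated grapheme inventory, and it assumes redoubling of non-transparent graphemes has already been applied following Schwartz and Chen (2017).

import random

# Abbreviated multi-letter grapheme inventory, longest-match first.
GRAPHEMES = sorted(["ghw", "gh", "ng", "wh", "ll", "mm", "nn", "gg"],
                   key=len, reverse=True)

def tokenize_graphemes(word):
    """Greedy longest-match segmentation into graphemes, falling back to
    single characters."""
    tokens, i = [], 0
    while i < len(word):
        match = next((g for g in GRAPHEMES if word.startswith(g, i)), word[i])
        tokens.append(match)
        i += len(match)
    return tokens

def split_corpus(pairs, seed=0):
    """Random 0.8/0.1/0.1 train/validation/test partition."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    return (pairs[:int(0.8 * n)],
            pairs[int(0.8 * n):int(0.9 * n)],
            pairs[int(0.9 * n):])

assert tokenize_graphemes("qikmmiq") == ["q", "i", "k", "mm", "i", "q"]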

3.2.2 Initial Experiment

Using Marian, we took the data tokenized by characters and trained a shallow neural network that implemented an attentional encoder-decoder model (Bahdanau et al., 2014) with early stopping and holdout cross-validation. We used the parameters described in Sennrich et al. (2016), where the encoder and decoder consisted of one hidden layer each, of size 1024. On the 109,395 items in the final test set, this shallow neural model achieved 100% coverage and 59.67% accuracy.

Error analysis revealed a substantial amount of underspecification and surface form ambiguity as a result of syncretism in the nominal paradigm. As exemplified in (3a) and (3b), inflectional suffixes in Yupik may share the same underlying phonological form as well as the same morphophonological rules associated with that suffix.

(3a) ayveghet
     ayvegh- -∼sf-w:(e)t
     walrus- -ABS.UNPD.PL
     ‘walruses’

(3b) ayveghet
     ayvegh- -∼sf-w:(e)t
     walrus- -REL.UNPD.PL
     ‘of walruses’

For example, any noun that is inflected for the unpossessed absolutive plural, [N][ABS][UNPD][PL], produces a word form that is identical to the form yielded when the noun is inflected for the unpossessed relative plural, [N][REL][UNPD][PL]. The generated parallel data therefore includes the following two parallel forms, both of which have the exact same surface form. The first is the word in absolutive case:

a y v e g h e t
    ↓
a y v e g h [N] [ABS] [UNPD] [PL]

The second is the word in relative case:

a y v e g h e t
    ↓
a y v e g h [N] [REL] [UNPD] [PL]

Since these surface forms are only distinguishable through grammatical context, and our neural analyzer was not trained to consider context, it was made to guess which underlying form to return, and as suggested by the low accuracy score of 59.67%, the analyzer's guesses were often incorrect. We did not think it was proper to penalize the analyzer for wrong answers in instances of syncretism, and consequently implemented a post-processing step to account for this phenomenon.

This step was performed after the initial calculation of the neural analyzer's accuracy score, and provided an estimated or adjusted accuracy score that considered the syncretic forms equivalent. It iterated through all outputs of the neural analyzer that were initially flagged as incorrect for differing from their test set counterparts. Using the finite-state analyzer, the surface forms for each output and its corresponding test set item were then generated to verify whether or not their surface forms matched. If they matched, the neural analyzer's output was instead counted as correct (see Table 6 for examples). Assessed in this way, the shallow model achieved an adjusted accuracy score of 99.90%.
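In outline, the post-processing amounts to the following (a sketch; generate again stands in for the finite-state analyzer's generation direction):

def adjusted_accuracy(outputs, gold, generate):
    """Count an output correct if it matches the gold underlying form, or
    if both underlying forms generate the same surface form (syncretism)."""
    correct = 0
    for pred, ref in zip(outputs, gold):
        if pred == ref or set(generate(pred)) & set(generate(ref)):
            correct += 1
    return correct / len(gold)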

3.2.3 Data Revisited

Although the post-processing step is sufficient to demonstrate the true performance of the neural analyzer, we attempted to resolve this ambiguity issue with a more systematic approach. In their development of a neural morphological analyzer for Arapaho verbs, Moeller et al. (2018) conflated tags that resulted in ambiguous surface forms into a single, albeit less informative, tag, such as [3-SUBJ], joined from [3SG-SUBJ] and [3PL-SUBJ]. We attempted to do the same for Yupik by collapsing the tag set, but Yupik presents a somewhat more intricate ambiguity patterning. Syncretic tags can differ in their case markings alone, as in (3a) and (3b), but they can also differ across case, possessor type, and number, as seen in (4a) and (4b).


Neural Analyzer Output     Surface   Gold Standard                      Surface
anipa[N][ABS][UNPD][PL]    anipat    anipa[N][REL][UNPD][PL]            anipat       ✓
wayani[N][LOC][UNPD][PL]   wayani    wayagh[N][LOC][UNPD][PL]           wayani       ✓
suflu[N][LOC][UNPD][PL]    sufluni   suflugh[N][ABS][4SGPOSS][SGPOSD]   sufluni      ✓
puume[N][LOC][UNPD][DU]    pumegni   puu[N][LOC][4DUPOSS][SGPOSD]       puumegneng   ✗

Table 6: An illustration of the post-processing step that was implemented to resolve surface form ambiguity. If the output and its gold standard match in their surface forms, the output is then considered correct, despite the mismatch in the underlying forms.


(4a) neghsameng
     neghsagh- -∼f-wmeng
     seal- -ABL MOD.UNPD.SG
     ‘seal (as indefinite object); from seal’

(4b) neghsameng
     neghsagh- -∼f-wmeng
     seal- -REL.PL.4DUPOSS
     ‘their₂ (reflexive) seals’

As a result, we could not conflate our tags in the same way Moeller et al. (2018) did. Instead, for each set of syncretic tags, one string of tags was selected to represent all of the tags in the set, such that [N][ABS][UNPD][PL] denoted both unpossessed absolutive plural and unpossessed relative plural. The original 273 unique strings of tags (7 cases × 13 possessor types × 3 number markers) were consequently reduced to 170.

Having identified and reduced tag set ambiguity, we retrained the shallow model but only managed to achieve an unadjusted accuracy score of 95.48%. Additional error analysis revealed that some surface form ambiguity remained, but among non-syncretic tags that could not be collapsed. In other words, these tags generated identical surface forms for some nouns but not others. This is shown in (5a)–(5d):

(5a) sufluni
     suflu- -–ni
     cave- -ABS.SG.4SGPOSS
     ‘his/her (reflexive) own cave’

(5b) sufluni
     suflu- -∼f-wni
     cave- -LOC.UNPD.PL
     ‘in the cave’

(5c) sufluni
     suflug- -–ni
     chimney- -ABS.SG.4SGPOSS
     ‘his/her (reflexive) own chimney’

(5d) suflugni
     suflug- -∼f-wni
     chimney- -LOC.UNPD.PL
     ‘in the chimney’

Thus, despite the identical surface forms shown in (5a) and (5b), these same inflection tags do not result in identical surface forms for the noun root suflug in (5c) and (5d), since the underlying inflectional suffixes they represent are distinct: –ni versus ∼f-wni. As such, these two strings of tags, [N][ABS][4SGPOSS][SGPOSD] and [N][LOC][UNPD][PL], cannot be collapsed.

Since the tag set cannot be reduced further than 170 tags, we must invoke the post-processing step introduced in §3.2.2 regardless. Moreover, since our proposed method results in some loss of information with respect to all possible underlying forms, we will have to seek an alternative method for handling syncretism. Nevertheless, after applying the post-processing step, the retrained model also achieved an adjusted accuracy score of 99.90%.

3.2.4 Additional Experiments

We trained four models on the inflected nouns dataset, experimenting with shallow versus deep neural network architectures and the two tokenization methods: by characters and by redoubled graphemes. The shallow neural models were identical to those described in §3.2.2 and §3.2.3. The deep neural models used four hidden layers and LSTM cells, following Barone et al. (2017). As before, all models were trained to convergence and evaluated with holdout cross-validation on the same test set. Results are presented in Table 7, along with the accuracy scores before and after resolving all surface ambiguities.


[Finite-state diagram omitted: states 0–5 with C, V, and ε transitions encoding the syllable pattern described in §4.]

Figure 1: Finite-state diagram depicting legal word structure for nouns in Yupik. Here, V may refer to either a short vowel, -i, -a, -u, -e, or a long vowel, -ii, -aa, -uu, and C refers to any consonantal Yupik grapheme. Additionally, any noun-final C is restricted to be -g, -gh, -ghw, or -w.

Model     Tokenization   Accuracy   Adjusted
shallow   char           95.37      99.87
deep      char           95.07      99.95
shallow   redoubled      95.48      99.90
deep      redoubled      95.17      99.96

Table 7: Accuracy and adjusted accuracy scores on the generated test data from §3.2.1 (before and after resolving all surface ambiguity) for each model. The deep redoubled model is the highest-performing model on the held-out test data.

While all models reached over 99% adjusted accuracy, the deep models outperformed their shallow counterparts, and the models trained on data tokenized by redoubled graphemes fared marginally better than those trained on data tokenized by individual characters. The latter may result from the fact that some inflections operate on full graphemes, for instance, gh# → q# during inflection for the unpossessed absolutive singular. The percentage improvement is so slight, however, that this may not be of much consequence. The deep model trained on redoubled graphemes was most accurate, peaking at 99.96%.

Despite comparable accuracy scores, there did not appear to be a discernible pattern with respect to errors among the four models.

4 Finite-State Guesser

We modified the finite-state analyzer of Chen and Schwartz (2018) by implementing a guesser module; this guesser permits the analyzer to hypothesize possible roots not present in its lexicon that nevertheless adhere to Yupik phonotactics and syllable structure. Noun roots, for example, may only end in a vowel, -g, -gh, -ghw, or -w, and follow the structural pattern given by the regular expression below:

(C) V (C) (C V (C))*

Thus, any guess made for a noun would adhere to this patterning, which the finite-state diagram in Figure 1 captures visually. Moreover, the guesser was implemented as a backoff mechanism, such that it was only called if no other analysis could be determined. Guesses were also labeled with an additional tag, [GUESS], to distinguish them from the output returned by the finite-state analyzer itself.
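Over a simplified segment inventory, the same constraint can be checked with an ordinary regular expression. This is our illustration only; the actual guesser encodes the full grapheme inventory directly in the FST.

import re

V = r"(?:ii|aa|uu|[iaue])"              # long or short vowel
C = r"(?:ghw|gh|ng|[ptkqgvlsznmwyrf])"  # abbreviated consonant set
FINAL = r"(?:ghw|gh|g|w)"               # permitted noun-final consonants

# (C) V (C) (C V (C))*, with the final coda restricted to FINAL.
NOUN_ROOT = re.compile(rf"{C}?{V}(?:{C}{{0,2}}{V})*{FINAL}?")

def is_possible_noun_root(root: str) -> bool:
    return bool(NOUN_ROOT.fullmatch(root))

assert is_possible_noun_root("kaviigh")
assert not is_possible_noun_root("ghk")  # no vowel nucleus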

5 Comparing Neural vs. Finite-State Analyzers

The final experimental condition involved comparing and quantifying the performance of the neural analyzer against its finite-state counterpart. Because the test set used in §3 was generated by the finite-state Yupik analyzer, it would be unfair to contrast the performance of the neural analyzer and the finite-state analyzer on this dataset, as the finite-state analyzer would be guaranteed to achieve 100% accuracy.

5.1 Blind Test Set

Instead, we made use of the nouns from a published corpus of Yupik interlinear glosses as an unseen test set: Mrs. Della Waghiyi's St. Lawrence Island Yupik Texts With Grammatical Analysis by Kayo Nagai (2001), a collection of 14 orally-narrated short stories that were later transcribed, annotated, and glossed by hand. From these stories, we extracted all inflected nouns with no intervening derivational suffixes to form a modest evaluation corpus of 360 words. This was then pared down further to 349 words by removing 11 English borrowings that had been inflected for Yupik nominal cases. Manual examination of the evaluation corpus revealed several problematic items that we believe represent typos or other errors. These were removed from the final evaluation set.


For types            Coverage   Accuracy
FST (no guesser)     85.78      78.90
FST (with guesser)   100        84.86
Neural               100        92.20

For tokens           Coverage   Accuracy
FST (no guesser)     85.96      79.82
FST (with guesser)   100        84.50
Neural               100        91.81

Table 8: Comparison of coverage and accuracy scores on the blind test set (§5.1), contrasting the finite-state and neural analyzers. Accuracy is calculated over all types and tokens.

5.2 Blind Test Results

We analyzed the data from §5.1 using the original finite-state analyzer, the finite-state analyzer with guesser (§4), and the best-performing neural model from §3.2.4 (the deep model trained on graphemes).

Accuracy scores for the neural and finite-state analyzers, when evaluated on this refined corpus, are reported for types and tokens in Table 8, where accuracy was calculated over the total number of types and tokens, respectively. On the blind test set, the neural analyzer achieved coverage of 100% and accuracy of over 90%, outperforming both finite-state analyzers.

6 Discussion

Of the two analyzers which manage to achieve maximum coverage by outputting a parse for each item encountered, the neural analyzer consistently outperforms the finite-state analyzer, even when the finite-state analyzer is supplemented with a guesser. Furthermore, as illustrated in Tables 9 and 10, the neural analyzer is also more adept at generalizing.

6.1 Capacity to Posit OOV Roots

Out-of-vocabulary (OOV) roots are those roots that appear in the evaluation corpus extracted from Nagai and Waghiyi (2001) but do not appear in the lexicon of the finite-state analyzer or in the Badten et al. (2008) Yupik-English dictionary. Of the seven unattested roots identified in the corpus, the neural analyzer returned a correct morphological parse for three of them, while the finite-state analyzer only returned two (see Table 9).

Unattested Root   FST   NN
aghnasinghagh      –    –
aghveghniigh       –    ✓
akughvigagh        ✓    ✓
qikmiraagh         –    –
sakara             ✓    –
sanaghte           –    –
tangiqagh          –    ✓

Table 9: Comparison of the finite-state and neural analyzers' performances on unattested roots, which are unaccounted for in both the lexicon of the finite-state analyzer and the Badten et al. (2008) Yupik-English dictionary. A checkmark indicates that the correct morphological analysis was returned by that analyzer.

6.2 Capacity to Handle Spelling Variation

The neural analyzer performed even better with spelling variation in Yupik roots (see Table 10). Of the three spelling variants identified in the corpus, all differed from their attested forms with respect to a single vowel, -i- versus -ii-. The neural analyzer returned the correct morphological parse for all spelling variants, while the finite-state analyzer supplemented with the guesser only succeeded with one, melqighagh. Moreover, the neural analyzer even managed to guess the correct underlying root in spite of a typo in one of the surface forms (ukusumun rather than uksumun).

Root Variant     FST   NN
melqighagh        ✓    ✓
piitesiighagh     –    ✓
uqfiilleghagh     –    ✓
*ukusumun         –    ✓

Table 10: Comparison of the finite-state and neural analyzers' performances on root variants, which are spelled differently from their attested counterparts in the lexicon of the finite-state analyzer and Badten et al. (2008). A checkmark indicates that the analyzer returned the correct morphological analysis, while the asterisk (*) denotes an item with a typo.

6.3 Implications for Linguistic Fieldwork

The stronger performance of the neural analyzer has immediate implications for future computational endeavors and for field linguists working in Yupik language documentation, as well as longer-range implications for field linguists in general. With respect to fieldwork, a better-performing analyzer with greater and more precise coverage equates to better real-time processing of data for field linguists performing morphological analysis and interlinear glossing.

The neural analyzer is also immune to overgeneration, since it relies on a machine translation framework that returns the "best" translation for every input sequence; in our case, this equates to one morphological analysis for each surface form. This contrasts with the finite-state analyzer variants, which may return hundreds or thousands of analyses for a single surface form if the finite-state network permits. For instance, within the text corpus of the Jacobson (2001) reference grammar alone (see Table 1), the Yupik finite-state analyzer (without guesser) generated over 100 analyses for each of 16 word types. The word with the greatest number of analyses, laalighfiknaqaqa, received 6823 analyses, followed by laalighfikiikut with 1074 (Chen and Schwartz, 2018). In this way, a neural analyzer that returns one morphological analysis per surface form is valuable to field linguists, as it does not require them to sift through an indiscriminate number of possibilities.
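In decoding terms, the neural analyzer simply keeps the highest-scoring hypothesis from the beam. The toy sketch below illustrates the idea; the candidate analyses and scores are invented for illustration and are not actual Yupik parses.

def one_best(hypotheses):
    """Return the single highest-scoring analysis from an n-best list.

    `hypotheses` is a list of (analysis, log_probability) pairs such as
    a beam-search decoder can emit for one input sequence.
    """
    return max(hypotheses, key=lambda h: h[1])[0]

# Invented candidates: the decoder may entertain many segmentations,
# but the analyzer surfaces only one, so no manual sifting is needed.
beam = [("root-[SUFFIX_A]-[INFL_X]", -1.2),
        ("root-[SUFFIX_B]-[INFL_Y]", -3.7)]
print(one_best(beam))  # root-[SUFFIX_A]-[INFL_X]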

6.4 Application to Other Languages

Aside from our procedure to tokenize Yupik words into sequences of fully transparent graphemes, exceedingly little language-specific preprocessing was performed. Our general procedure consists of creating a parallel corpus of surface and underlying forms by iterating through possible underlying forms and using a morphological finite-state transducer to generate the corresponding surface forms. We believe that this procedure of bootstrapping a learned morphological analyzer through the lens of machine translation should be generally applicable to other languages (especially, but certainly not exclusively, those of similar typology).
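As a minimal sketch of this bootstrapping loop (with illustrative function and file names of our own choosing, and simple string concatenation standing in for the analyzer's actual underlying representations):

import itertools

def bootstrap_parallel_corpus(roots, suffix_sequences, generate_surface,
                              src_path="surface.txt", tgt_path="underlying.txt"):
    """Write a parallel corpus pairing surface forms with underlying forms.

    `generate_surface` is a callable mapping an underlying form to its
    surface realizations, e.g. a wrapper around the finite-state
    transducer's generation direction.
    """
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for root, suffixes in itertools.product(roots, suffix_sequences):
            underlying = root + suffixes
            for surface in generate_surface(underlying):
                # One training pair per licensed surface form; the NMT
                # model is then trained in the inverse direction,
                # surface form -> underlying analysis.
                src.write(" ".join(surface) + "\n")  # character tokens
                tgt.write(underlying + "\n")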

6.5 Added Value of FST Analyzers

Finally, the methodology employed here, in which a neural analyzer is trained with data generated from an existing finite-state implementation, is inherently valuable. Though the development of finite-state morphological analyzers demands considerable effort, the fact that their output may be leveraged in the development of better-performing systems is especially practical for under-resourced languages such as Yupik, where any form of training data is scarce. Thus, finite-state analyzers may serve a twofold purpose: that of morphological analysis, as they were intended to be used, but also the generation of training data for neural systems.

7 Conclusion

Morphological analysis is a critical enabling technology for polysynthetic languages such as St. Lawrence Island Yupik. In this work we have shown that the task of learning a robust, high-accuracy morphological analyzer can be bootstrapped from an existing finite-state analyzer. Specifically, we have shown how this can be done by framing the problem as a machine translation task. We have successfully trained a neural morphological analyzer for derivationally unaffixed nouns in St. Lawrence Island Yupik and compared its performance with that of its existing finite-state equivalent with respect to accuracy.

This work represents a case where the student truly learns to outperform its teacher. The neural analyzer produces analyses for all Yupik word types it is presented with, a feat that the original finite-state system fails to achieve. At the same time, the neural analyzer achieves higher accuracy than either the original finite-state analyzer or a variant FST augmented with a guesser. The neural analyzer is capable of correctly positing roots for out-of-vocabulary words. Finally, the neural analyzer is capable of correctly handling variation in spelling.

In future work, we plan to explore more thorough methods for handling ambiguous surface forms. We also plan to correct the minor FST error identified in §3.1.2. Most importantly, the training dataset will be extended to include items beyond inflected nouns with no intervening derivational suffixes. Specifically, we intend to increase the training set to include verbs, particles, and demonstratives in addition to nouns, as well as words that include derivational suffixes.

References

Anders Apassingok, (Iyaaka), Jessie Uglowook, (Ayuqliq), Lorena Koonooka, (Inyiyngaawen), and Edward Tennant, (Tengutkalek), editors. 1993. Kallagneghet / Drumbeats. Bering Strait School District, Unalakleet, Alaska.

Anders Apassingok, (Iyaaka), Jessie Uglowook, (Ayuqliq), Lorena Koonooka, (Inyiyngaawen), and Edward Tennant, (Tengutkalek), editors. 1994. Akiingqwaghneghet / Echoes. Bering Strait School District, Unalakleet, Alaska.


Anders Apassingok, (Iyaaka), Jessie Uglowook, (Ayuqliq), Lorena Koonooka, (Inyiyngaawen), and Edward Tennant, (Tengutkalek), editors. 1995. Suluwet / Whisperings. Bering Strait School District, Unalakleet, Alaska.

Anders Apassingok, (Iyaaka), Willis Walunga, (Kepelgu), and Edward Tennant, (Tengutkalek), editors. 1985. Sivuqam Nangaghnegha — Siivanllemta Ungipaqellghat / Lore of St. Lawrence Island — Echoes of our Eskimo Elders, volume 1: Gambell. Bering Strait School District, Unalakleet, Alaska.

Anders Apassingok, (Iyaaka), Willis Walunga, (Kepelgu), and Edward Tennant, (Tengutkalek), editors. 1987. Sivuqam Nangaghnegha — Siivanllemta Ungipaqellghat / Lore of St. Lawrence Island — Echoes of our Eskimo Elders, volume 2: Savoonga. Bering Strait School District, Unalakleet, Alaska.

Anders Apassingok, (Iyaaka), Willis Walunga, (Kepelgu), and Edward Tennant, (Tengutkalek), editors. 1989. Sivuqam Nangaghnegha — Siivanllemta Ungipaqellghat / Lore of St. Lawrence Island — Echoes of our Eskimo Elders, volume 3: Southwest Cape. Bering Strait School District, Unalakleet, Alaska.

Linda Womkon Badten, (Aghnaghaghpik), Vera Oovi Kaneshiro, (Uqiitlek), Marie Oovi, (Uvegtu), and Christopher Koonooka, (Petuwaq). 2008. St. Lawrence Island / Siberian Yupik Eskimo Dictionary. Alaska Native Language Center, University of Alaska Fairbanks.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Antonio Valerio Miceli Barone, Jindrich Helcl, Rico Sennrich, Barry Haddow, and Alexandra Birch. 2017. Deep architectures for neural machine translation. arXiv preprint arXiv:1707.07631.

Emily Chen and Lane Schwartz. 2018. A morphological analyzer for St. Lawrence Island / Central Siberian Yupik. In Proceedings of the 11th Language Resources and Evaluation Conference (LREC'18), Miyazaki, Japan.

Mans Hulden. 2009. Foma: a finite-state compiler and library. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 29–32. Association for Computational Linguistics.

Steven A. Jacobson. 2001. A Practical Grammar of the St. Lawrence Island/Siberian Yupik Eskimo Language, 2nd edition. Alaska Native Language Center, University of Alaska Fairbanks, Fairbanks, Alaska.

Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, Andre F. T. Martins, and Alexandra Birch. 2018. Marian: Fast neural machine translation in C++. In Proceedings of ACL 2018, System Demonstrations, pages 116–121, Melbourne, Australia. Association for Computational Linguistics.

Christopher (Petuwaq) Koonooka. 2003. Ungipaghaghlanga: Let Me Tell You A Story. Alaska Native Language Center.

G.A. Menovshchikov. 1988. Materialiy i issledovaniia po iaziku i fol'kloru chaplinskikh eskimosov. Nauka, Leningrad. Traditional Yupik stories, with stress and length indicated; includes Russian translation.

Sarah Moeller, Ghazaleh Kazeminejad, Andrew Cowell, and Mans Hulden. 2018. A neural morphological analyzer for Arapaho verbs learned from a finite state transducer. In Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages, Santa Fe, New Mexico. Association for Computational Linguistics.

Kayo Nagai and Della Waghiyi. 2001. Mrs. Della Waghiyi's St. Lawrence Island Yupik texts with grammatical analysis, volume A2-006 of Endangered Languages of the Pacific Rim Publications Series. ELPR, Osaka.

Willem J. de Reuse. 1994. Siberian Yupik Eskimo — The Language and Its Contacts with Chukchi. Studies in Indigenous Languages of the Americas. University of Utah Press, Salt Lake City, Utah.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Lane Schwartz and Emily Chen. 2017. Liinnaqumalghiit: A web-based tool for addressing orthographic transparency in St. Lawrence Island/Central Siberian Yupik. Language Documentation and Conservation, 11:275–288.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Edinburgh neural machine translation systems for WMT 16. arXiv preprint arXiv:1606.02891.


Author Index

Arkhangelskiy, Timofey, 56
Arppe, Antti, 64

Bender, Emily M., 28
Berberova, Remziye, 74
Brubeck Unhammer, Kevin, 46

Chen, Emily, 87
Cowell, Andrew, 81

Everson, Rebecca, 1

Foley, Ben, 14

Gessler, Luke, 6
Gökırmak, Memduh, 74
Grimm, Scott, 1

Hämäläinen, Mika, 39
Harrigan, Atticus, 64
Honoré, Wolf, 1
Howell, Kristen, 28
Howell, Nick, 74
Hulden, Mans, 81
Hunt, Benjamin, 87

Kavitskaya, Darya, 74
Kazeminejad, Ghazaleh, 81

Mills, Timothy, 64
Moeller, Sarah, 81

Nørstebø Moshagen, Sjur, 46

Rueter, Jack, 39

San, Nay, 14
Santos, Eddie Antonio, 23
Schreiner, Sylvia L.R., 87
Schwartz, Lane, 87

Tyers, Francis M., 74

van Esch, Daan, 14

Washington, Jonathan, 74
Wiechetek, Linda, 46

Zamaraeva, Olga, 28
