Post on 07-Dec-2021

Page 1: Fieldwork as a Computational Problem Uniting Computational

The Human Language Project: Uniting Computational Linguistics with Documentary Linguistics

Steven Bird

University of Melbourne & University of Pennsylvania

Fieldwork as a Computational Problem

• three data types

• three kinds of metadata

• relations

• computational challenge

• http://www.ldc.upenn.edu/sb/fieldwork/

• this isn't computational linguistics

Convergences

• concern with data

• use of speech data

• bilingual text

Convergences: Bitext + morph = IGT

• bilingual text

• morphologically analyzed text

• comparative wordlists

• bilingual lexicons
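The "Bitext + morph = IGT" equation above can be illustrated with a minimal sketch: interlinear glossed text (IGT) modelled as morphologically analyzed words paired with a free translation. The class names `Morph` and `IGTLine` are hypothetical, and the example forms are invented placeholders, not data from any language discussed in the talk.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Morph:
    form: str   # the morph as segmented in the source language
    gloss: str  # the gloss assigned to that morph


@dataclass
class IGTLine:
    words: List[List[Morph]]  # morphological analysis: each word is a list of morphs
    translation: str          # free translation: the bitext side

    def source_tier(self) -> str:
        # Join morphs with hyphens and words with spaces.
        return " ".join("-".join(m.form for m in w) for w in self.words)

    def gloss_tier(self) -> str:
        # Same layout as the source tier, but using glosses.
        return " ".join("-".join(m.gloss for m in w) for w in self.words)

    def render(self) -> str:
        # Three-line display: source, gloss, quoted free translation.
        return "\n".join(
            [self.source_tier(), self.gloss_tier(), f"'{self.translation}'"]
        )


# Invented placeholder forms, for illustration only.
example = IGTLine(
    words=[[Morph("ka", "dog"), Morph("ta", "PL")], [Morph("mi", "run")]],
    translation="The dogs run",
)
print(example.render())
```

Keeping the translation on the line and the glosses on the morphs is one possible design; it makes the point that dropping the morph analysis leaves plain bitext, while dropping the translation leaves morphologically analyzed text.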

Documentary and Descriptive Linguistics

Nikolaus Himmelmann (1998) "Documentary and Descriptive Linguistics" Linguistics 36:161-195

Documentation types: Interlinear text

Guwamu, Peter Austin (2010)

Documentation types: Lexicons

Kröger, F. Buli-English dictionary: With an Introductory Grammar and an Index. Münster: Lit, 1992.

Documentary and Descriptive Linguistics: Use of Computation

Nikolaus Himmelmann (1998) "Documentary and Descriptive Linguistics" Linguistics 36:161-195

• documentarists

• innovation, tool development

• descriptivists

• Evans, Hyman

Karaim CD-ROM

Eva Csato and David Nathan

Nathan, D. (1998) The spoken Karaim CD: Sound, text, lexicon and "active morphology" for language learning multimedia, Proceedings of the Ninth Annual Conference on Turkish Linguistics.

Where's the science?

After years of neglect in which linguistics lost sight of the value of empirical field research, new life has finally been breathed into this fundamentally important component of our discipline. But in the process, linguistic fieldwork has ironically lost sight of linguistics! That is, if by linguistics one means the scientific study of language, fieldwork ideology and practice have gone askew. The major movements and individuals that we can thank for the resurgence of interest in linguistic fieldwork all promote (in words or deeds) approaches to field research that fall far short of the tenets of science. Examples of such misguided directions include (a) the endangered languages movement, (b) language documentation, and (c) the "Dixon school". In my talk, I expose the failings of these non-scientific approaches to linguistic field research and set out what would be required for linguistic fieldwork to qualify as truly scientific and thus be entitled to recognition as an essential subfield within linguistics per se.

Paul Newman -- Linguistic Fieldwork as a Scientific Enterprise, International Conference on Language

Key Questions

• What does computational linguistics offer to the problem of documenting and describing the world's languages?

• How can CL help improve the descriptive value of language documentation?

• three places where this might happen

Basic Oral Language Documentation

Pilot project: Synopsis of 1 week in Moife

1. Discussions re orthography, literacy

2. training, practice, listening, tone orthography experiment

3. training in oral transcription and translation; gave out recorders

4. re-assigned recorders

5. (Saturday)

6. oral transcription, vitality survey, orthography recommendations

7. more oral transcription

Pilot project

Main Phase

Preparation

• Batteries

• Date

• Identifiers

Training

Basic Oral Language Documentation: Overview of one week's activity...

Oral Annotation Protocol

Transcription

Cross Checking

Evaluation

• What is the quality of the collected materials?

• Can we correctly establish the phonemic inventory of the language from the recorded materials?

• What semantic domains are covered?

• What can trained linguists get from the raw transcripts?

Back to the computational questions...

Axioms

• Limited funding, but costs for local participation are negligible

• Cannot assume continuous presence of a linguist: primary collection work is "unsupervised"

• Cannot assume an orthography

• Can give training in documentation, but not description

• Contact language has every conceivable resource

• No time limit

Transcription

• contact-language orthography: issues with normalisation

• lexical inventory, diphone inventory

• sense tagging

• multiple instances of one story

• ASR?

• resegmentation

• active learning in interlinear text glossing
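As a rough illustration of the "lexical inventory, diphone inventory" item above, the following sketch counts word types and adjacent symbol pairs in transcripts written in a contact-language orthography. It assumes, for simplicity, that each letter stands for one segment, which will not hold for digraphs or tone marks; the function names are hypothetical and the sample strings are invented.

```python
from collections import Counter
from typing import Iterable, List


def _words(line: str) -> List[str]:
    # Lowercase each whitespace-separated token and keep only its letters,
    # a crude stand-in for real orthographic normalisation.
    cleaned = ("".join(ch for ch in w if ch.isalpha()) for w in line.lower().split())
    return [w for w in cleaned if w]


def lexical_inventory(transcripts: Iterable[str]) -> Counter:
    # Frequency of each normalised word form across all transcripts.
    counts: Counter = Counter()
    for line in transcripts:
        counts.update(_words(line))
    return counts


def diphone_inventory(transcripts: Iterable[str]) -> Counter:
    # Frequency of each pair of adjacent letters within a word.
    # Assumption: one letter corresponds to one segment.
    counts: Counter = Counter()
    for line in transcripts:
        for word in _words(line):
            counts.update(a + b for a, b in zip(word, word[1:]))
    return counts


# Invented strings, for illustration only.
sample = ["Bata tab, bata."]
print(lexical_inventory(sample).most_common())
print(diphone_inventory(sample).most_common())
```

Rare or unexpected pairs in such a count are one cheap way to flag the normalisation issues mentioned in the first bullet, since inconsistent spellings of the same word tend to show up as low-frequency diphones.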

MT to help with eliciting morphology?

• problems with recording and translating isolated words

• short complete sentences with translations

• fix nouns and vary the form of the verb?

• bilingual texts as the key means by which a user would train the system
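The idea of fixing the nouns and varying the form of the verb can be sketched as a small prompt generator for the contact-language side of an elicitation session. This is an illustration under assumptions: English stands in for the contact language, `verb_paradigm_prompts` is a hypothetical helper, and the sentence frame is invented.

```python
from typing import List


def verb_paradigm_prompts(subject: str, obj: str, verb_forms: List[str]) -> List[str]:
    # Hold the subject and object nouns constant and substitute each verb form,
    # producing short complete sentences rather than isolated words.
    return [f"{subject} {form} {obj}." for form in verb_forms]


# Invented frame, for illustration only.
prompts = verb_paradigm_prompts(
    "The woman", "the fish", ["cooks", "cooked", "will cook", "is cooking"]
)
for sentence in prompts:
    print(sentence)
```

Each prompt would be paired with the speaker's oral translation, so the resulting sentence pairs differ only in the verb and can serve as the kind of bilingual text the last bullet describes.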

MT as the measure of adequacy?

• inspect MT output to see what is lost

• supply a corrected version when it gets something wrong

• supply other examples, much as you would do with a child
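One simple way to "inspect MT output to see what is lost" is to compare the system's output against a reference translation and list the reference words that are missing. The sketch below is a minimal word-overlap check, not the method proposed in the talk; the function names are hypothetical and the sentences are invented.

```python
import re
from collections import Counter
from typing import List


def tokens(text: str) -> List[str]:
    # Lowercase and keep runs of letters and apostrophes as tokens.
    return re.findall(r"[a-z']+", text.lower())


def lost_words(reference: str, mt_output: str) -> List[str]:
    # Multiset difference: reference tokens not accounted for in the MT output.
    missing = Counter(tokens(reference)) - Counter(tokens(mt_output))
    return sorted(missing.elements())


def reference_recall(reference: str, mt_output: str) -> float:
    # Share of reference tokens that also appear in the MT output.
    ref = Counter(tokens(reference))
    total = sum(ref.values())
    if total == 0:
        return 1.0
    lost = sum((ref - Counter(tokens(mt_output))).values())
    return 1 - lost / total


# Invented sentences, for illustration only.
reference = "The two dogs ran away yesterday"
mt_output = "The dog ran away"
print(lost_words(reference, mt_output))
print(reference_recall(reference, mt_output))
```

A list of lost words gives the person correcting the system a starting point for the next two bullets: supplying a corrected version, or further examples that contain the missing material.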

Data mining

Bird (1999) Multidimensional exploration of online linguistic field data. NELS 29: 33-50.