CoLang 2014 Data Management and Archiving Course Session 2 Nick Thieberger University of Melbourne


Quiz

In a morning recording session you recorded two speakers, each telling a story, then recorded both of them together answering your questions and describing, now in English, what they had told you.

This produced 3 audio files, 2 video files (which also carry audio) and 3 photographs. In addition you have handwritten notes.

Later a speaker (who you are paying for her assistance) will transcribe the media.

How would you prepare this dataset? What files would there be? What naming conventions would you use? How would you name the transcripts?
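One possible way to think about the quiz is sketched below. It is not the course's answer: the collector code `XY1`, the session date, the file extensions and the choice of which events have video are all assumptions made for illustration. The point it shows is that every file in the session shares one stem and that each transcript takes the base name of the media it transcribes.

```python
# Hypothetical identifiers, invented for this sketch.
COLLECTOR = "XY1"       # collector/collection code
SESSION = "20140616"    # recording date, YYYYMMDD


def name(item, ext):
    """Return '<collector>-<session>-<item>.<ext>'."""
    return f"{COLLECTOR}-{SESSION}-{item}.{ext}"


def build_inventory():
    """List every file expected from the morning session."""
    files = []
    # Three recorded events: two stories and the joint retelling in English.
    for n in (1, 2, 3):
        files.append(name(f"{n:02d}", "wav"))   # audio
        files.append(name(f"{n:02d}", "eaf"))   # transcript, same base name as its media
    # Assumption: the two video files cover events 01 and 02.
    for n in (1, 2):
        files.append(name(f"{n:02d}", "mp4"))
    for n in (1, 2, 3):
        files.append(name(f"P{n:02d}", "jpg"))  # photographs
    files.append(name("notes", "pdf"))          # scan of the handwritten notes
    return sorted(files)


if __name__ == "__main__":
    for filename in build_inventory():
        print(filename)
```

Because the transcript and its recording differ only in the extension, the link between them survives even when the files are copied out of their folder.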

Briefly – Why Toolbox, or Flex?

Toolbox IGT

Id number - Media reference

Text line

Parsed Morphemic line

Gloss line

Free translation line

Parse function

Lookup function

•  Lexical database functions

Tracking text processing in Toolbox


How to get from the field to analysis to the archive?

•  Using good tools, for example:
– Transcriber, Elan
– Toolbox
– Fieldworks
– Why are these good tools?

Transcription with time-alignment

•  Necessary step in building a corpus from which to make generalizations.

•  No extra cost to use time-alignment
•  Index of media
•  Many possible outputs
•  Tools create simple text files that encode the relationship of the text to the timecodes in the media
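To make the last point concrete, here is a minimal sketch of how a time-aligned text file can act as an index of the media. The tab-delimited layout (start, end, text) is an assumption for illustration, not the export format of any particular tool, and the second segment is a placeholder rather than real data.

```python
import csv
import io

# Assumed layout: start seconds, end seconds, text, separated by tabs.
SAMPLE = (
    "316.43\t320.317\tPreg nafkal skot nam̃er nig Emlakul.\n"
    "320.317\t324.0\t(placeholder for the next utterance)\n"
)


def load_segments(text):
    """Parse tab-delimited lines into (start, end, text) tuples."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [(float(row[0]), float(row[1]), row[2]) for row in reader if row]


def segment_at(segments, seconds):
    """Return the text being spoken at a given time, or None."""
    for start, end, text in segments:
        if start <= seconds < end:
            return text
    return None


if __name__ == "__main__":
    segments = load_segments(SAMPLE)
    print(segment_at(segments, 318.0))
```

The same table can be read in the other direction: given a sentence, the start and end times say where to cue the recording.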

Other transcription tools

http://liceu.uab.es/~joaquim/phonetics/fon_anal_acus/herram_anal_acus.html

Other annotation tools

Anvil, ATLAS, CLAN, CSLU Toolkit, The MATE Workbench, Multitool, SignStream, SmartKom, SyncWRITER, TalkBank, Transana, Voicewalker

List of tools at this wiki: http://www.exmaralda.org/annotation/index.php/Main_Page

What is so good about these programs: Toolbox, Flex, Elan etc.?

•  They produce good textual outputs

•  Simple (but structured) text can be easily converted and archived

Working form: The form in which information is stored as it is created and edited.
Archival form: The form in which information is stored for access long into the future.
Presentation form: The form in which information is presented to the public.

Simons (2004)

Working form: The form in which information is stored as it is created and edited.

Can include notes, not all of which may be useful later.

E.g., files being processed (in Elan, Toolbox or Flex) and ancillary files (.typ, .prj, etc.) that are only needed while working to create the annotated data.

Archival form: The form in which information is stored for access long into the future.

Highest resolution form of the data

Presentation form: The form in which information is presented to the public.

Derived from working or archival form. May be compressed and arranged to make it easier to deliver and interpret.

Recording > Transcript

Preg nafkal skot nam̃er nig Emlakul. NT1-98002-A 316.43 320.317

Reuse

\id 061:005
\aud NT1-98002-A 316.43 320.317
\tx Preg nafkal skot nam̃er nig Emlakul.
\mr preg nafkal skot nam̃er nig Emlakul
\mg make war with people of p.name
\fg To fight the people of Malakula.
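Because the record above is plain text with one backslash marker per field, it can be reused with very little code. The sketch below is a minimal illustration, not Toolbox's own logic: it assumes one field per line and pairs morphemes with glosses by whitespace position, which happens to work for this record but is not how Toolbox aligns columns in general.

```python
RECORD = r"""\id 061:005
\aud NT1-98002-A 316.43 320.317
\tx Preg nafkal skot nam̃er nig Emlakul.
\mr preg nafkal skot nam̃er nig Emlakul
\mg make war with people of p.name
\fg To fight the people of Malakula.
"""


def parse_record(text):
    """Map each backslash marker to its field value."""
    fields = {}
    for line in text.splitlines():
        if not line.startswith("\\"):
            continue  # skip blank or unmarked lines
        marker, _, value = line[1:].partition(" ")
        fields[marker] = value.strip()
    return fields


def media_reference(fields):
    """Split the \\aud field into (file id, start seconds, end seconds)."""
    name, start, end = fields["aud"].split()
    return name, float(start), float(end)


def morpheme_glosses(fields):
    """Pair morphemes with glosses by position (a simplifying assumption)."""
    return list(zip(fields["mr"].split(), fields["mg"].split()))


if __name__ == "__main__":
    fields = parse_record(RECORD)
    print(media_reference(fields))
    print(morpheme_glosses(fields))
    print(fields["fg"])
```

The media reference is what ties the interlinear text back to the recording, so the same record can feed a playable presentation or a printed text.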

Reuse

Recording > Transcript > Interlinear

Book production based on IGT

Text, interlinearised in Fieldworks

Playable media, produced in Elan

Longevity of character encoding

•  Problems of legacy fonts
–  fonts as variable representation of one underlying form
•  e.g., ASCII character ‘N’ could be represented by ŋ if IPATimes was selected as the font
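The legacy-font problem can be sketched in a few lines: the stored character is still ‘N’, and only the font made it look like ŋ. The mapping below contains just the one pair mentioned above; a real conversion table for a legacy font would be much larger and is not reproduced here. The input string is a made-up example, not a word from any language.

```python
# One-entry table taken from the example above: in text set in the legacy
# font, the stored ASCII 'N' stands for the velar nasal.
LEGACY_TO_UNICODE = {"N": "\u014B"}  # U+014B LATIN SMALL LETTER ENG


def convert_legacy(text, table=LEGACY_TO_UNICODE):
    """Replace legacy-font stand-ins with the intended Unicode characters."""
    return text.translate(str.maketrans(table))


if __name__ == "__main__":
    stored = "aNa"  # what is actually in the file
    print(convert_legacy(stored))
```

A conversion like this is only safe on text known to have been set in the legacy font; applied to ordinary text it would also change every genuine capital N.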

Need for documents to be legible!

•  Characters must be portable so that the documents are legible

•  International standard for character encoding - Unicode

Unicode issues - Choose characters carefully! cf. Mia Kalish’s article in LD&C (2) - mixing different ‘code sets’ can lead to sorting issues.

à - option a
à - a plus combining diacritic 0300
à - U+00E0
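The difference between these forms is invisible on screen but real in the file. A short sketch using Python's standard `unicodedata` module shows that the precomposed and combining versions are different strings until they are normalized to one form.

```python
import unicodedata

precomposed = "\u00E0"   # single code point: a with grave
decomposed = "a\u0300"   # 'a' followed by COMBINING GRAVE ACCENT


def nfc(text):
    """Normalize to the composed form (NFC)."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    print(precomposed == decomposed)            # the raw strings differ
    print(len(precomposed), len(decomposed))    # 1 code point versus 2
    print(nfc(precomposed) == nfc(decomposed))  # equal once normalized
```

Normalizing a whole corpus to a single form before sorting or searching is one way to avoid the mixed-code-set problems mentioned above.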

Using Unicode

Setting up keyboards for more extensive Unicode entry

Windows XP:
•  Tavultesoft Keyman: http://www.tavultesoft.com/keyman/
•  SIL IPA keyboard for Keyman: http://scripts.sil.org/Keyman
•  Microsoft Keyboard Layout Creator: http://www.microsoft.com/globaldev/tools/msklc.mspx

Mac OS X: Ukelele: http://scripts.sil.org/ukelele

Linux: Keyboard Mapping for Linux: http://kmfl.sourceforge.net/

Handy tools for inserting characters:
•  http://scripts.sil.org/inputtoollinks

Using Unicode

Entering characters

Windows XP:
•  Windows XP character map - to find Character Map: Programs > Accessories > System Tools > Character Map
•  Use Alt-+-XXXX (e.g. Alt+00E9 [hex code]).
•  Some applications (e.g. MS Word) support typing the hex code followed by Alt-x.

Mac OS X:
•  Mac - character palette
•  Set up the Unicode Hex Input Keyboard (International Preferences).
•  Use Option-XXXX (e.g. Option-00E9).

Questions / discussion?

Archiving

Archives, old places with old stuff

http://www.kings.cam.ac.uk/library/archives/images/archives3.jpg

According to Jacobs & Humphrey (2004), ‘Data archiving is a process, not an end state where data is simply turned over to a repository at the conclusion of a study. Rather, data archiving should begin early in a project and incorporate a schedule for depositing products over the course of a project’s life cycle and the creation and preservation of accurate metadata, ensuring the usability of the research data itself. Such practices would incorporate archiving as part of the research method.’

Jacobs, James A., & Charles Humphrey. (2004). ‘Preserving Research Data.’ Communications of the ACM 47(9): 27-29.

Archiving

If we make records of languages we need some way of making sure they last

DVDs of field recordings archived directly from the northern Philippines

Reusable and interoperable research data

Why archive?

•  Our responsibility to ensure long-term access and availability of the data we record
•  For the speakers and their descendants
•  Centrality of data in language documentation

Endangered data

•  Digital word processing is our most advanced writing technology to date, but it is also the most ephemeral.

•  Hardware and software technologies are changing so rapidly that a typical storage medium or file format is obsolete within 5 to 10 years.

•  Digital records of endangered languages are in danger of dying out before the languages themselves

What should we do?

1.  Put the materials into an enduring file format.

2. Provide sufficient metadata to make the files discoverable.

3. Deposit the materials with an archive that will make a practice of migrating them to new storage media as needed.
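Step 2 can be sketched as a small completeness check on a metadata record. The element names below come from the Dublin Core element set; treating these six as the required minimum is an assumption of this sketch, not an archive's rule, and every value in the record is a placeholder.

```python
import json

# Dublin Core element names; which ones are "required" is assumed here.
REQUIRED = ("identifier", "title", "creator", "date", "language", "format")


def missing_fields(record):
    """Return the required elements that are absent or empty."""
    return [key for key in REQUIRED if not record.get(key)]


# Hypothetical record: all values are placeholders for illustration.
record = {
    "identifier": "XY1-20140616-01",
    "title": "Story told by speaker A",
    "creator": "",                 # left empty on purpose
    "date": "2014-06-16",
    "language": "und",             # ISO 639 code for an undetermined language
    "format": "audio/wav",
}

if __name__ == "__main__":
    print(missing_fields(record))
    print(json.dumps(record, indent=2))
```

Running a check like this before deposit catches gaps while the person who can fill them is still at hand.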

Archiving formats

•  Text – XML, pdf, txt
•  Media – Audio pcm/wav/BWF
•  Images – TIF
•  Video – JPEG2000, MXF

•  How do we get our data into these formats?
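A first step toward answering that question is finding out which working files are not yet in one of the listed formats. The sketch below checks by file extension only; the extension-to-format mapping is an assumption (for example, that BWF audio carries a `.wav` extension and that archival video arrives as `.mxf`), and an extension says nothing about what is actually inside the file.

```python
from pathlib import PurePath

# Assumed mapping from the formats listed above to common file extensions.
ARCHIVAL_EXTENSIONS = {
    "text": {".xml", ".pdf", ".txt"},
    "audio": {".wav"},
    "image": {".tif", ".tiff"},
    "video": {".mxf"},
}


def archival_category(filename):
    """Return the category whose archival formats include this extension, else None."""
    ext = PurePath(filename).suffix.lower()
    for category, extensions in ARCHIVAL_EXTENSIONS.items():
        if ext in extensions:
            return category
    return None


def needs_conversion(filenames):
    """List the files whose extension matches none of the archival formats."""
    return [name for name in filenames if archival_category(name) is None]


if __name__ == "__main__":
    working_files = ["NT1-98002-A.wav", "photo01.jpg", "notes.txt", "clip.mp4"]
    print(needs_conversion(working_files))
```

Files flagged this way are candidates for conversion or, better, for re-export from the original at full resolution.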