Session 2 - The University of Texas at Arlington · Nick Thieberger
Quiz
In a morning recording session you recorded two speakers, each telling a story, and then recorded yourself asking both of them together to describe, this time in English, what they had told you.
This was done in 3 audio files, 2 video files (of which the audio is also recorded) and 3 photographs. In addition, you have handwritten notes.
Later a speaker (whom you are paying for her assistance) will transcribe the media.
How would you prepare this dataset? What files would there be? What naming conventions would you use? How would you name the transcripts?
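One way to think about the naming question is sketched below. The convention (collector code, date, running item number, media extension), the helper names `item_name` and `transcript_name`, the code `XX1` and the date are all assumptions made for illustration, not a prescribed answer to the quiz.

```python
from datetime import date

def item_name(prefix: str, day: date, item: int, ext: str) -> str:
    """Build a name such as XX1-20240115-01.wav (hypothetical convention)."""
    return f"{prefix}-{day:%Y%m%d}-{item:02d}.{ext}"

def transcript_name(media_name: str, ext: str = "txt") -> str:
    """Give the transcript the same stem as the media file it describes."""
    stem = media_name.rsplit(".", 1)[0]
    return f"{stem}.{ext}"

session = date(2024, 1, 15)  # placeholder date
inventory = (
    [item_name("XX1", session, n, "wav") for n in (1, 2, 3)]    # 3 audio files
    + [item_name("XX1", session, n, "mov") for n in (4, 5)]     # 2 video files
    + [item_name("XX1", session, n, "jpg") for n in (6, 7, 8)]  # 3 photographs
    + [item_name("XX1", session, 9, "pdf")]                     # scanned notes
)

for name in inventory:
    print(name)
print(transcript_name(inventory[0]))
```

Keeping the transcript stem identical to the media stem means the link between the two survives even if the files are separated.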
Toolbox IGT
Id number - Media reference
Text line
Parsed Morphemic line
Gloss line
Free translation line
How to get from the field to analysis to the archive?
• Using good tools, for example:
– Transcriber, Elan
– Toolbox
– Fieldworks
– Why are these good tools?
Transcription with time-alignment
• Necessary step in building a corpus from which to make generalizations.
• No extra cost to use time-alignment
• Index of media
• Many possible outputs
• Tools create simple text files that encode the relationship of the text to the timecodes in the media
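The idea of a text file acting as an index of the media can be sketched as follows. The tab-delimited layout (start time, end time, text) is an assumption for illustration and not the native format of any particular tool; the second sample line is a placeholder.

```python
from typing import NamedTuple

class Segment(NamedTuple):
    start: float  # seconds from the start of the media file
    end: float
    text: str

def read_segments(lines):
    """Parse lines of the assumed form: start <TAB> end <TAB> text."""
    segments = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        start, end, text = line.split("\t", 2)
        segments.append(Segment(float(start), float(end), text))
    return segments

def find(segments, word):
    """Return the (start, end) times of every segment containing the word."""
    hits = []
    for seg in segments:
        tokens = [t.strip(".,;:!?") for t in seg.text.lower().split()]
        if word.lower() in tokens:
            hits.append((seg.start, seg.end))
    return hits

sample = [
    "316.43\t320.317\tPreg nafkal skot nam̃er nig Emlakul.",
    "321.0\t324.5\t(placeholder for the next utterance)",
]
segments = read_segments(sample)
print(find(segments, "Emlakul"))
```

A search of the text returns timecodes, so any word in the transcript leads straight back to the point in the recording where it was said.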
Other transcription tools
http://liceu.uab.es/~joaquim/phonetics/fon_anal_acus/herram_anal_acus.html
Other annotation tools
Anvil, ATLAS, CLAN, CSLU Toolkit, The MATE Workbench, Multitool, SignStream, SmartKom, SyncWRITER, TalkBank, Transana, Voicewalker
List of tools at this wiki: http://www.exmaralda.org/annotation/index.php/Main_Page
What is so good about these programs: Toolbox, Flex, Elan, etc.?
• They produce good textual outputs
• Simple (but structured) text can be easily converted and archived
Working form: The form in which information is stored as it is created and edited.
Archival form: The form in which information is stored for access long into the future.
Presentation form: The form in which information is presented to the public.
Simons (2004)
Working form: The form in which information is stored as it is created and edited.
Can include notes, not all of which may be useful later.
E.g., files being processed (in Elan, Toolbox or Flex) and ancillary files (.typ, .prj, etc) that are only necessary while working to create the annotated data.
Archival form: The form in which information is stored for access long into the future.
Highest resolution form of the data
Presentation form: The form in which information is presented to the public.
Derived from working or archival form. May be compressed and arranged to make it easier to deliver and interpret.
\id 061:005
\aud NT1-98002-A 316.43 320.317
\tx Preg nafkal skot nam̃er nig Emlakul.
\mr preg nafkal skot nam̃er nig Emlakul
\mg make war with people of p.name
\fg To fight the people of Malakula.
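Because the record above is simple structured text, it can be read with a few lines of code. The following is a minimal sketch, assuming one backslash marker per line and that the \mr and \mg lines can be paired by whitespace-separated tokens (Toolbox itself aligns them by column position); the function names are hypothetical.

```python
RECORD = r"""
\id 061:005
\aud NT1-98002-A 316.43 320.317
\tx Preg nafkal skot nam̃er nig Emlakul.
\mr preg nafkal skot nam̃er nig Emlakul
\mg make war with people of p.name
\fg To fight the people of Malakula.
"""

def parse_record(text):
    """Map each field marker (without its backslash) to the field's content."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue
        marker, _, value = line[1:].partition(" ")
        record[marker] = value.strip()
    return record

def media_reference(record):
    """Split the \\aud field into media name, start time and end time."""
    name, start, end = record["aud"].split()
    return name, float(start), float(end)

def morpheme_pairs(record):
    """Pair each morpheme with its gloss (assumes equal token counts)."""
    return list(zip(record["mr"].split(), record["mg"].split()))

rec = parse_record(RECORD)
print(media_reference(rec))
print(morpheme_pairs(rec))
```

The same parsed record could feed a web page, a printed text collection or a concordance, which is the sense in which structured text is easy to convert.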
Reuse Recording >Transcript > Interlinear
Longevity of character encoding
• Problems of legacy fonts – fonts as variable representation of one underlying form
• e.g., ASCII character ‘N’ could be displayed as ŋ if IPATimes was selected as the font
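A legacy-font text can be made portable by replacing each stored character with the Unicode character it was meant to display. This is a sketch only: the single mapping below (N to ŋ) comes from the example above, a real font would need its full mapping table, and the sample string is made up.

```python
# Hypothetical one-entry table: stored ASCII character -> intended Unicode character.
LEGACY_TO_UNICODE = str.maketrans({"N": "\u014b"})  # U+014B LATIN SMALL LETTER ENG

def legacy_to_unicode(text: str) -> str:
    """Replace legacy-font stand-ins with the characters they represent."""
    return text.translate(LEGACY_TO_UNICODE)

print(legacy_to_unicode("aNa"))
```

Note that a blind replacement like this also converts any genuine capital N, so it should only be applied to text known to have been typed in the legacy font.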
Need for documents to be legible!
• Characters must be portable so that the documents are legible
• International standard for character encoding - Unicode
Unicode issues - Choose characters carefully! cf. Mia Kalish’s article in LD&C (2) - mixing different ‘code sets’ can lead to sorting issues.
à - option a
à - a plus combining diacritic U+0300
à - U+00E0
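The sorting problem can be seen by comparing the precomposed and the combining versions of à. This sketch uses Python's standard `unicodedata` module; choosing NFC as the target normalization form is an assumption, and the point is only that one form should be used consistently.

```python
import unicodedata

precomposed = "\u00e0"  # à as a single code point, U+00E0
decomposed = "a\u0300"  # a followed by combining grave accent, U+0300

# The two look identical on screen but are different strings.
print(precomposed == decomposed)

# Sorted by code point, they end up on either side of "b".
words = [precomposed, decomposed, "b"]
print(sorted(words))

# After normalizing everything to one form, they compare and sort together.
normalized = [unicodedata.normalize("NFC", w) for w in words]
print(sorted(normalized))
```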
Using Unicode
Setting up keyboards for more extensive Unicode entry
Windows XP:
• Tavultesoft Keyman: http://www.tavultesoft.com/keyman/
• SIL IPA keyboard for Keyman: http://scripts.sil.org/Keyman
• Microsoft Keyboard Layout Creator: http://www.microsoft.com/globaldev/tools/msklc.mspx
MacOS X: Ukelele: http://scripts.sil.org/ukelele
Linux: Keyboard Mapping for Linux: http://kmfl.sourceforge.net/
Handy tools for inserting characters:
• http://scripts.sil.org/inputtoollinks
Using Unicode
Entering characters
Windows XP:
• Windows XP character map. To find Character Map: Programs > Accessories > System Tools > Character Map
• Use Alt-+-XXXX (e.g. Alt+00E9 [hex code]).
• Some applications (e.g. MS Word) support typing the hex code followed by Alt-x.
Mac OS X:
• Character palette
• Set up the Unicode Hex Input keyboard (International Preferences).
• Use Option-XXXX (e.g. Option-00E9).
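The four-digit hex codes used above are Unicode code points. As a small illustration (not part of any of the keyboard tools listed), Python's built-in functions and the standard `unicodedata` module can convert between a hex code, the character and its official name.

```python
import unicodedata

def from_hex(code: str) -> str:
    """Return the character for a hex code point such as '00E9'."""
    return chr(int(code, 16))

def to_hex(char: str) -> str:
    """Return the code point of a single character in U+XXXX notation."""
    return f"U+{ord(char):04X}"

ch = from_hex("00E9")
print(ch, to_hex(ch), unicodedata.name(ch))
```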
Archives, old places with old stuff
http://www.kings.cam.ac.uk/library/archives/images/archives3.jpg
According to Jacobs & Humphrey (2004), ‘Data archiving is a process, not an end state where data is simply turned over to a repository at the conclusion of a study. Rather, data archiving should begin early in a project and incorporate a schedule for depositing products over the course of a project’s life cycle and the creation and preservation of accurate metadata, ensuring the usability of the research data itself. Such practices would incorporate archiving as part of the research method.’
Jacobs, James A., & Charles Humphrey. (2004). ‘Preserving Research Data.’ Communications of the ACM 47(9): 27-29.
DVDs of field recordings archived directly from the northern Philippines
Reusable and interoperable research data
Why archive?
• Our responsibility to ensure long-term access and availability of the data we record
• For the speakers and their descendants
• Centrality of data in language documentation
Endangered data
• Digital word processing is our most advanced writing technology to date, but it is also the most ephemeral.
• Hardware and software technologies are changing so rapidly that a typical storage medium or file format is obsolete within 5 to 10 years.
• Digital records of endangered languages are in danger of dying out before the languages themselves.
What should we do?
1. Put the materials into an enduring file format.
2. Provide sufficient metadata to make the files discoverable.
3. Deposit the materials with an archive that will make a practice of migrating them to new storage media as needed.
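Steps 2 and 3 can be supported by a small metadata record kept alongside each file. The sketch below is an illustration under stated assumptions: the field names are hypothetical rather than taken from a metadata standard, the file is a placeholder created on the fly, and the checksum is included so that a later copy on new storage media can be checked against the original.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def describe(path: Path, title: str, language: str, creator: str) -> dict:
    """Build a minimal descriptive record for one file (hypothetical fields)."""
    data = path.read_bytes()
    return {
        "identifier": path.stem,
        "title": title,
        "language": language,
        "creator": creator,
        "format": path.suffix.lstrip("."),
        "bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),  # lets a migrated copy be verified
    }

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "XX1-20240115-01.wav"  # placeholder name
    sample.write_bytes(b"placeholder, not real audio")
    record = describe(sample, "Story 1", "(language name)", "(collector)")

print(json.dumps(record, indent=2))
```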