Spanish in the U.S.: Developing an Open Linguistic Corpus

Spanish in the US: Developing an open linguistic corpus. Barbara E. Bullock & Almeida Jacqueline Toribio. 24th Conference on Spanish in the United States; 9th Conference on Spanish in Contact with Other Languages. March 6-9, 2013, McAllen, Texas

Upload: spanish-in-texas-project

Post on 21-Jun-2015

DESCRIPTION

Presentation by project directors Barbara E. Bullock and Almeida Jacqueline Toribio at the 24th Conference on Spanish in the United States, March 2013 in McAllen, Texas.

TRANSCRIPT

  • 1. Spanish in the US: Developing an open linguistic corpus. Barbara E. Bullock & Almeida Jacqueline Toribio. 24th Conference on Spanish in the United States; 9th Conference on Spanish in Contact with Other Languages. March 6-9, 2013, McAllen, Texas

  • 2. Spanish in Texas Corpus Project. Purpose: to make publicly available authentic data about variation in Spanish as spoken in Texas, for education and for research.
  • 3. Motivation: document continuity and variation; understand variation in its local context; overcome the challenges of studying naturalistic data (cost: gathering, transcribing, and coding data; accountability: corpora upon which studies are based are rarely made available to the public); encourage teachers, students, and the public to view local varieties as a resource.
  • 4. Inspiration: Garland Bills, Vivian Cook, Lourdes Ortega, Ricardo Otheguy, Bonnie Urciuoli, Guadalupe Valdés, Walt Wolfram, Ana Celia Zentella; language variation in the public interest; an empirical turn in thinking about contact varieties; Ornstein-Galicia's (1981) call: investigate Spanish varieties in your own backyard, share resources, create concordances of usage.
  • 5. Impetus. What is needed: large, representative samples of oral Spanish in the U.S.; metadata about the speakers; a context and protocols for sharing architecture, scripts, analytical techniques, and data as well as findings.
  • 6. Why open? To facilitate access: attract as many eyes as possible to the same data and accelerate the production of findings, which is particularly important for the study of U.S. Spanish. To reduce costs in terms of time and money, especially for those who can least afford it.
  • 7. But: large corpora are of limited utility to untrained end users; teachers need short videos that are appropriate for classroom use; and teachers need tools to easily search videos, to author materials, and to curate their own collections.
  • 8. Two-pronged approach. Spanish in Texas Corpus Project: video interviews that provide rich content. SpinTX Corpus-to-Classroom: a collection of pre-selected, corrected, annotated clips from the larger corpus; open-source, pedagogically friendly search and authoring tools.
  • 9. Goals of this talk: document our efforts to develop an open corpus of U.S.
    Spanish, using open-source tools; define "open"; describe the protocols that we are using to convert Spanish in TX interviews to a pedagogically useful corpus; showcase materials and tools that we have for use; share our work with others who may be interested in developing open "Spanish in X" corpora; look ahead to an open sociolinguistic/computational research corpus of the full interviews of Spanish in TX.
  • 10. Origins of the project: Language Resource Center [LRC], 2010-2013; Center for Open Educational Resources and Language Learning [COERLL].
  • 11. Open Educational Resources [OER]: educational material offered freely for anyone to use, typically involving some permission to remix, improve, and redistribute (creativecommons.org).
  • 12. Spanish in Texas Media License: Attribution Required, Non-Commercial, Share-Alike.
  • 13. Spanish in Texas Corpus Project
  • 14. Spanish in Texas Corpus Project: Spanish in Texas is our first collection of video interviews and provides content for SpinTX Corpus-to-Classroom. Additional collections: Spanish in Texas code-switching (CS) collection; Hindi-English CS collection.
  • 15. Spanish in Texas Corpus Project: ideally, it will serve as a reference corpus for oral Spanish in Texas: large (1 million+ words), representative of variation, and fully open. Currently 134 interviews, approx. 600,000 words. This will help establish a better baseline for heritage language research and teaching than the traditionally assumed monolingual one.
  • 16. From Corpus to Classroom
  • 17. SpinTX: Corpus-to-Classroom. Aims: develop a pedagogically friendly interface for using the corpus; involve teachers and learners, via crowd-sourcing, social networking, and workshops, in the development of open educational resources; create a model for using open-source tools and a pedagogical interface that can be adapted for any language corpus collection.
  • 18. Funding: Department of Education, Title VI; College of Liberal Arts; Longhorn Innovation Fund for Technology [LIFT].
  • 19. Our team. Directors: Barbara E. Bullock & Almeida J.
    Toribio. Project Manager and Web Architect: Rachael Gilg. Consultants. Graphic Designer: Nathalie Steinfeld Childre. Computational Linguist: Martí Quixal. Digital Media Producer: Scott Zúñiga. Educational Technologist: Arthur Wendorf. Outreach Coordinator: Jeffrey Michno. Content Manager, Intern Coordinator: Jacqueline Larsen Serigos. Materials Development: Jesse Abing, Joshua Frank. Undergraduate Interns 2011-present: 12.
  • 20. Our team: Collaborators. University of Texas Pan American: José Esteban Hernández, Stephanie Brock, José Flores, Viridiana Gallegos, Rossy Limas, Michelle Madrid. Texas A&M International University: Patricia González, Conchita Hickey, Lisa Flores. Others: Daniel Villa, New Mexico State University; María Irene Moyna, Texas A&M University; MaryEllen García, University of Texas, San Antonio; Jens Clegg, Indiana University-Purdue University; Abby Dings, Southwestern University.
  • 21. "La gente puede ser pobre, pero compra su coca-cola" ("People may be poor, but they buy their Coca-Cola")
  • 22. From Community to Corpus
  • 23. Recruit locally: recruit and train interns (Institutional Review Board training; video shooting and audio recording; practice interviews on site); recruit family, friends, acquaintances (any Spanish-speaking resident of TX); conduct interviews in their home communities.
  • 24. Video Production Protocol: HD video cameras; professional-quality condenser microphones; interviewer and interviewee are each recorded into a separate channel; interviewer wears headphones to monitor audio.
  • 25. Interview protocol: sampling of a large set of questions (~75): from NPR StoryCorps (Historias); biographical information. Average length: 30-45 min. Language: Spanish and mixed. Consent form and talent release. Metadata on speaker and interviewer (Google Docs).
  • 26. Interview Metadata
  • 27.
    Processing the Videos. Intake interview materials: create unique ID for video and forms; archive raw video and remove from camera. Video and transcript preparation: Final Cut Pro; upload to Automatic Sync (3-5 day turnaround); convert transcript to UTF-8, upload to Google Drive collection; upload to YouTube to create synced caption file (SRT).
  • 28. From Transcript to Corpus
  • 29. Original Transcript (from Automatic Sync)
  • 30. Review Transcript in Google Docs
  • 31. Prepare Transcript for TreeTagger
  • 32. Run Transcript through TreeTagger (Spanish, English)
  • 33. Upload Video and Transcript to YouTube
  • 34. Download SRT File
  • 35. Combine Data from SRT File and TreeTagger File, and add additional Tags
  • 36. Divide CSV Files and Videos into Clips and adjust Timings and Numberings
  • 37. From Corpus to Clips Archive
  • 38. Selecting Clips
  • 39. Topic List (Manual): Abuelos, Amigos, Amor y relaciones, Comida, Criando hijos, Cultura, Escuela, Familia, Futuro, Herencia familiar, Identidad, Idioma, La infancia, Matrimonio, Padres, Religión, Texas, Trabajo.
  • 40. Automatic Pedagogical Annotations
  • 41. Manual Coding for Complex Cases: annotation of "lo" as an article that allows for the elision of nouns, as in "lo bueno de esta clase es". The rule requires a sequence of two words: "lo" followed by an adjective, possibly with some words in between (in practice only adjective modifiers, such as adverbs, since the BARRIER operator tells the scanning process to stop if a typical NP boundary is crossed).
  • 42. Automatic annotation levels for clips. Grammatical: aggregated from textbooks. Functional: greetings, ask for help, express opinions. Pragmatics: discourse markers, place holders ("este"), attenuators. Bilingual forms: CS, loans, loan translations.
  • 43. Keyword Search
  • 44. Filtering by Tags and Metadata
  • 45. Video Page
  • 46. Annotation Highlighter
  • 47. Cloze Test Generator
  • 48. Word cloud visualization tool
  • 49.
    OER Materials: Spanish in Texas searchable clip corpus available this spring (approximately 500 clips and growing); all specially created code scripts are available now through GitHub; IRB, talent release, Google metadata survey template, etc. available.
  • 50. OER Materials: in the spirit of OER, please share alike: add to the repository any pedagogical materials you or your students might develop from the Spanish in Texas clip corpus.
  • 51. Classroom and Community: we are designing the corpus and tools with the end users, using locally relevant language samples to illustrate every aspect of Spanish; users model their own language for pedagogical purposes; the corpus is the textbook.
  • 52. SpinTX: Corpus-to-Scholarship (the future)
  • 53. SpinTX: Corpus-to-Scholarship: full interviews (videotaped, captioned, POS-tagged) will be made available; syntactically parsed corpora; additional public protocols, open-source search tools.
  • 54. Corpus-to-Scholarship: Share-alike. When you use the corpus, share alike: a crowd-sourcing approach to additional annotation levels (e.g., Praat TextGrids); we'll use stand-off annotation; sociolinguists would ideally share data coding; corpus linguists would ideally share scripts. Any users could contribute their collections (video, transcript, and metadata); we'll run them through SpinTX processing.
  • 55. Archive: Spanish in Texas Corpus to be archived at the Nettie Lee Benson Latin American Collection, University of Texas Libraries.
  • 56. Websites
      Project website: http://sites.la.utexas.edu/spanishtx/
      Corpus-to-Classroom Blog: http://sites.la.utexas.edu/corpus-to-classroom/
      Facebook page: https://www.facebook.com/spanish.in.texas
  • 57. Thanks: to all of our collaborators, and especially to our students and their friends, neighbors, and families who shared their time and their language with us.
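
The "lo" + adjective rule described on slide 41 can be illustrated with a small sketch. This is not the project's actual rule or code. It assumes tokens arrive as (word, POS tag) pairs, uses hypothetical tag names `ADJ` and `ADV`, and simplifies the barrier condition by treating any token that is not an adverb as a phrase boundary.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, POS tag)

# Hypothetical tag names chosen for this sketch; a real tagset may differ.
ADJECTIVE = "ADJ"
ADVERB = "ADV"


def find_nominalizing_lo(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Return (lo_index, adjective_index) pairs for 'lo (ADV)* ADJ' sequences."""
    matches = []
    for i, (word, _tag) in enumerate(tokens):
        if word.lower() != "lo":
            continue
        # Scan rightward from "lo"; only adverbs may intervene.
        for j in range(i + 1, len(tokens)):
            tag = tokens[j][1]
            if tag == ADJECTIVE:
                matches.append((i, j))
                break
            if tag != ADVERB:
                # Simplified barrier: stop at the first non-adverb.
                break
    return matches


if __name__ == "__main__":
    # Tags on the other tokens are illustrative placeholders.
    sentence = [("lo", "ART"), ("bueno", "ADJ"), ("de", "PREP"),
                ("esta", "DET"), ("clase", "NOUN"), ("es", "VERB")]
    print(find_nominalizing_lo(sentence))
```

Stopping the scan at the first non-adverb keeps an object-pronoun "lo" that is followed by a verb from being paired with an adjective later in the sentence.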