
Post on 16-Feb-2022


Body of language data
- Various types
  - Spoken
  - Text
  - Images
  - Gestures
  - Structured (aligned, annotated, treebanks, etc.)
- Very valuable resource for linguist(ic)s

- Language instruction
- Computer systems development
- Training, testing/evaluating systems
- Task analysis
- Knowledge source development (dictionaries, lexicons, etc.)

alumino silicate fibre holding an helical wire set in grooves inner
e target have been determined. An ytterbium target of 4 g/cm{sup 2} ha
ted receptor with borohydride, an 3H-labeled alcohol is released, sugg
from the deposition chamber to a UHV chamber equipped with Auger spec
king the donor nitrogen atoms. An x-ray diffraction structural analysi
s discussed. An application of an hydrodynamic study in the North Sea
his value is then corrected by an magnification factor called Ke that
es and concludes that they are an wholly inadequate response to the
ron sputtering. Preparation of an Y-Ba-Cu-O film directly on MgAl{sub
s radiographic sign appears as an horizontal line between two soft
onnecting the gas supply lines an gas evacuation lines to each of the
. The burners were fired using a UK coal (pulverised at CERCHAR) and,
e system, octopus rhodopsin is an 11-cis pigment, while the photoprodu
low influenced the mobility of an herbicide which was adsorbed by the
pensation, i.e. <e/h>=0.76, at an hadronic energy resolution of {sigma
he structure of earthen seals. An saturated environment will need to b
ray and gamma-ray observations an substantially underestimate the spec
erature in the same way as for an homogeneous dirty type II supercondu
great reduction in their cost, an great increase in electricity rates,
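
The concordance above centres each hit on the article ("a"/"an") with a fixed window of context on either side. A minimal sketch of how such a keyword-in-context display could be produced in Python follows. The function name `kwic`, the sample text, and the search pattern are illustrative assumptions, not the tool that generated these lines.

```python
import re

def kwic(text, pattern, width=30):
    """Return keyword-in-context lines for every regex match of `pattern`."""
    lines = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in one column.
        lines.append(f"{left:>{width}}{m.group(0)}{right}")
    return lines

# Made-up sample text that reuses three phrases from the concordance.
sample = ("The burners were fired using a UK coal. "
          "This sign appears as an horizontal line. "
          "It is an 11-cis pigment.")

# Hypothetical query: the article "a" or "an" as a whole word, any case.
hits = kwic(sample, r"\b[Aa]n?\b", width=25)
for line in hits:
    print(line)
```

Truncating the context at a fixed character width, rather than at word boundaries, is what produces the clipped words at the edges of each line.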

Marking up corpora for interesting information
- Requires expertise
- Costly, time consuming
- Valuable information can be mined
  - Word frequency, distribution
  - Document classification
  - etc.
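
Word frequency is the simplest kind of information to mine from a corpus. The sketch below counts word tokens in a toy sentence; the function name `word_frequencies`, the tokenizing pattern, and the sample text are assumptions made for illustration, and a real project would need a tokenizer suited to its language and markup.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text, split it into word tokens, and count them."""
    # A token is a run of letters, optionally joined by apostrophes or hyphens.
    tokens = re.findall(r"[^\W\d_]+(?:['’-][^\W\d_]+)*", text.lower())
    return Counter(tokens)

sample = "The cat saw the dog. The dog saw nothing."
freqs = word_frequencies(sample)
total = sum(freqs.values())

# Print the three most frequent words with raw and relative frequency.
for word, count in freqs.most_common(3):
    print(f"{word}\t{count}\t{count / total:.2%}")
```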

- Text
- Speech
- Discourse
- Bitext
- Experimental transcripts
- Competition datasets

- Electronic text centers
- Digital libraries
  - Project Gutenberg
  - AlexLit
  - Bibliomania
- Corpus collections
- Wiki
- The web

Compiled and deployed for specific purposes/audiences
- Genres
- Language change
- Dialects
- Usage
- Training machines

Text
- File formats: ASCII, EBCDIC, Unicode, proprietary
- With or without markup (rtf, html, etc.)
- Application specific (doc, wpd, etc.)
- Can vary widely across languages

Speech
- Huge amount of variation across projects/hw/sw
- TIMIT, NIST (US Gov.), AIFF (Apple), SUNAU8 (Sun), OGI File Format, WAV (Microsoft)

Binary/machine formats
- Sound/speech: MP3, AU, WAV, RA, …
- Graphical: GIF, JPEG, BMP, WMF, …

Knowledge of a scripting language (e.g. Perl) is invaluable!
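
To make the point about text encodings concrete, here is a short sketch in Python (standing in for the Perl mentioned above) that prints the bytes of the same word under several encodings. The helper name `show_bytes` and the choice of words are assumptions for illustration; `cp037` is one of several EBCDIC code pages shipped with Python.

```python
def show_bytes(word, encodings):
    """Print the byte sequence of `word` under each named encoding."""
    for name in encodings:
        raw = word.encode(name)
        print(f"{word!r:10} {name:10} {len(raw):2d} bytes  {raw.hex(' ')}")

# The same ASCII-only word looks completely different in EBCDIC and UTF-16.
show_bytes("corpus", ["ascii", "cp037", "utf-16-le"])

# An accented word takes a different number of bytes in Latin-1 and UTF-8.
show_bytes("météo", ["latin-1", "utf-8"])

# Decoding with the wrong codec does not fail here; it silently yields mojibake.
utf8_bytes = "météo".encode("utf-8")
garbled = utf8_bytes.decode("latin-1")
print(garbled)
```

A script that processes a corpus therefore needs to know, or detect, the encoding of each file before doing anything else with the text.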

- One of the most important tools in linguistics
- Supported by most programming languages
- Future: in search engines too
- Tutorials, documentation abound
- Alas, various implementations have slightly differing conventions

- Frequency
- Genre
- Size
- Dispersion

Written only: aéronautique, bouleversement, coïncider, crépuscule, guérilla, itinéraire, jadis, laïque, logistique, météorologique, microphone, solennel

Spoken only: abusif, allô, bah, cafard, cingler, clown, copine, dingue, flic, fric, hockey, lucratif, machin, météo, micro, ouais, porno, sexy, sympa
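
The slides do not say how these two lists were compiled. One simple way to get lists of this kind, sketched below under that caveat, is to compare the vocabularies of a written and a spoken sub-corpus with set operations. The function name `exclusive_vocab` and the toy token lists are invented for illustration and are not real corpus data.

```python
def exclusive_vocab(written_tokens, spoken_tokens):
    """Return (written-only, spoken-only) word lists, each sorted."""
    written, spoken = set(written_tokens), set(spoken_tokens)
    return sorted(written - spoken), sorted(spoken - written)

# Toy token lists standing in for two tokenized sub-corpora.
written = ["jadis", "la", "météorologique", "prévision", "la"]
spoken = ["ouais", "la", "météo", "prévision"]

written_only, spoken_only = exclusive_vocab(written, spoken)
print("Written only:", ", ".join(written_only))
print("Spoken only:", ", ".join(spoken_only))
```

On real data a frequency threshold would usually be applied first, so that rare words and typos do not dominate either list.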

JLC 6

Date: Thu, 21 Feb 2013 10:40:22
University or Organization: H5
Job Location: California, USA
Web Address: http://www.h5.com
Job Rank: Consultant
Specialty Areas: Discourse Analysis; Semantics; Syntax; Text/Corpus Linguistics

About H5: H5 serves the needs of leading law firms and corporate clients, using powerful proprietary software to provide technology-assisted review and expert search consulting & research. H5’s document review and analytic services uniquely support our clients’ requirements for large-scale litigation, investigation, records retention, and regulatory compliance. H5’s "hybrid" approach to technology-assisted review combines patented information retrieval technology and expert professional services. Through this model, H5 has created a fully integrated document review system that is unparalleled in performance, as proven in independent, benchmarked studies. For more information, visit www.h5.com.

Overview: The H5 Professional Services Group includes linguists, lawyers, researchers, statisticians, e-discovery and data modeling experts and project managers. Our multidisciplinary teams use H5’s proprietary software and a well-defined process to build linguistic models that classify electronic data and support strategic search for documents that help our clients win. H5 is seeking candidates with backgrounds in linguistics (or related fields of textual corpus analysis), an affinity for developing novel search strategies, and a desire to collaborate with professional teams and sophisticated search technologies.

Primary Responsibilities:
- Analyzing linguistic data;
- Researching large corpora for linguistic patterns;
- Creating search strategies based on linguistic patterns;
- Researching subject matter and factual issues in complex litigation;
- Rapidly developing an understanding of new subject matter;
- Reading a wide variety of documents, from e-mail to academic articles;
- Synthesizing large amounts of information from a variety of sources;
- Designing, building, and testing search models unique to each project.

Key Competencies:
- Understanding of syntax, semantics, and pragmatics, in written communication;
- Experience in corpus, text, or discourse analysis a plus;
- Experience in ethnography or anthropology can be helpful, particularly as it relates to an understanding of contextual cues in text-based communication;
- Leadership skills, personal incentive and a demonstrated ability to initiate, develop, and successfully conclude projects;
- A sharp eye for detail and precise thinking;
- The ability to make analytical judgments;
- A practiced sense of order and organization;
- Ability to work under pressure and meet deadlines, both autonomously and collaboratively;
- Strong interpersonal skills, flexibility, curiosity, creativity, and collaborative spirit;
- Strong computer and software competency in a PC/Windows environment, including Microsoft Office;
- Experience in a software development environment a plus.

Minimal Qualifications:
- Solid academic credentials: advanced-undergraduate and/or graduate-level coursework in linguistics, textual corpus analysis, or related field;
- Experience applying linguistic and search expertise to real language data;
- Experience in a professional or business environment;
- Mastery of the English language.

Lots of corpora listed here that are available for BYU faculty/student use.

- corpus.byu.edu
- scriptures.byu.edu
- General Conference corpus

- Avoid duplication of effort
- Allow synergy, integration, exchange
- Specific goals
- Reusable text and tagging formats
- Representative of domain/discipline/genre
- Copyright-free

SGML (ISO standard)
- Standard Generalized Markup Language
- DTD, XOM, etc.

HTML (W3C standard)
- Hypertext Markup Language
- SGML with specific DTD

XML (W3C standard)
- Logical SGML subset
- Replacement (?) for HTML
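
As a rough illustration of why standard markup matters for reuse, the sketch below reads a tiny XML-annotated fragment with Python's standard library. The element names (`corpus`, `s`, `w`) and the `pos` attribute are hypothetical and do not follow any particular DTD or annotation scheme; the two phrases are borrowed from the concordance earlier in the slides.

```python
import xml.etree.ElementTree as ET

# Invented markup for illustration: sentences <s> containing tagged words <w>.
doc = """<corpus>
  <s id="1"><w pos="DET">an</w> <w pos="NOUN">herbicide</w></s>
  <s id="2"><w pos="DET">a</w> <w pos="ADJ">great</w> <w pos="NOUN">increase</w></s>
</corpus>"""

root = ET.fromstring(doc)

# Walk the sentences and list each word with its part-of-speech tag.
for sentence in root.iter("s"):
    tokens = [(w.text, w.get("pos")) for w in sentence.iter("w")]
    print(sentence.get("id"), tokens)

# Because the annotation is explicit, selecting all nouns is a one-line query.
nouns = [w.text for w in root.iter("w") if w.get("pos") == "NOUN"]
print(nouns)
```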

TEI
- Text Encoding Initiative
- Based on SGML
- DTD contains over 400 elements
  ▪ Most syntactic features
- TEI Lite is a commonly used logical subset