© 2001 Hans Uszkoreit VL Einführung CL WS 01/02

Data in Linguistics

Linguistics -- by tradition -- is not an exact or empirical science. Modern linguists have attempted to transform linguistics step by step into a science.

An exact science needs formalized models and provable methods for verifying (or more often falsifying) theories.

Empirical science needs a methodology for obtaining, processing, and evaluating data and for exploiting data for the verification (falsification) of theories.

An exact empirical science needs to establish the correspondence between data and formal models. Therefore data need to be interpreted. Quantitative data require methods and tools for measurement.

In linguistics, the quantitative branch of the discipline was disconnected from the theoretical core of the field for many decades, because quantitative linguists could not measure the phenomena that were the focus of discussion. It was language technology that finally brought them together.

Example: Astronomy

Photographs and spectral analyses of distant heavenly bodies are scientific data. However, without interpretation in relation to the formal models, they are rather useless.

Data in Linguistics

A well-developed and established concept of linguistic data is still missing.

There is no good theory of the relationship between different types of data, e.g., example sentences, online performance experiments, corpora, treebanks, test suites.

However, there has been progress in several areas, e.g.,
- evaluating acceptability judgements: methodology for subjective rating tests
- annotation and interpretation of data
- methods for using quantitative data in language technology

Types of Linguistic Data 1

Linguistic data are often classified into “real” and “unreal” data depending on their origin.

However, this dichotomy does not fully cover the range of possible sources.

- naturally occurring data, e.g.,
  - (balanced) reference corpora
  - specialized corpora for specific subject domains or applications
  - incidentally discovered linguistic examples

- evoked or induced data, e.g.,
  - dialogue-scenario data
  - Wizard-of-Oz data

- invented or solicited data, e.g.,
  - sample sentences created by linguists
  - acceptability judgements solicited by linguists
  - test suites

Types of Linguistic Data 2

The dichotomy between real and unreal data does not necessarily coincide with the property of naturalness.

Linguistic examples are often considered “unnatural”. On the other hand, a large corpus may contain many sentences that are extremely unnatural.

Naturalness does not solely depend on the origin of the data.

Concept of Linguistic Data

If we view linguistics as an empirical science, pieces of linguistic knowledge have to be abstractions over linguistic data.

These abstractions are parts of our theories about the content and structure of linguistic competence and about the processes, constraints and preferences that govern linguistic performance.

Linguistic data are individual utterances, parts of utterances or collections of utterances in a certain human language (or several languages). The utterances may be represented in written or spoken form (or signed), i.e., as textual or acoustic signals.

Annotated Data

Usually these collected utterances are annotated with additional information. If the annotation does not contain a partial linguistic interpretation of the utterances, it may be considered part of the data. Annotations that do not include linguistic interpretation are, e.g.:

- judgements of native speakers on the acceptability or appropriateness of the utterance
- information on speaker(s)
- information on hearer(s) or intended audience
- information on the utterance situation (time, place, circumstances)
- information on the published source
- typographic information
- layout and document structure
- textual transcriptions of spoken utterances
- transcription of pauses

Interpreted Data

Annotations involving a partial linguistic interpretation are, e.g.:

- part-of-speech tags
- word sense information
- morphosyntactic features of words
- constituent structures for phrases or sentences
- coreference markers
- dependency structures
- predicate-argument structures
- reference identifications for term phrases
- information structures within sentences
- intonation contours
- speech acts
- discourse structures
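The distinction between annotations that count as part of the data and annotations that carry a partial linguistic interpretation can be made concrete in a small data structure. The following is only a minimal sketch, not a standard annotation format: the class `Utterance`, its fields `metadata` and `interpretation`, and the method `data_view` are names assumed here for illustration, and the example sentence, speaker label and tags are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Utterance:
    """One utterance with two separate annotation layers (hypothetical layout)."""
    text: str
    # Annotations without linguistic interpretation: treated as part of the data.
    metadata: dict = field(default_factory=dict)
    # Annotations that involve a partial linguistic interpretation.
    interpretation: dict = field(default_factory=dict)

    def data_view(self) -> dict:
        """Return the utterance with its non-interpretive annotations only."""
        return {"text": self.text, **self.metadata}


# Invented example; the part-of-speech tag names are an assumption.
example = Utterance(
    text="The cat sleeps.",
    metadata={"speaker": "S1", "mode": "written", "acceptability": "acceptable"},
    interpretation={"pos_tags": ["DET", "NOUN", "VERB", "PUNCT"]},
)

print(example.data_view())
```

Keeping the two layers apart lets the same raw data be paired with competing interpretations without changing what counts as the data itself.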

Parameters for Classification

- language: Spanish, English, German
- sublanguage/register: regional dialect, sociolect, vernacular, professional jargon, toddler speech
- text sort(s): newspaper articles, wire news, political speech, control commands
- subject domain: stock rates, flight reservations
- type of producers: professional journalist, student, radiologist
- mode of production: spoken, written, signed, Morse-coded
- medium of production: pencil, PC with MS Word, dictaphone
- conditions of production: spontaneous, carefully composed, produced under time pressure
- transmission encoding: raw ASCII code, HTML, digitized phone signal, Unicode
- medium of transmission: telephone, WWW, CB radio
- storage encoding: raw ASCII code, HTML, AIFF
- medium of storage: DAT tape, CD-ROM, hard disk
- mode of presentation: spoken, written, signed
- medium of presentation: newspaper, radio, book, TV show, theater performance
- type of intended recipients: newspaper reader, booking agent, theater audience
- number of intended recipients: point-to-point, multicast, broadcast
- synchronicity of discourse: synchronous dialogue, asynchronous
- direction: one-way, two-way
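In practice, classification parameters like these are often stored as a metadata record per document, so that subcorpora can be selected by parameter values. The sketch below illustrates the idea with a subset of the parameters. The names `DocumentProfile` and `select` are hypothetical, and the three records are invented from the example values on this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentProfile:
    """A subset of the classification parameters as a metadata record (hypothetical)."""
    language: str
    text_sort: str
    subject_domain: str
    mode_of_production: str
    storage_encoding: str


def select(profiles, **criteria):
    """Return the profiles whose fields match every given criterion."""
    return [
        p for p in profiles
        if all(getattr(p, name) == value for name, value in criteria.items())
    ]


# Invented records built from the example values listed above.
corpus = [
    DocumentProfile("English", "newspaper article", "stock rates", "written", "HTML"),
    DocumentProfile("German", "control commands", "flight reservations", "spoken", "AIFF"),
    DocumentProfile("English", "wire news", "stock rates", "written", "raw ASCII"),
]

written_english = select(corpus, language="English", mode_of_production="written")
print(len(written_english))  # 2 of the 3 invented records match
```

A flat record like this is the simplest possible design. A real corpus header would need multi-valued fields, for example several text sorts per document.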

Criteria for Usefulness

In order to be useful, data have to be representative:

- representative of a certain linguistic phenomenon
- representative of a certain text sort
- representative of the expected input to some language technology application
- representative of the expected output of some language technology application
- representative of a certain speaker, etc.

Can data be representative of an entire language?

Research Tasks

Raw data are easy to obtain today.

The demanding task lies in their linguistic interpretation.

Research tasks:

- design of annotation schemes for the levels of description
- design of exchange formats and conversion tools
- design and implementation of tools for corpus annotation
- partial automation of annotation
- design of methods and tools for quality assurance
- design of tools for using the data in research (retrieval, analysis, additional documentation)
- design of tools for using the data in application development (methods and tools for “training”)