Towards a Methodology for Constructing and Annotating Historical Corpora
Martin Durrell, Paul Bennett & Astrid Ensslin

Upload: mona-kraemer

Post on 06-Apr-2016

Talk outline

A. Background to corpus
B. Early Modern German newspapers
C. Methodology

A. Background to corpus

GerManC project

Pilot (2006-07): corpus of German newspapers
– 1650-1700; 1701-1750; 1751-1800
– NG, WCG, ECG, WUG, EUG
– 2,000-word samples
– ca. 100,257 words in total

ESRC (Economic and Social Research Council)

Feasibility of
– Text compilation
– TEI annotation
– Lemmatising and POS tagging software usage / modification
– Possible utilisation for other historical languages

Preliminary findings

Data retrieval successful
– Abundance in NG
– Difficulties with WUG / Catholic regions

Regional text distribution does not reveal corresponding regional variation / linguistic norms
– Cf. correspondents from elsewhere, un-edited contributions

NG newspapers more supraregional than SG

Clearly marked regional variants (Samstag vs. Sonnabend) in same text

Polenz II: 18

„Zeitungen wurden so – nach der Luther-Bibel – auch zum wirksamsten Mittel der Popularisierung und Verbreitung einheitlicher Sprachvarianten auf dem Wege zur nationalen Schriftsprache“

(Newspapers thus came to be, after Luther’s Bible translation, the second most effective means of popularising and disseminating uniform linguistic variants on the way to a national written language)

B. Early Modern German newspapers

Early developments

Aviso, Wolfenbüttel: 1609

Official postal system

Rapid growth:
– 1648: ca. 48
– 1700: 60-90
– 1750: 100-120
– 1789: ca. 200

Mainly weekly; first daily: 1660 (Leipzig)

A new ‘genre’

By 1700: consistent features – distinct register

Informative / objective -> personal / commenting

Dissemination: local; mostly urban middle classes and tradespeople (but awareness amongst manual classes)

Read aloud to groups

Weber, 2005

Lexis

Abundance of loan words
– Military and warfare: French/Italian (attaquiren, mainteniren, susteniren...)
– Education, religion and law: Latin

Appropriations

Quotations

Inflections

=> Mostly educated readers

Syntax

Admoni (1980: 35): ‘abperlend’

Information structure > syntactic coherence

Subclauses: verb-final

Auxiliaries omitted

Future research: (shallow) parsing

Cf. Demske et al., 2004; Demske, 2006; Demske-Neumann, 1990

Register

No time for editing

Sequentialisation (non-chronological)

Compressed into huge, complex sentences

Only common denominator of all subclauses: provenance of report

Wide range of topics within individual sentences

‘Orthography’

Major challenge for electronic processing

Extremely variable (esp. 1650-1700)

Variation decreases

Not random: ey-ei (beym), ff-f (auff), Londen-London
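
Because the variation is patterned rather than random, part of it can be captured by rewrite rules. The following is a minimal Python sketch, not the project's actual normalisation program: the rule set, the `WORD_MAP` lookup, the `STOPLIST` entry and the function name `normalise` are all assumptions made for illustration, built only from the three variant pairs listed above.

```python
import re

# Illustrative rules only, derived from the variant pairs on the slide.
# A real rule set would have to be validated against the corpus.
RULES = [
    (re.compile(r"ey"), "ei"),   # beym -> beim
    (re.compile(r"ff$"), "f"),   # auff -> auf (word-final ff only)
]

# Whole-word replacements for variants that no general rule covers.
WORD_MAP = {"Londen": "London"}

# Hypothetical stoplist: forms that the rules must leave untouched.
STOPLIST = {"Schiff"}


def normalise(token: str) -> str:
    """Map a historical spelling to a modern-looking form (sketch)."""
    if token in STOPLIST:
        return token
    if token in WORD_MAP:
        return WORD_MAP[token]
    for pattern, replacement in RULES:
        token = pattern.sub(replacement, token)
    return token


if __name__ == "__main__":
    for word in ["beym", "auff", "Londen", "Schiff"]:
        print(word, "->", normalise(word))
```

Without the stoplist entry, the word-final ff rule would turn Schiff into Schif, which is why rules of this kind need an exception list.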

Punctuation

Virgula tends to replace comma / full stop / colon / semicolon

Non-syntactic, prosodic, rhetorical

Non-systematic (e.g. full stop not always marking sentence boundaries)

Example:…

Die Zeitungen Der Gelehrten Auß dem Schweitzerlande, Zürich/Schaffhausen 1722, Num. XI, p. 183

Das andre bemühet seyen der ihrigen nachzudencken, ist wahr/ denn obschon die Glieder der Gesellschaft an sich bisher verborgen gewesen/ so hat doch Melissantes ihren Humor und ihre Conduite kräfftig entworffen; aber positiv ist noch nicht zu sagen/ wie Geistreich die Personen seyen/ die euch mit eben so lebhafften Gedancken betrachten, als ihr andre vorzustellen fähig, denn eure Fähigkeit ist uns noch nicht genugsam bekannt; Man hat schon viel davon erfahren/ hoffet gleichwol allezeit/ das Beste werde noch folgen/ und die werden schlecht bestehen/ die euch nicht wol wollen.

C. Methodology

Breslau 1683

Vienna 1780

Halle 1724

Frankfurt 1750

Munich 1702

Lindau 1685

Digitisation: raw corpus

Manual transcription (scanning infeasible)

Double-keying

Text comparison

Omitting
– long passages in foreign languages and non-prose (verse, tables, graphs)
– illegible / damaged passages

Normalised: superscript-e, nasal bar, long/final ‘s’

Corpus documentation (throughout)
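
The double-keying and text-comparison steps can be sketched in Python with the standard `difflib` module. This is only an illustration under stated assumptions: the function name `keying_differences` is hypothetical, the comparison is word-based, and the second keying below contains an invented error (jhren) inserted purely to show a mismatch. The tool the project actually used for comparison is not named in the talk.

```python
import difflib


def keying_differences(first: str, second: str) -> list[tuple[str, str]]:
    """Return (first_keying, second_keying) pairs where two transcriptions disagree."""
    a, b = first.split(), second.split()
    # autojunk=False so that frequent words are never treated as noise.
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    keying_1 = "so hat doch Melissantes ihren Humor"
    keying_2 = "so hat doch Melissantes jhren Humor"  # invented keying error
    for left, right in keying_differences(keying_1, keying_2):
        print(f"check against original: {left!r} vs {right!r}")
```

Each reported pair marks a place where a transcriber has to go back to the source image, so identical keyings produce an empty list.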

XML-annotation

TEI U5 standards (Burnard & Sperberg-McQueen, 2002)

Exchanger XML Editor

CLaRK: automatic conformance checking

Header data (administrative metadata)

Markup (see example):
– Loans / foreign languages
– Names / referring strings
– Numbers / dates / times
– Graphic features (images, lines, ornaments etc.)
– Header / footer
– Abbreviations incl. expansions
– Special characters (nasal bar, superscript-e), ligatures, diacritics
– Formatting (fonts, paragraphs, sentences, quotes, line/page breaks)
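
To make the markup categories concrete, here is a small Python sketch that builds an annotated fragment with the standard `xml.etree.ElementTree` module. It is an assumption-laden illustration, not the project's encoding: the element and attribute names (`s`, `name type`, `foreign lang`) follow TEI Lite conventions as generally documented, the decision to tag Humor as a Latin loan and Conduite as a French one is a guess made for the example, and the function name `build_fragment` is hypothetical. The words come from the 1722 passage quoted earlier.

```python
import xml.etree.ElementTree as ET


def build_fragment() -> ET.Element:
    """Build one annotated clause with a name and two loan words (sketch)."""
    s = ET.Element("s")
    s.text = "so hat doch "

    name = ET.SubElement(s, "name", {"type": "person"})
    name.text = "Melissantes"
    name.tail = " ihren "

    latin = ET.SubElement(s, "foreign", {"lang": "la"})  # assumed language tag
    latin.text = "Humor"
    latin.tail = " und ihre "

    french = ET.SubElement(s, "foreign", {"lang": "fr"})  # assumed language tag
    french.text = "Conduite"
    french.tail = " kräfftig entworffen"
    return s


if __name__ == "__main__":
    fragment = build_fragment()
    print(ET.tostring(fragment, encoding="unicode"))
    # The plain text is recoverable by dropping the tags again.
    print("".join(fragment.itertext()))
```

Keeping the running text recoverable from the annotated form is what lets the same file feed both the XML editor and the tagging software.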

Erfurt1769.xml

Software development

Previous research
– VARD (English) (Rayson et al., 2005)
– Mercurius Treebank (EMG) (Demske et al., 2004; Demske, 2006)
– Variant retrieval (Ernst-Gerlach; Fuhr; Pilz; Hauser…)

GerManC work in progress
– TreeTagger adaptation (tokenisation, lemmatisation, POS)
  – Adding to lexicon
  – Abbreviation tokeniser
– Variant normalisation program (VARD-based) with stoplist
– Example:…

TreeTagger output for Altonaischer Mercurius, 15 November 1698

Tagger: ca. 85% accuracy

Lemmatiser: ca. 80% accuracy
– Spelling!

Unknown words can easily be added to the tagger’s lexicon.

(extract)
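
The accuracy figures and the point about unknown words can be illustrated with a short Python sketch. It assumes TreeTagger's usual three-column, tab-separated output (token, tag, lemma) with `<unknown>` marking a lemma the lexicon lacks. The `SAMPLE` rows are invented for the example and are not the Altonaischer Mercurius output; the function names are hypothetical.

```python
# Invented sample in token<TAB>tag<TAB>lemma form, not real project output.
SAMPLE = "die\tART\tdie\nZeitungen\tNN\tZeitung\nauff\tAPPR\t<unknown>\n"


def parse_tagger_output(text: str) -> list[tuple[str, str, str]]:
    """Split tagger output into (token, tag, lemma) rows, skipping malformed lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:
            rows.append((parts[0], parts[1], parts[2]))
    return rows


def unknown_tokens(rows: list[tuple[str, str, str]]) -> list[str]:
    """Tokens without a lemma: candidates for a lexicon entry."""
    return [token for token, _tag, lemma in rows if lemma == "<unknown>"]


def tag_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Share of positions where the predicted tag equals the hand-checked tag."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("tag sequences must be non-empty and of equal length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


if __name__ == "__main__":
    rows = parse_tagger_output(SAMPLE)
    print("lexicon candidates:", unknown_tokens(rows))
    print("accuracy:", tag_accuracy([tag for _, tag, _ in rows], ["ART", "NN", "APPR"]))
```

Listing the `<unknown>` tokens is one plausible way to decide which spellings to add to the lexicon first, since historical variants such as auff are where the lemmatiser tends to fail.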

Further software development

Lexical searches

1. Word lists + source file(s)

2. Frequency lists (also for bi-/trigrams)

3. First/last dates of usage

4. All words occurring in one file only
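
The planned searches map onto a few small functions. The Python sketch below is an illustration only: the corpus layout (a dictionary from file name to year and token list), the toy data and all function names are assumptions, and the project's own software is not described in enough detail here to reproduce.

```python
from collections import Counter

# Toy corpus: file name -> (year, tokens). The data are invented for illustration.
CORPUS = {
    "Breslau1683": (1683, ["auff", "den", "Sonnabend"]),
    "Erfurt1769": (1769, ["auf", "den", "Samstag"]),
    "Vienna1780": (1780, ["auf", "den", "Samstag"]),
}


def ngram_counts(tokens: list[str], n: int = 1) -> Counter:
    """Frequency list of n-grams (n=1 words, n=2 bigrams, n=3 trigrams)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def first_last_dates(corpus: dict) -> dict[str, tuple[int, int]]:
    """Earliest and latest year in which each word form is attested."""
    dates: dict[str, tuple[int, int]] = {}
    for year, tokens in corpus.values():
        for token in set(tokens):
            low, high = dates.get(token, (year, year))
            dates[token] = (min(low, year), max(high, year))
    return dates


def single_file_words(corpus: dict) -> set[str]:
    """Word forms that occur in exactly one file."""
    files_per_word: Counter = Counter()
    for _year, tokens in corpus.values():
        files_per_word.update(set(tokens))
    return {word for word, count in files_per_word.items() if count == 1}


if __name__ == "__main__":
    print(ngram_counts(CORPUS["Erfurt1769"][1], n=2).most_common())
    print(first_last_dates(CORPUS)["auf"])
    print(sorted(single_file_words(CORPUS)))
```

Counting each word once per file in `single_file_words` keeps a form that is frequent in a single newspaper from being mistaken for a widespread one.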

Applications

Historical sociolinguistics
– ‘Tracing variation in standardisation: a corpus-based approach’

Cultural/media studies
– Anglo-German linguistic relations
– ‘“Im Unterhause groß Getöse”: representations of 18th century British parliamentary democracy in Early Modern German newspaper discourse’

GerManC2 – future plans

+ 700,000 words

+ 7 genres
– ‘oral’: personal letters, drama, sermons
– ‘written’: academic writing, legal documents, prose fiction, medical texts

Multi-layer stand-off annotation / relational database architecture

Representation / visualisation / web interface

Feasibility of parsing (e.g. shallow parsing)

References

Admoni, Wladimir (1980) Zur Ausbildung der Norm der deutschen Literatursprache im Bereich des neuhochdeutschen Satzgefüges (1470 - 1730). Ein Beitrag zur Geschichte des Gestaltungssystems der deutschen Sprache. Berlin: Akademie-Verlag.

Burnard, Lou & Sperberg-McQueen, C.M. (2002) TEI U5: Encoding for Interchange: an introduction to the TEI <http://www.tei-c.org/Lite/teiu5_en.xml> (10/11/06).

Demske, Ulrike (2006, forthcoming) ‘Das Mercurius-Projekt. Eine Baumbank für das Frühneuhochdeutsche’, in: G. Zifonun & W. Kallmeyer (eds) Jahrbuch des Instituts für deutsche Sprache 2006. Berlin: de Gruyter.

Demske, Ulrike, Frank, Nicola, Laufer, Stefanie & Stiemer, Hendrik (2004) ‘Syntactic Interpretation of an Early New High German Corpus’, in S. Kübler et al. (eds) Proceedings of the Third Workshop on Treebanks and Linguistics Theories (TLT 2004), pp. 175-182. Tübingen, available at <www.sfs.uni-tuebingen.de/tlt04/> (10/11/06).

Demske-Neumann, Ulrike (1990) ‘Charakteristische Strukturen von Satzgefügen in den Zeitungen des 17. Jh.’, in A. Betten (ed) Neuere Forschungen zur historischen Syntax des Deutschen. Referate der Internat. Fachkonferenz Eichstätt 1989, pp. 239-252. Tübingen: Niemeyer.

Ensslin, Astrid, Durrell, Martin & Bennett, Paul (2006) ‘Tracing Variation in Standardisation: a Corpus-based Approach’, available at <www.llc.manchester.ac.uk/Research/Projects/GerManCproject/thefile,73645,en.pdf> (27/11/06)

Ernst-Gerlach, Andrea & Fuhr, Norbert (2006) ‘Generating Search Term Variants for Text Collections with Historic Spellings’, in Proceedings of the 28th European Conference on Information Retrieval Research (ECIR 2006), available at <www.is.informatik.uni-duisburg.de/bib/pdf/ir/Ernst_Fuhr:06.pdf> (13/04/06).

ESRC (Economic and Social Research Council) <www.esrc.ac.uk/ESRCInfoCentre/index.aspx> (10/11/06).

Kytö, M. (1996) ‘Manual to the Diachronic Part of the Helsinki Corpus of English Texts’, 3rd edition. University of Helsinki: Department of English <http://icame.uib.no/hc/index.htm> (10/11/06).

Pilz, Thomas, Luther, Wolfram, Ammon, Ulrich & Fuhr, Norbert (2005) ‘Regelbasierte Suche in Textdatenbanken mit nichtstandardisierter Rechtschreibung’, in Proceedings ACH/ALLC 2005, Victoria, 15 - 18 Jun 2005, available at <www.is.informatik.uni-duisburg.de/bib/pdf/ir/Pilz_etal:05.pdf> (13/04/06).

Polenz, Peter von (1994) Deutsche Sprachgeschichte vom Spätmittelalter bis zur Gegenwart, Band II, 17. und 18. Jahrhundert, Berlin: de Gruyter.

Rayson, Paul, Archer, Dawn & Smith, Nicholas (2005) ‘VARD versus Word: A Comparison of the UCREL Variant Detector and Modern Spell Checkers on English Historical Corpora’, in Proceedings of Corpus Linguistics 2005, Birmingham University, July 14-17, Proceedings from the Corpus Linguistics Conference Series on-line e-journal, vol. 1, no. 1, available at <www.comp.lancs.ac.uk/computing/users/paul/publications/cl2005_vardword.pdf> (13/04/06).

Stolt, Birgit (1990) ‘Redeglieder, Informationseinheiten: Cola und Commata in Luthers Syntax’, in A. Betten (ed) Neuere Forschungen zur historischen Syntax des Deutschen. Referate der Internat. Fachkonferenz Eichstätt 1989, pp. 379-392. Tübingen: Niemeyer.

Weber, Johannes (2005) ‘Straßburg 1605: Die Geburt der Zeitung’, in H. Böning, A. Kutsch & R. Stöber (eds) Jahrbuch für Kommunikationsgeschichte, vol. 7, pp. 3-26. Stuttgart: Franz Steiner Verlag.