IMPACT Final Conference - Language Parallel Sessions - Gotscharek

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 15. 10. 2011, IMPACT Conference Special resources to access 16th century German Ludwig-Maximilians-Universität München Annette Gotscharek


Page 1: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

15. 10. 2011, IMPACT Conference

Special resources to access 16th century German

Annette Gotscharek
Ludwig-Maximilians-Universität München

Page 2: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

Special resources to access 16th century German

“access”?

OCR:
Role of the lexicon: defines the set of valid words.

... Geist, Geister, Teile, gemütlich …

Information Retrieval (IR):
Role of the lexicon: meaningful expansion of the user query to increase recall.

...
Geist → Geister, Geiste, Geistern
Teil → Teile, Teils, Teilen
gemütlich → gemütlicher, gemütlichste
...

Page 3: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

Special resources to access 16th century German

In IMPACT, we worked on documents from 1500-1950, but the 16th century is special:
– Language period: Early New High German (1350-1650)
– Oldest and therefore most challenging period of printed books
– Large library holdings from the 16th century at our partner library BSB

Linguistic features of historical language at word level (historic → modern, with English gloss):
– Historical spelling variation: geyſte → Geiste (spirit)
– Historical morphology: er frug → er fragte (he asked)
– Obsolete vocabulary: mirackel → Wunder (?) (miracle)
– Obsolete character set: aͤ → ä …

→ Need for adapted linguistic resources

Page 4: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

Adapted linguistic resources: structure

OCR:

... Geist, Geister, Teile, gemütlich …

Information Retrieval (IR):

...
Geist → Geister, Geiste, Geistern
Teil → Teile, Teils, Teilen
gemütlich → gemütlicher, gemütlichste
...

Page 5: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

Adapted linguistic resources: structure

OCR (modern form, historical variant):

...
Geist, Geyst
Geister, Geyster
Teile, Theile
gemütlich, gemüthlich
…

Information Retrieval (IR):

...
Geist → Geister, Geiste, Geistern; Geyster, Geyste, Geystern
Teil → Teile, Teils, Teilen; Theile, Theils, Theilen
gemütlich → gemütlicher, gemütlichste; gemüthlicher, gemüthlichste
...
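The IR use of the adapted lexicon can be sketched as a simple query expansion. This is only an illustration: the dictionary layout and the names `IR_LEXICON` and `expand_query` are assumptions, and the entries are just the example words from the slide, not the actual IMPACT lexicon.

```python
# Toy IR lexicon built from the example entries on the slide.
# The layout (modern + historical form lists) is an assumption for illustration.
IR_LEXICON = {
    "Geist": {
        "modern": ["Geister", "Geiste", "Geistern"],
        "historical": ["Geyster", "Geyste", "Geystern"],
    },
    "Teil": {
        "modern": ["Teile", "Teils", "Teilen"],
        "historical": ["Theile", "Theils", "Theilen"],
    },
    "gemütlich": {
        "modern": ["gemütlicher", "gemütlichste"],
        "historical": ["gemüthlicher", "gemüthlichste"],
    },
}


def expand_query(term, lexicon=IR_LEXICON):
    """Return the query term plus all modern and historical forms known for it."""
    entry = lexicon.get(term)
    if entry is None:
        # Unknown term: search for it unchanged.
        return [term]
    expanded = [term]
    for form in entry["modern"] + entry["historical"]:
        if form not in expanded:
            expanded.append(form)
    return expanded


print(expand_query("Teil"))
```

Searching for every form in the expanded list, rather than only for the term the user typed, is what increases recall on historical documents.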

Page 6: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

Linguistic Resources for Historical Texts

– Diachronic Groundtruth Corpus (1500-1950)
– Hypothetical lexicon for rule based variants
– Manually verified lexicon

Page 7: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

Linguistic Resources for Historical Texts

– Diachronic Groundtruth Corpus (1500-1950)
– Hypothetical lexicon for rule based variants
– Manually verified lexicon

Page 8: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

Diachronic Groundtruth Corpus (1500-1950)

Collection of groundtruth material from different sources on the web and from non-public electronic corpora (Institut für Deutsche Sprache Mannheim)

Large gap, especially in the 16th and 17th centuries. Together with BSB, an additional corpus was prepared from BSB documents:
– Random selection of 100 works from digitized images of the 16th and 17th centuries
– Mostly related to theology
– Latin texts excluded, no poems etc.
– Keyed by a service provider
– 1766 pages with ~ 858,000 tokens of groundtruth material

Page 9: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

Diachronic Groundtruth Corpus (1500-1950)

Gains of tokens by the extension of the corpus:

The complete corpus contains ~ 3,380,000 tokens in 500 texts from 4 centuries → basis for different analyses and lexicon building

Page 10: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

Coverage on Diachronic Corpus: modern

Less than 45% of the vocabulary is covered by modern resources before 1750. 16th century: only 15%-29% modern simple words; modern closed compounds are hardly relevant.

Types (%)

Period       Modern simple words   Modern compounds
1500-1549    15.3                  5.1
1550-1599    28.8                  6.1
1600-1649    29.2                  6.9
1650-1699    31.5                  8.6
1700-1749    38.1                  7.13
1750-1799    52.0                  15.5
1800-1849    54.7                  20.6
1850-1899    48.0                  28.1
1900-1949    60.1                  27.8

Page 11: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

Linguistic Resources for Historical Texts

– Diachronic Groundtruth Corpus (1500-1950)
– Hypothetical lexicon for rule based variants
– Manually verified lexicon

Page 12: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

Hypothetical lexicon for rule based variants

Systematic substitution rules (patterns) describe the difference between modern and historical spelling:

(modern) teil → theyl (historic), with patterns t → th, ei → ey

Based on the modern lexicon and the 140 manually collected patterns, the set of all potential rule based historical variants can be computed automatically (“hypothetical lexicon”).

Page 13: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

Hypothetical lexicon for rule based variants

Modern lexicon: Esel, Teil, …

Pattern set: e → eh, ei → ey, s → ß, l → ll, t → th

Hypothetical lexicon:
– Esel: Esel, Esell, Esehl, Esehll, Eßel, Eßell, Eßehll
– Teil: Teil, Teill, Teyl, Teyll, Tehill, Theil
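The step from modern lexicon and pattern set to hypothetical lexicon can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: words are lowercased, every pattern may optionally fire at each position where its modern side occurs, and overlapping matches are never combined. The names `PATTERNS`, `find_matches` and `generate_variants` are hypothetical.

```python
from itertools import combinations

# (modern, historic) substitution patterns taken from the slide.
PATTERNS = [("e", "eh"), ("ei", "ey"), ("s", "ß"), ("l", "ll"), ("t", "th")]


def find_matches(word, patterns):
    """List every position where the modern side of a pattern occurs in word."""
    matches = []
    for modern, historic in patterns:
        start = word.find(modern)
        while start != -1:
            matches.append((start, start + len(modern), modern, historic))
            start = word.find(modern, start + 1)
    return matches


def generate_variants(word, patterns):
    """Map each hypothetical variant of word to the patterns that produce it."""
    matches = find_matches(word, patterns)
    variants = {}
    for size in range(len(matches) + 1):
        for combo in combinations(matches, size):
            spans = sorted(combo, key=lambda m: m[0])
            # Skip combinations whose matches overlap in the modern word.
            if any(a[1] > b[0] for a, b in zip(spans, spans[1:])):
                continue
            pieces, pos = [], 0
            for start, end, _modern, historic in spans:
                pieces.append(word[pos:start])  # unchanged text before the match
                pieces.append(historic)         # historic side of the pattern
                pos = end
            pieces.append(word[pos:])
            variant = "".join(pieces)
            variants.setdefault(variant, [(m[2], m[3]) for m in spans])
    return variants


hypothetical = generate_variants("teil", PATTERNS)
print(sorted(hypothetical))
```

Because all matches are located in the modern word before anything is replaced, a pattern such as t → th cannot be applied again to its own output. The number of variants still grows quickly with word length and pattern count, so a real system with 140 patterns would likely need restrictions this sketch does not model.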

Page 14: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

Hypothetical lexicon for rule based variants

Automatic mapping from rule based historical variants to their equivalent in the modern vocabulary is possible:

historic = modern + pattern
Geyst = Geist + (ei → ey)
Theile = Teile + (t → th)

However, by no means all historical variants can be described by simple replacement rules:

historic = modern + pattern
frug = fragte + ?
Mirackel = ? + ?
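The mapping from a historical form back to a modern word can be sketched by undoing patterns and checking the result against the modern lexicon. The function name `map_to_modern`, the lowercase toy lexicon and the two-pattern set are assumptions for illustration; the slides do not specify how the actual mapping is computed.

```python
def map_to_modern(historic_form, modern_lexicon, patterns):
    """Return (modern_word, undone_patterns) candidates for a historical form.

    patterns are (modern, historic) pairs; at each position the search either
    copies one character or replaces a historic side by its modern side.
    """
    candidates = []

    def search(pos, built, applied):
        if pos == len(historic_form):
            if built in modern_lexicon:
                candidates.append((built, tuple(applied)))
            return
        # Option 1: keep the current character unchanged.
        search(pos + 1, built + historic_form[pos], applied)
        # Option 2: undo any pattern whose historic side starts here.
        for modern, historic in patterns:
            if historic_form.startswith(historic, pos):
                search(pos + len(historic), built + modern,
                       applied + [(modern, historic)])

    search(0, "", [])
    return candidates


modern_lexicon = {"geist", "teile", "fragte", "wunder"}
patterns = [("ei", "ey"), ("t", "th")]

for form in ["geyst", "theile", "frug", "mirackel"]:
    print(form, map_to_modern(form, modern_lexicon, patterns))
```

For `frug` and `mirackel` the candidate list stays empty, which corresponds to the "?" entries above: such forms cannot be reached by substitution patterns alone.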

Page 15: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

Coverage on Diachronic Corpus: hypothetic

16th century: 30% of the vocabulary is covered by the lexicon of rule based variants

Applied as OCR lexicon via the IMPACT ABBYY External Dictionary Interface: improvement of recognition rate (published 2009)

Types (%)

Period       Modern simple words   Modern compounds   Hypothetic
1500-1549    15.3                  5.1                29.5
1550-1599    28.8                  6.1                29.8
1600-1649    29.2                  6.9                27.9
1650-1699    31.5                  8.6                26.0
1700-1749    38.1                  7.13               21.9
1750-1799    52.0                  15.5               14.3
1800-1849    54.7                  20.6               8.1
1850-1899    48.0                  28.1               7.7
1900-1949    60.1                  27.8               2.0

Page 16: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

Coverage on Diachronic Corpus: missing

Especially in the 16th century: up to 46% “difficult” vocabulary. → A manually verified lexicon is necessary!

Types (%)

Period       Modern simple words   Modern compounds   Hypothetic   Missing
1500-1549    15.3                  5.1                29.5         45.9
1550-1599    28.8                  6.1                29.8         28.7
1600-1649    29.2                  6.9                27.9         29.7
1650-1699    31.5                  8.6                26.0         26.0
1700-1749    38.1                  7.13               21.9         23.5
1750-1799    52.0                  15.5               14.3         15.1
1800-1849    54.7                  20.6               8.1          13.9
1850-1899    48.0                  28.1               7.7          13.5
1900-1949    60.1                  27.8               2.0          8.1
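A type-based coverage figure of this kind could be computed roughly as sketched below. This is an assumption about the procedure, not a description of how the numbers in the table were obtained: each distinct word form (type) of a period is assigned to the first lexicon that contains it, in the order modern simple words, modern compounds, hypothetical lexicon, and anything left over counts as missing. The function name and the toy data are hypothetical.

```python
def coverage_by_category(types, modern_simple, modern_compounds, hypothetical):
    """Return the percentage of types that fall into each coverage category."""
    if not types:
        raise ValueError("need at least one type")
    counts = {"modern simple": 0, "modern compound": 0,
              "hypothetic": 0, "missing": 0}
    for word in types:
        # Assumed precedence: modern lexica first, then rule based variants.
        if word in modern_simple:
            counts["modern simple"] += 1
        elif word in modern_compounds:
            counts["modern compound"] += 1
        elif word in hypothetical:
            counts["hypothetic"] += 1
        else:
            counts["missing"] += 1
    return {name: 100.0 * n / len(types) for name, n in counts.items()}


# Toy data only; these are not corpus figures.
types = {"geist", "theil", "frug", "mirackel"}
result = coverage_by_category(
    types,
    modern_simple={"geist", "teil"},
    modern_compounds=set(),
    hypothetical={"theil", "geyst"},
)
print(result)
```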

Page 17: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

Linguistic Resources for Historical Texts

– Diachronic Groundtruth Corpus (1500-1950)
– Hypothetical lexicon for rule based variants
– Manually verified lexicon

Page 18: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

Manually verified IR-lexicon: Structure

One entry contains:
– Historical word form from the corpus
– Corresponding modern word form
– Patterns if applicable
– Corresponding modern lemma
– At least one occurrence in the corpus as an attestation for the reading

Manual assignment of modern word form and lemma
Explicit handling of variants that are not rule based
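The entry structure listed above could be modelled as a small record type. The field names, the `Attestation` layout and the example values (document identifiers, contexts) are assumptions for illustration only; the slides do not give the storage format of the lexicon.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Attestation:
    """One occurrence in the corpus that supports the reading (hypothetical layout)."""
    document_id: str
    token_index: int
    context: str


@dataclass
class LexiconEntry:
    historical_form: str
    modern_form: str
    lemma: str
    patterns: tuple = ()  # (modern, historic) pairs; empty if not rule based
    attestations: list = field(default_factory=list)

    def __post_init__(self):
        # The slide requires at least one attestation per entry.
        if not self.attestations:
            raise ValueError("an entry needs at least one attestation")


# Rule based variant: the pattern links historical and modern form.
theile = LexiconEntry(
    historical_form="Theile",
    modern_form="Teile",
    lemma="Teil",
    patterns=(("t", "th"),),
    attestations=[Attestation("doc-001", 17, "... die Theile ...")],  # made-up values
)

# Variant that is not rule based: no patterns, assignment is purely manual.
frug = LexiconEntry(
    historical_form="frug",
    modern_form="fragte",
    lemma="fragen",
    attestations=[Attestation("doc-002", 5, "... er frug ...")],  # made-up values
)

print(theile.lemma, frug.patterns)
```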

Page 19: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

Manually verified IR-lexicon: Compilation

Web-based, collaborative user interface

User support:
– For rule based variants: suggestion of the corresponding modern word form by the hypothetical lexicon
– Suggestion of all possible lemmas for the modern word form by a large modern lexicon (CISLEX)
– Concordance list of the historical variant

Page 20: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

Manually verified IR-lexicon: Status

41,600 entries have been created for 24,800 historical word forms from the diachronic corpus; 72,100 attestations were annotated.

IMPACT partners in Slovenia and Bulgaria are creating corresponding lexica with an adapted version of the tool.

Page 21: IMPACT Final Conference - Language Parallel Sessions -  Gotscharek

Thank you.