impact final conference - language parallel sessions - gotscharek
TRANSCRIPT
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
15. 10. 2011, IMPACT Conference
Special resources to access 16th century German
Ludwig-Maximilians-Universität München
Annette Gotscharek
Special resources to access 16th century German: “access”?
OCR: Role of the lexicon: defines the set of valid words.
... Geist, Geister, Teile, gemütlich ...
Information Retrieval (IR):
Role of the lexicon: meaningful expansion of the user query to increase recall.
... Geist → Geister, Geiste, Geistern
    Teil → Teile, Teils, Teilen
    gemütlich → gemütlicher, gemütlichste ...
Special resources to access 16th century German
In IMPACT, we worked on documents from 1500-1950, but the 16th century is special:
– Language period: Early New High German (1350-1650)
– Oldest and therefore most challenging period of printed books
– Large library holdings from the 16th century at our partner library BSB
Linguistic features of historical language on word level:
                                    Historic    Modern       English
– Historical spelling variation:    geyſte      Geiste       spirit
– Historical morphology:            er frug     er fragte    he asked
– Obsolete vocabulary:              mirackel    Wunder (?)   miracle
– Obsolete character set:           aͤ           ä            …
Need adapted linguistic resources
Adapted linguistic resources: structure
OCR:
... Geist, Geister, Teile, gemütlich ...
Information Retrieval (IR):
... Geist → Geister, Geiste, Geistern
    Teil → Teile, Teils, Teilen
    gemütlich → gemütlicher, gemütlichste ...
Adapted linguistic resources: structure
OCR:
... Geist, Geyst
    Geister, Geyster
    Teile, Theile
    gemütlich, gemüthlich ...
Information Retrieval (IR):
... Geist → Geister, Geiste, Geistern, Geyster, Geyste, Geystern
    Teil → Teile, Teils, Teilen, Theile, Theils, Theilen
    gemütlich → gemütlicher, gemütlichste, gemüthlicher, gemüthlichste ...
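The query expansion shown on this slide could be sketched roughly as follows. This is only an illustration: the two dictionaries and the function name `expand_query` are assumptions made for the example, filled with the word forms from the slides, and do not describe the actual IMPACT lexicon format.

```python
# Hypothetical toy data, taken from the examples on the slides.
INFLECTIONS = {
    "Geist": ["Geister", "Geiste", "Geistern"],
    "Teil": ["Teile", "Teils", "Teilen"],
    "gemütlich": ["gemütlicher", "gemütlichste"],
}
HISTORICAL_VARIANTS = {
    "Geist": ["Geyst"],
    "Geister": ["Geyster"],
    "Geiste": ["Geyste"],
    "Geistern": ["Geystern"],
    "Teil": ["Theil"],
    "Teile": ["Theile"],
    "Teils": ["Theils"],
    "Teilen": ["Theilen"],
    "gemütlich": ["gemüthlich"],
    "gemütlicher": ["gemüthlicher"],
    "gemütlichste": ["gemüthlichste"],
}


def expand_query(term):
    """Expand a modern query term to its inflected forms and their historical spellings."""
    expanded = []
    for form in [term] + INFLECTIONS.get(term, []):
        expanded.append(form)
        expanded.extend(HISTORICAL_VARIANTS.get(form, []))
    return expanded


print(expand_query("Geist"))
```

A term that is missing from both dictionaries is returned unchanged, so the expansion never reduces recall compared to the plain query.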
Linguistic Resources for Historical Texts
– Diachronic Groundtruth Corpus (1500-1950)
– Hypothetical lexicon for rule based variants
– Manually verified lexicon
Diachronic Groundtruth Corpus (1500-1950)
Collection of groundtruth material from different sources in the web and non-public electronic corpora (Institut für Deutsche Sprache Mannheim)
Large gap especially in the 16th / 17th century. With BSB: preparation of an additional corpus from BSB documents:
– Random selection of 100 works from digitized images of the 16th and 17th century
– Mostly related to theology
– Latin texts excluded, no poems etc.
– Keyed by a service provider
– 1,766 pages with ~858,000 tokens of groundtruth material
Diachronic Groundtruth Corpus (1500-1950)
Gains of tokens by the extension of the corpus: [chart not reproduced in the transcript]
The complete corpus contains ~3,380,000 tokens in 500 texts from 4 centuries: the basis for different analyses and lexicon building.
Coverage on Diachronic Corpus: modern
Less than 45% of the vocabulary is covered by modern resources before 1750.
16th century: only 15% - 29% modern simple words; modern closed compounds are hardly relevant.
Types (%) by period:
Period      Modern simple words   Modern compounds
1500-1549   15.3                  5.1
1550-1599   28.8                  6.1
1600-1649   29.2                  6.9
1650-1699   31.5                  8.6
1700-1749   38.1                  7.13
1750-1799   52.0                  15.5
1800-1849   54.7                  20.6
1850-1899   48.0                  28.1
1900-1949   60.1                  27.8
Hypothetical lexicon for rule based variants
Systematic substitution rules (patterns) describe the difference between modern and historical spelling:
teil (modern) → theyl (historic), with the patterns t → th, ei → ey
Based on the modern lexicon and the 140 manually collected patterns, the set of all potential rule-based historical variants can be computed automatically (“hypothetical lexicon”).
Hypothetical lexicon for rule based variants
Modern lexicon: …, Esel, …, Teil, …
Pattern set: …, e → eh, ei → ey, s → ß, l → ll, t → th, …
Hypothetical lexicon:
…, Esel, Esell, Esehl, Esehll, Eßel, Eßell, Eßehll, …
…, Teil, Teill, Teyl, Teyll, Tehill, Theil, …
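The construction on this slide (modern lexicon plus pattern set yields the hypothetical lexicon) can be sketched as below. This is a minimal illustration under stated assumptions, not the IMPACT implementation: words are lowercased, only the five patterns shown on the slide are used instead of the full set of 140, each pattern may be applied at any non-overlapping subset of its match positions, and the function names are invented for the example.

```python
# Assumption: only the five patterns visible on the slide, as (modern, historic) pairs.
PATTERNS = [("e", "eh"), ("ei", "ey"), ("s", "ß"), ("l", "ll"), ("t", "th")]


def find_matches(word, patterns):
    """List (start, modern, historic) for every place a pattern's modern side occurs."""
    matches = []
    for modern, historic in patterns:
        start = word.find(modern)
        while start != -1:
            matches.append((start, modern, historic))
            start = word.find(modern, start + 1)
    return sorted(matches)


def variants(word, patterns):
    """Map each rule-based variant of `word` to the patterns that produced it."""
    matches = find_matches(word, patterns)
    results = {}

    def extend(index, pos, prefix, used):
        # index: next match to consider; pos: first character not yet consumed.
        if index == len(matches):
            results.setdefault(prefix + word[pos:], used)
            return
        start, modern, historic = matches[index]
        extend(index + 1, pos, prefix, used)  # leave this match unchanged
        if start >= pos:  # rewrite it only if it does not overlap an earlier rewrite
            extend(index + 1, start + len(modern),
                   prefix + word[pos:start] + historic,
                   used + ((modern, historic),))

    extend(0, 0, "", ())
    results.pop(word, None)  # the unchanged modern word is not a variant
    return results


def build_hypothetical_lexicon(modern_lexicon, patterns):
    """Map every potential historical variant to its (modern word, patterns) sources."""
    lexicon = {}
    for word in modern_lexicon:
        for variant, used in variants(word, patterns).items():
            lexicon.setdefault(variant, []).append((word, used))
    return lexicon


hypothetical = build_hypothetical_lexicon(["esel", "teil"], PATTERNS)
print(sorted(hypothetical))
```

Because every subset of matches is tried, the number of variants grows exponentially with the number of matches in a word, which is one reason the result is only a lexicon of potential forms.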
Hypothetical lexicon for rule based variants
Automatic mapping from rule-based historical variants to their equivalent in the modern vocabulary is possible:
historic   modern
Geyst    = Geist + (ei → ey)
Theile   = Teile + (t → th)
By far not all historical variants can be described by simple replacement rules:
historic   modern
frug     = fragte + ?
Mirackel = ? + ?
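The mapping from a historical form back to a modern one can be illustrated by undoing patterns until a modern lexicon entry is reached. The sketch below is an assumption-laden example rather than the project's method: it works on lowercased forms, uses a tiny hypothetical modern lexicon and two patterns from the slides, and searches breadth-first so that the result with the fewest undone patterns is found first.

```python
from collections import deque

# Assumption: toy data only; (modern, historic) pairs as on the slides.
PATTERNS = [("ei", "ey"), ("t", "th")]
MODERN_LEXICON = {"geist", "teile", "fragte", "wunder"}


def to_modern(historic_form, modern_lexicon, patterns):
    """Return (modern form, undone patterns), or None if no rule-based mapping exists."""
    queue = deque([(historic_form, ())])
    seen = {historic_form}
    while queue:
        form, used = queue.popleft()
        if form in modern_lexicon:
            return form, used
        for modern, historic in patterns:
            start = form.find(historic)
            while start != -1:
                # Undo the pattern at this single position.
                candidate = form[:start] + modern + form[start + len(historic):]
                if candidate not in seen:
                    seen.add(candidate)
                    queue.append((candidate, used + ((modern, historic),)))
                start = form.find(historic, start + 1)
    return None


for word in ["geyst", "theile", "frug", "mirackel"]:
    print(word, to_modern(word, MODERN_LEXICON, PATTERNS))
```

Forms such as frug and mirackel come back as `None`: no sequence of replacement rules leads to a modern entry, which is the case the manually verified lexicon has to handle.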
Coverage on Diachronic Corpus: hypothetic
16th century: 30% of the vocabulary is covered by the lexicon of rule-based variants.
Applied as OCR lexicon via the IMPACT Abbyy External Dictionary Interface: improvement of recognition rate (published 2009).
Types (%) by period:
Period      Modern simple words   Modern compounds   Hypothetic
1500-1549   15.3                  5.1                29.5
1550-1599   28.8                  6.1                29.8
1600-1649   29.2                  6.9                27.9
1650-1699   31.5                  8.6                26.0
1700-1749   38.1                  7.13               21.9
1750-1799   52.0                  15.5               14.3
1800-1849   54.7                  20.6               8.1
1850-1899   48.0                  28.1               7.7
1900-1949   60.1                  27.8               2.0
Coverage on Diachronic Corpus: missing
Especially in the 16th century: up to 46% “difficult” vocabulary. A manually verified lexicon is necessary!
Types (%) by period:
Period      Modern simple words   Modern compounds   Hypothetic   Missing
1500-1549   15.3                  5.1                29.5         45.9
1550-1599   28.8                  6.1                29.8         28.7
1600-1649   29.2                  6.9                27.9         29.7
1650-1699   31.5                  8.6                26.0         26.0
1700-1749   38.1                  7.13               21.9         23.5
1750-1799   52.0                  15.5               14.3         15.1
1800-1849   54.7                  20.6               8.1          13.9
1850-1899   48.0                  28.1               7.7          13.5
1900-1949   60.1                  27.8               2.0          8.1
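Coverage figures of this kind are counted over types (distinct word forms), not tokens. The following sketch shows one way such percentages could be computed; it is an assumption for illustration only, since the talk does not describe the exact procedure, and it leaves out the compound category because that would need a compound analyser. The function name and the toy data are hypothetical.

```python
def coverage(types, modern_lexicon, hypothetical_lexicon):
    """Return the share of types (in %) found in each lexicon, or in none."""
    counts = {"modern": 0, "hypothetic": 0, "missing": 0}
    for word in types:
        if word in modern_lexicon:
            counts["modern"] += 1
        elif word in hypothetical_lexicon:
            counts["hypothetic"] += 1
        else:
            counts["missing"] += 1
    total = len(types)
    if total == 0:
        return {category: 0.0 for category in counts}
    return {category: 100.0 * n / total for category, n in counts.items()}


# Hypothetical toy slice of a corpus: four distinct word forms.
types = {"geist", "geyst", "theile", "frug"}
print(coverage(types, {"geist", "teile"}, {"geyst", "theile"}))
```

Each type is assigned to the first lexicon that contains it, so a form listed in both the modern and the hypothetical lexicon counts as modern.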
Manually verified IR-lexicon: Structure
One entry contains:
– Historical word form from the corpus
– Corresponding modern word form
– Patterns, if applicable
– Corresponding modern lemma
– At least one occurrence in the corpus as an attestation for the reading
Manual assignment of modern word form and lemma
Explicit handling of variants that are not rule-based
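The entry structure listed above could be modelled roughly like this. The class and field names are assumptions chosen for the example and do not reflect the actual storage format of the lexicon; the attestation strings are placeholders, not real corpus references.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LexiconEntry:
    historical_form: str      # word form as found in the corpus
    modern_form: str          # corresponding modern word form
    lemma: str                # corresponding modern lemma
    attestations: List[str]   # references to occurrences in the corpus
    patterns: List[Tuple[str, str]] = field(default_factory=list)  # (modern, historic)

    def __post_init__(self):
        # Every entry needs at least one attestation for the reading.
        if not self.attestations:
            raise ValueError("an entry needs at least one attestation from the corpus")

    @property
    def is_rule_based(self):
        """True if the variant can be described by substitution patterns."""
        return bool(self.patterns)


# Placeholder attestations; the word forms are the examples from the slides.
rule_based = LexiconEntry("Theile", "Teile", "Teil",
                          ["<hypothetical corpus reference>"], [("t", "th")])
irregular = LexiconEntry("frug", "fragte", "fragen",
                         ["<hypothetical corpus reference>"])
print(rule_based.is_rule_based, irregular.is_rule_based)
```

An entry with an empty pattern list, such as frug, is the explicit representation of a variant that is not rule-based.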
Manually verified IR-lexicon: Compilation
Web-based, collaborative user interface
User support:
– For rule-based variants: suggestion of the corresponding modern word form by the hypothetic lexicon
– Suggestion of all possible lemmas for the modern word form by a large modern lexicon (CISLEX)
– Concordance list of the historical variant
Manually verified IR-lexicon: Status
41,600 entries have been created for 24,800 historical word forms from the diachronic corpus; 72,100 attestations were annotated.
IMPACT partners in Slovenia and Bulgaria are creating corresponding lexica with an adapted version of the tool.
Thank you.