Semantic Blumenbach: Digital Library & Virtual Museum



<term xml:lang='la' sortKey='Hystrix'>Hystrix</term></hi> </hi> </hi>. <term xml:lang='de' sortKey='Stachelschwein'> Stachelschwein</term>. (Fr. <hi rendition="#i"> <hi rendition="#r"> <term xml:lang='fr' sortKey=' porc-epic '> porc-<lb type="inWord"/>epic</term></hi> </hi>. Engl. <hi rendition="#i"> <hi rendition="#r"> <term xml:lang='fr' sortKey=' porc-epic '> porcupine</term></hi> </hi>.) …..
<p rendition="#l2em">v.<persName ref=''> Schreber </persName><hi rendition="#r">tab</hi>. 169.</p>
<p rendition="#l1em">In<placeName ref='#GettyId:'> Canada</placeName>, auf Labrador, um die Hudsons-<lb type="inWord"/>bay etc. Thut zumahl im Winter den jungen<lb/>Baumstämmen großen Schaden.</p>
<p rendition="#indent-2">2. <hi rendition="#i"> <hi rendition="#r"><term xml:lang='la' sortKey='Hystrix Cristata'>Cristata</term></hi> </hi>. <hi rendition="#r">H. spinis longissimis, capite cri-<lb type="inWord"/>stato, cauda abbreuiata</hi>.</p>

Contact: Dr. Jörg Wettlaufer, Akademie der Wissenschaften zu Göttingen (ADWG), Göttingen Centre for Digital Humanities (GCDH), Tel.: +49 (0)551 39 20477, [email protected], www.gcdh.de | www.digihum.de

Semantic Blumenbach

II. Named Entity Recognition on multilingual historical TEI encoded texts

Corpus: 12 editions of the “Handbuch der Naturgeschichte”, published between 1779 and 1830.

Named Entity Recognition (NER) is run on the TEI P5 Tite encoded full texts to find places, persons and objects from the natural history domain.

The texts are multilingual and show the irregular orthography of the second half of the 18th century. A list-based parser nevertheless reaches precision and recall above 90%.

Per text, the parser identifies approx. 10,000 terms as well as approx. 1,000 persons, 1,200 places and 1,300 references to the collection database.
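The list-based approach and its evaluation can be illustrated with a minimal sketch. It is not the project's parser: the word list, the normalisation rule (lower-casing and stripping diacritics) and all function names are assumptions chosen for illustration only.

```python
import re
import unicodedata

# Hypothetical word list: normalised surface form -> TEI element name.
GAZETTEER = {
    "canada": "placeName",
    "labrador": "placeName",
    "schreber": "persName",
    "stachelschwein": "term",
}


def normalise(token: str) -> str:
    """Lower-case and strip diacritics so spelling variants share one key."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tag(text: str) -> list[tuple[str, str]]:
    """Return (token, element) pairs for every token found in the word list."""
    hits = []
    for token in re.findall(r"\w+", text):
        element = GAZETTEER.get(normalise(token))
        if element is not None:
            hits.append((token, element))
    return hits


def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    """Precision = TP / predicted, recall = TP / gold (0.0 for empty sets)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Comparing the set of tagged strings against a manually annotated gold set with `precision_recall` gives the two figures reported above; a real word list would also need multi-word entries and per-language variants.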

References to collection database via:

<term xml:lang="de" sortKey="Holz"><rs type="Palaeobotanik" ref="101 113 194 195 196 209 313 409 440 642">Holz</rs></term>
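As a sketch of how such an annotation could be read back, the following uses Python's standard `xml.etree.ElementTree` to extract the space-separated IDs from the `ref` attribute. The function name and the returned dictionary layout are assumptions; in a complete TEI file the elements would normally sit in the TEI namespace, which this fragment-level example ignores.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = (
    '<term xml:lang="de" sortKey="Holz">'
    '<rs type="Palaeobotanik" ref="101 113 194 195 196 209 313 409 440 642">Holz</rs>'
    "</term>"
)


def collection_refs(term_xml: str) -> dict:
    """Extract language, sort key, rs type and the list of referenced IDs."""
    term = ET.fromstring(term_xml)
    rs = term.find("rs")  # no namespace in this fragment (assumption)
    return {
        "lang": term.get(XML_LANG),
        "sort_key": term.get("sortKey"),
        "type": rs.get("type") if rs is not None else None,
        # IDs are kept as strings; whether they are always numeric is not stated.
        "ids": rs.get("ref", "").split() if rs is not None else [],
    }


if __name__ == "__main__":
    print(collection_refs(SAMPLE))
```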

Scientific Communication Infrastructure


[Figure: example of a text-object relationship. Node labels: Object, Object Metadata, Text (Metadata). Relation labels: "is author of", "has collected".]

Link to catalog: http://resolver.sub.uni-goettingen.de/purl?PPN625161807_0009

http://books.google.de/books?id=fnfwrkZjm9kC

D. Joh. Fr. Blumenbach's … Handbuch der Naturgeschichte : nebst zwey Kupfertafeln. – Sechste Auflage. – Göttingen : Bey Johann Christian Dieterich, 1799.

Abbildungen naturhistorischer Gegenstände 9 (81; 1809): Taf. 81

Digital Humanities Research Collaboration – Lower Saxony

I. Introduction

Blumenbach-online, a project of the Göttingen Academy of Sciences and Humanities, started in January 2010. It aims to digitize and present online the writings and collections of the influential Göttingen physician and naturalist Johann Friedrich Blumenbach (1752-1840), one of the founding fathers of physical anthropology. To date, almost half of the textual material (77,000 pages altogether) and roughly a quarter of the collections have been digitized and converted into TEI-encoded texts or entered into a database. By applying Semantic Web technologies in a spin-off project called "Semantic Blumenbach", we hope to explore text-object relationships and establish methods for presenting and providing linked data for Blumenbach-online.


Academy of Sciences and Humanities, Göttingen, Germany

Martin Scholz, Diplom-Informatiker, Friedrich-Alexander-Universität Erlangen-Nürnberg, Department Informatik, AG Digital Humanities, Tel: +49 9131 85 29094, martin.scholz@cs.fau.de

[Figure labels: Pictures (Metadata); Text; „semblu“ Semantic Blumenbach OWL]

III. WissKI

WissKI is a project of the Artificial Intelligence Chair at the Department of Computer Science of the Friedrich-Alexander-University of Erlangen-Nuremberg (AI) and the Department of Museum Informatics at the Germanisches Nationalmuseum (GNM) in Nuremberg.

The goal of the WissKI project is to apply the concept of wikis to the scientific domain and to support transdisciplinary collaboration between scientists and researchers from various domains.

WissKI consists of Drupal 6 based modules and uses the ECRM as its top ontology.

VI. Conclusion

With the additional modules developed for Semantic Blumenbach, WissKI provides a powerful and attractive environment for connecting both texts and objects in RDF within the CRM-model.

Ingest of already annotated TEI-P5 texts with several hundred pages is now possible.

WissKI provides a SPARQL Endpoint and supports LIDO as exchange format.

Downside: modelling data in CIDOC-CRM has a steep learning curve.
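To give an idea of how the SPARQL endpoint mentioned above might be used, the sketch below builds a query and the corresponding GET request URL as defined by the standard SPARQL protocol (`query` parameter). Everything specific is an assumption: the endpoint URL is a placeholder, the ECRM namespace may differ by version, and how the Semantic Blumenbach data actually store the identifier value (here: `P3_has_note`) is not documented on the poster.

```python
from urllib.parse import urlencode

ENDPOINT = "http://example.org/wisski/sparql"  # hypothetical endpoint URL
ECRM = "http://erlangen-crm.org/current/"      # assumed ECRM namespace


def shared_identifier_query(identifier: str) -> str:
    """Build a query for all resources identified by the given ID string."""
    if '"' in identifier or "\\" in identifier:
        raise ValueError("identifier must not contain quotes or backslashes")
    return (
        f"PREFIX ecrm: <{ECRM}>\n"
        "SELECT ?thing WHERE {\n"
        "  ?thing ecrm:P1_is_identified_by ?id .\n"
        # Assumption: the ID value is attached to the identifier node as a note.
        f'  ?id ecrm:P3_has_note "{identifier}" .\n'
        "}"
    )


def request_url(identifier: str) -> str:
    """Return the GET URL that sends the query to the endpoint."""
    return ENDPOINT + "?" + urlencode({"query": shared_identifier_query(identifier)})


if __name__ == "__main__":
    print(request_url("101"))
```

The URL can then be fetched with any HTTP client; the query text would need to be adapted to the paths actually defined in the application ontology.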

Mark Fichtner, Diplom-Informatiker, Germanisches Nationalmuseum, Referat für Museums- und Kulturinformatik, 90402 Nürnberg, Tel. +49 911 1331-264, [email protected]

Exploration of text-object relationships with semantic web technologies in the history of science

Sree Ganesh Thotempudi, Digital Humanities Research Collaboration – Lower Saxony, Göttingen Centre for Digital Humanities (GCDH), Tel.: +49 (0)551 39 20479, [email protected], www.gcdh.de

[Figure: Results for 12 editions of the same book. Absolute numbers of tagged strings in the different editions of Blumenbach's natural history manual, 1779-1830. Vertical axis: 0 to 10,000. Series: <personName>, <placeName>, <term>, <rs>. Annotation: "Optimization for this edition".]

V. Connecting Texts and Objects

Text (terms with reference strings) and objects are connected via the unique IDs of the collection database (semblu:E42_KerndatenID).

TEI-encoded books of several hundred pages can now be ingested with the newly added WissKI modules texttei and book_import, including triplification of the entities in the text with an XSLT stylesheet by Martin Scholz.

Data from a relational database (RDB) are ingested and disambiguated via an ODBC connector to WissKI by Mark Fichtner.
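The idea of triplification can be sketched as follows. The project uses an XSLT stylesheet for this step; the Python version below is only an illustration of the kind of output one might expect, written as N-Triples. The `semblu` namespace URI and the URI patterns for terms, objects and identifiers are hypothetical, the ECRM namespace is assumed, and the choice of `P67_refers_to` and `P1_is_identified_by` is a plausible CIDOC-CRM modelling, not the documented one.

```python
from urllib.parse import quote

SEMBLU = "http://example.org/semblu/"          # hypothetical namespace URI
ECRM = "http://erlangen-crm.org/current/"      # assumed ECRM namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def triples_for_term(surface: str, object_ids: list[str]) -> list[str]:
    """Emit N-Triples linking one tagged term to the objects it references."""
    term = f"<{SEMBLU}term/{quote(surface)}>"
    triples = [f"{term} <{RDF_TYPE}> <{ECRM}E33_Linguistic_Object> ."]
    for object_id in object_ids:
        obj = f"<{SEMBLU}object/{quote(object_id)}>"
        ident = f"<{SEMBLU}identifier/{quote(object_id)}>"
        triples += [
            f"{term} <{ECRM}P67_refers_to> {obj} .",
            f"{obj} <{ECRM}P1_is_identified_by> {ident} .",
            # Class name taken from the poster; its namespace is assumed.
            f"{ident} <{RDF_TYPE}> <{SEMBLU}E42_KerndatenID> .",
        ]
    return triples


if __name__ == "__main__":
    print("\n".join(triples_for_term("Holz", ["101", "113"])))
```

Because term and museum object both point to the same identifier node, a triple store can join them without any string matching on labels.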

visit us at: wiss-ki.eu & www.blumenbach-online.de & dhfv-ent2.gcdh.de/blumenbach/wisski/

IV. Modelling the data in ECRM

Erlangen CRM (ECRM) is an OWL-DL version of the CIDOC Conceptual Reference Model (CRM). It serves as a top ontology on which application ontologies (here: semblu) are built.

The pathbuilder module of WissKI is used for internal display and for the import of data. Because groups are defined for linguistic objects and for museum objects, the system can disambiguate incoming data automatically and thus connect objects that use the same identifier.
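The effect of connecting records through a shared identifier can be shown with a small, self-contained sketch. It does not reproduce the WissKI pathbuilder; the record layout (`term`, `refs`, `id`, `label`) and the function name are assumptions, and the sample labels are placeholders.

```python
from collections import defaultdict


def link_by_identifier(linguistic: list[dict], museum: list[dict]) -> dict:
    """Group terms and museum objects under the identifier they share."""
    index = defaultdict(lambda: {"terms": [], "objects": []})
    for record in linguistic:
        for identifier in record["refs"]:
            index[identifier]["terms"].append(record["term"])
    for record in museum:
        index[record["id"]]["objects"].append(record["label"])
    # Keep only identifiers that occur on both sides, i.e. actual links.
    return {
        identifier: dict(group)
        for identifier, group in index.items()
        if group["terms"] and group["objects"]
    }


if __name__ == "__main__":
    terms = [{"term": "Holz", "refs": ["101", "113"]}]
    objects = [{"id": "101", "label": "object 101"}]  # placeholder label
    print(link_by_identifier(terms, objects))
```

Identifiers that appear only in the text or only in the collection database are dropped here; in practice they would be worth reporting, since they point to missing or mistyped references.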

The WissKI system offers good support for local and global authorities.

[Figures: graph view of the text-object relationship; Pathbuilder; basic model of the text-object relationship]