DATeCH 2014 - Cataloguing for a Billion Word Library of Greek and Latin
DESCRIPTION
Presentation of the paper "Cataloguing for a Billion Word Library of Greek and Latin" by Gregory Crane, Bridget Almas, Alison Babeu, Lisa Cerrato, Anna Krohn, Frederik Baumgardt, Monica Berti, Greta Franzini and Simona Stoyanova at DATeCH 2014. #digidays
TRANSCRIPT
![Page 1: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/1.jpg)
How do you catalog a billion word library?
Bridget Almas (1), Alison Babeu (1), Frederik Baumgardt (2), Lisa Cerrato (1), Gregory Crane (1, 2),
Greta Franzini (2), Anna Krohn (1), Simona Stoyanova (2)
1. Perseus Digital Library, Tufts University
2. Open Philology Project, University of Leipzig
![Page 2: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/2.jpg)
Major points
1. We are interested in the logical structures within and across physical books:
* Text Groups, e.g., Author Y, Papyri from X
* Works, e.g., Vergil’s Aeneid
* Individual words, e.g., Arma virumque cano
![Page 3: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/3.jpg)
Major points
2. From a pragmatic perspective, we need only one version of a logical work
(e.g., Tacitus’ Annales). We can use that marked-up version as a query that we
match against very large and very error-filled corpora.
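As a minimal sketch of using a reference text as a query against noisy data, the snippet below scores OCR-like lines against one clean reference line with Python's standard `difflib`. The function name `best_match`, the sample OCR lines, and the choice of character-level similarity are all assumptions made for illustration; the slides do not say which matching method the project uses.

```python
from difflib import SequenceMatcher


def best_match(reference: str, noisy_lines: list[str]) -> tuple[int, float]:
    """Return (index, similarity) of the noisy line closest to the reference."""
    scores = [
        SequenceMatcher(None, reference.lower(), line.lower()).ratio()
        for line in noisy_lines
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Clean reference line (the opening of the Aeneid, quoted elsewhere in the slides).
reference = "Arma virumque cano, Troiae qui primus ab oris"

# Hypothetical OCR output with typical errors (u/v variation, "m" read as "rn").
ocr_page = [
    "P. VERGILI MARONIS AENEIDOS LIBER I",
    "Arma uirumque cano, Troiae qui prirnus ab oris",
    "Italiam, fato profugus, Lauiniaque uenit",
]

index, score = best_match(reference, ocr_page)
print(index, round(score, 2))
```

The point of the sketch is that the reference line still identifies its noisy counterpart even though the two strings are not identical.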
![Page 4: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/4.jpg)
Major points
3. A text collection can serve as a catalog, with all other versions of the texts in
that collection (including translations, shorter quotations, and alternate
editions) represented as annotations on that collection.
![Page 5: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/5.jpg)
![Page 6: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/6.jpg)
![Page 7: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/7.jpg)
![Page 8: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/8.jpg)
Adding markup for a citation scheme
<div1 type="Book" n="1">
<milestone ed="p" n="1" unit="card"/>
<l n="1">Arma virumque cano, Troiae qui primus ab oris</l>
<l n="2">Italiam, fato profugus, Laviniaque venit</l>
<l n="3">litora, multum ille et terris iactatus et alto</l>
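To illustrate how this markup yields a citation scheme, the following sketch reads the fragment with Python's standard `xml.etree.ElementTree` and builds `book.line` citations. It assumes the fragment is closed with `</div1>` (the slide shows only its opening lines), and the function name `citations` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, closed with </div1> so that it parses as well-formed XML.
fragment = """
<div1 type="Book" n="1">
<milestone ed="p" n="1" unit="card"/>
<l n="1">Arma virumque cano, Troiae qui primus ab oris</l>
<l n="2">Italiam, fato profugus, Laviniaque venit</l>
<l n="3">litora, multum ille et terris iactatus et alto</l>
</div1>
"""


def citations(xml_text: str) -> dict[str, str]:
    """Map 'book.line' citations to line text for a single <div1> book."""
    root = ET.fromstring(xml_text.strip())
    book = root.get("n")
    return {
        f"{book}.{line.get('n')}": (line.text or "").strip()
        for line in root.iter("l")
    }


passages = citations(fragment)
print(passages["1.1"])
```

Each key, such as `1.2`, combines the `n` attribute of the book division with the `n` attribute of the line, which is the kind of reference a citation scheme needs.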
![Page 9: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/9.jpg)
![Page 10: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/10.jpg)
![Page 11: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/11.jpg)
![Page 12: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/12.jpg)
![Page 13: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/13.jpg)
![Page 14: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/14.jpg)
Our ability to align texts is what makes our approach possible
-- a single version of Goethe’s Faust allows us to organize
thousands of editions.
![Page 15: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/15.jpg)
Canonical Text Services URNs
urn:cts:greekLit:tlg0284.tlg052.perseus-grc1
urn:cts:latinLit:phi0474.phi052.opp-lat1
urn:cts:latinLit:stoa0255.stoa004
![Page 16: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/16.jpg)
Canonical Text Services URNs
These URNs allow us to represent any particular word in any version of any text -- they
allow us to represent our textual data (including annotations) as a very large RDF
graph.
It's not a million-book library but a billion-word data set.
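One way to picture word-level addressing is the sketch below, which appends a passage and a CTS-style subreference (`@word[n]`) to a version URN and emits plain tuples standing in for RDF triples. The URN `urn:cts:exampleLit:author1.work1.ed1`, the `ex:` predicates, and both function names are placeholders invented for this illustration; they are not identifiers or vocabulary from the project.

```python
def word_urn(version_urn: str, passage: str, word: str, occurrence: int = 1) -> str:
    """Build a word-level URN: version, then passage, then @word[occurrence]."""
    return f"{version_urn}:{passage}@{word}[{occurrence}]"


def line_triples(version_urn: str, passage: str, text: str) -> list[tuple]:
    """Emit (subject, predicate, object) tuples for every word in a passage."""
    triples = []
    seen: dict[str, int] = {}
    for position, token in enumerate(text.split(), start=1):
        word = token.strip(",.;:")
        # Count repeated words so each occurrence gets its own index.
        seen[word] = seen.get(word, 0) + 1
        subject = word_urn(version_urn, passage, word, seen[word])
        triples.append((subject, "ex:partOf", f"{version_urn}:{passage}"))
        triples.append((subject, "ex:position", position))
    return triples


# Placeholder version URN, used only to show the shape of the identifiers.
VERSION = "urn:cts:exampleLit:author1.work1.ed1"

triples = line_triples(VERSION, "1.1", "Arma virumque cano")
for triple in triples:
    print(triple)
```

Because every word occurrence has its own identifier, annotations from any source can attach to it as further triples.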
![Page 17: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/17.jpg)
Canonical Text Services URNs
urn:cts:greekLit:tlg0284.tlg052.perseus-grc1
urn:cts:latinLit:phi0474.phi052.opp-lat1
urn:cts:latinLit:stoa0255.stoa004
Canonical Text Services name space
![Page 18: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/18.jpg)
Canonical Text Services URNs
urn:cts:greekLit:tlg0284.tlg052.perseus-grc1
urn:cts:latinLit:phi0474.phi052.opp-lat1
urn:cts:latinLit:stoa0255.stoa004
Greek literature
Latin literature
![Page 19: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/19.jpg)
Canonical Text Services URNs
urn:cts:greekLit:tlg0284.tlg052.perseus-grc1
urn:cts:latinLit:phi0474.phi052.opp-lat1
urn:cts:latinLit:stoa0255.stoa004
TextGroup = tlg0284
Following the Thesaurus Linguae Graecae, we assign the number 0284 to Aelius Aristides
![Page 20: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/20.jpg)
Canonical Text Services URNs
urn:cts:greekLit:tlg0284.tlg052.perseus-grc1
urn:cts:latinLit:phi0474.phi052.opp-lat1
urn:cts:latinLit:stoa0255.stoa004
A TextGroup can define any useful collection:
* inscriptions from Ephesus
* the Homeric Hymns
![Page 21: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/21.jpg)
Canonical Text Services URNs
urn:cts:greekLit:tlg0284.tlg052.perseus-grc1
urn:cts:latinLit:phi0474.phi052.opp-lat1
urn:cts:latinLit:stoa0255.stoa004
FRBR (Functional Requirements for Bibliographic Records) Works
tlg0284.tlg052 designates the Embassy of Achilles by Aelius Aristides
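The components described on these slides (namespace, text group, work, and an optional version) can be pulled apart with a few lines of Python. This is a simplified sketch rather than a full CTS URN parser: it ignores passage references, and the names `CtsUrn` and `parse_cts_urn` are hypothetical.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str            # e.g. greekLit or latinLit
    text_group: str           # e.g. tlg0284
    work: Optional[str]       # e.g. tlg052
    version: Optional[str]    # e.g. perseus-grc1; None for a work-level URN


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into namespace, text group, work, and version."""
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace = parts[2]
    work_parts: list = parts[3].split(".")
    # Pad with None so that work-level URNs still unpack into three fields.
    work_parts += [None] * (3 - len(work_parts))
    text_group, work, version = work_parts[:3]
    return CtsUrn(namespace, text_group, work, version)


for urn in (
    "urn:cts:greekLit:tlg0284.tlg052.perseus-grc1",
    "urn:cts:latinLit:phi0474.phi052.opp-lat1",
    "urn:cts:latinLit:stoa0255.stoa004",
):
    print(parse_cts_urn(urn))
```

The third URN from the slides has no version component, so it identifies a work in the FRBR sense rather than a particular edition.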
![Page 22: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/22.jpg)
Representing different versions
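
| Citation | OCT | Relation | Loeb |
| --- | --- | --- | --- |
| 1.41 | confidere 1 | same | confidere 1 |
| 1.41 | propediem 1 | sub | prope 1 |
| 1.41 | | insert | diem 1 |
| 1.41 | ipsum 1 | same | ipsum 1 |
| 1.41 | eos 1 | same | eos 1 |
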
![Page 23: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/23.jpg)
Representing different versions
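
| Citation | OCT | Relation | Loeb |
| --- | --- | --- | --- |
| 1.41 | confidere 1 | same | confidere 1 |
| 1.41 | propediem 1 | sub | prope 1 |
| 1.41 | | insert | diem 1 |
| 1.41 | ipsum 1 | same | ipsum 1 |
| 1.41 | eos 1 | same | eos 1 |
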
We can pragmatically represent the differences between our reference text and all other versions
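A rough sketch of how such a difference table could be derived: the snippet aligns two word lists with Python's standard `difflib` and labels each row `same`, `sub`, `insert`, or `delete`. The word lists are simply the slide's rows read in order, and the function name `align` and the use of `difflib` are assumptions for illustration, not the project's actual alignment tooling.

```python
from difflib import SequenceMatcher
from itertools import zip_longest


def align(reference: list[str], other: list[str]) -> list[tuple]:
    """Return (reference_word, relation, other_word) rows for two word lists."""
    rows = []
    matcher = SequenceMatcher(None, reference, other, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Pair words inside each block; the longer side is padded with None.
        for ref_word, other_word in zip_longest(reference[i1:i2], other[j1:j2]):
            if tag == "equal":
                relation = "same"
            elif ref_word is None:
                relation = "insert"
            elif other_word is None:
                relation = "delete"
            else:
                relation = "sub"
            rows.append((ref_word, relation, other_word))
    return rows


oct_words = ["confidere", "propediem", "ipsum", "eos"]
loeb_words = ["confidere", "prope", "diem", "ipsum", "eos"]

rows = align(oct_words, loeb_words)
for row in rows:
    print(row)
```

Only the rows that are not `same` need to be stored as annotations on the reference text, which keeps the record of each additional version small.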
![Page 24: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/24.jpg)
Representing different versions
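
| Citation | OCT | Relation | Loeb |
| --- | --- | --- | --- |
| 1.41 | confidere 1 | same | confidere 1 |
| 1.41 | propediem 1 | sub | prope 1 |
| 1.41 | | insert | diem 1 |
| 1.41 | ipsum 1 | same | ipsum 1 |
| 1.41 | eos 1 | same | eos 1 |
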
The reference text does not have to be the best text -- it does not even have to be perfect. It organizes all other texts,
even with noise.
![Page 25: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/25.jpg)
Conclusions
We are developing the Perseus Corpus of Greek Texts (c. 20 million words of Greek
and Latin)
* Based on texts in Perseus
* FRBR metadata from the Perseus Catalog
* Revised XML brought in line with CTS and with the EpiDoc subset of TEI XML
* Offers an extended “TEI by example”
![Page 26: Datech2014 - Cataloguing for a Billion Word Library of Greek and Latin](https://reader034.vdocuments.site/reader034/viewer/2022051816/546fd434b4af9f0e648b4629/html5/thumbnails/26.jpg)
Conclusions
We are preparing for a Leipzig Corpus
* This would be a superset of the Perseus Corpus
* Ideally much larger
* Initial work will include an additional 20 million words of primarily later Greek
and Latin