Identifying Quotations in Reference Works and Primary Materials1
Andrea Ernst-Gerlach1, Gregory Crane2
1 University of Duisburg-Essen, Department of Computational and Cognitive Sciences, Lotharstr. 65, 47048 Duisburg, Germany
[email protected]
2 Tufts University, Perseus Digital Library, Eaton 124, Medford MA, 02155
[email protected]
Abstract. Identifying quotations from reference works in primary materials is an important feature for digital libraries. By adding corresponding citation links to the original text, we can help contextualize the source material. In this paper we introduce an algorithm for identifying citations automatically, based on an analysis of the structure of quotations from three different reference works of Latin texts. An evaluation shows that this approach is capable of finding a large number of quotations with which no machine-actionable citations are associated. Additionally, this approach can be applied to quotations that have been altered in a range of ways from their source.
Keywords: citations, reference works
1 Introduction
The Open Content Alliance, Google Books and other projects have begun creating much larger, more heterogeneous, and unstructured collections of text than were available in the first generation of digital corpora [1, 2]. Many of these documents contain quotations with no citation information, or with citation information that citation trackers cannot recognize. The connection to the primary materials is, however, essential to interpreting articles in secondary works, since citations reflect only a part of the original document. It is therefore very important to get access to the context of a quote in order to classify it correctly. Documents with quotes often follow different quotation schemes, especially when they were written at a time when no quotation standard existed. Beside the common literal direct citation type, other types include citations with a changed word order and the omission of terms; term differences such as spelling errors also have to be considered. The paper has the following structure: In Section 2, we give a survey of related work. The following section describes the results of a manual analysis of quotes. Our approach for the automatic identification of citations is presented in Section 4. The approach is evaluated in Section 5, and the last section concludes the paper and gives an outlook on future work.
1 This work was supported by a grant from the Mellon Foundation.
2 Related Work
To date there has been little research conducted in the area of automatic quotation identification within historical reference materials and their related primary sources. The Google Books project has begun work in this area through its “popular passages” feature, where on each “About this Book” page the user can find a list of frequently quoted passages from a given book, with the number of times each has been cited in different books. In addition to these initial efforts by Google, there are several areas of related research that are similar to the work presented in this paper. Kinable [3] has explored the variations in how source references are listed in a historical dictionary and proposed using regular expressions to standardize them. This work sought to regularize source references as a means of improving information retrieval in a single electronic dictionary rather than exploring means of automatically identifying quotations from a variety of primary texts. Pouliquen et al. have conducted research into automatic quotation detection, but their work focused on identifying direct speech quotations in the domain of international news articles that appear on the internet [4]. Since our algorithm looks for highly similar text between two different documents as a means of finding quotations, some of the most similar research work can be found in the area of automatic plagiarism detection, a recent overview of which can be found in [5]. Early influential work in the area of duplicate detection in digital documents was reported by Brin et al. [6], while a more recent overview of techniques for finding co-derivative digital documents has been provided by [7]. Perhaps the most closely related work is that of Bia et al. [8], who reported on how they adapted software typically used for plagiarism and copy-detection in order to compare the text of different historical editions of Don Quixote.
A great deal of similar research has also been conducted into text similarity searching, or finding the most effective means of supporting search for highly similar or identical text in different documents. Stein and Meyer zu Eissen introduce the idea of near-similarity search to find plagiarized documents in a large document corpus [9]. Recent studies by Metzler et al. have explored the efficacy of different document representations and similarity measures for finding the similarity between short segments of text (such as web queries) rather than entire documents [10, 11]. Similarly, studies of text reuse, or how text has been borrowed from one document for use in another, such as through paraphrasing or indirect reference, are also related to the work discussed here. John Lee has developed a computational model of text reuse that is specifically designed for classical texts [12]. His model accounts for both surface sentence similarity and a variety of syntactic and semantic features that help to measure “account source alteration patterns,” and demonstrates how this model can be used to measure text reuse in the Synoptic Gospels of the Greek New Testament. Automatic allusion detection is another closely related area of work, although little research has been conducted in this area as well. Takeda et al. [13] have developed a uniform framework based on a variety of string-similarity measures in order to semi-automatically find similar Japanese poems in various anthologies. Earlier work by these same authors [14] examines how different algorithms can be used to find similar fragmentary patterns in different classical Japanese poems.
3 Analyses of the Citations
Quotations are, in practice, often not exact. In some cases, our quotations are based on different editions of a text than those to which we have electronic access, and we find occasional variations that reflect different versions of the text. We also found, however, that some quotations – especially in reference works such as lexica and grammars – deliberately modify the quoted text: the goal in such cases is not to replicate the original text but to illustrate a point about lexicography, grammar, or some other topic. We therefore conducted a manual analysis of quotations in order to discover particular problems for the identification of citations. A manual analysis of quotations from the Lewis and Short Latin-English Lexicon2, Allen and Greenough’s Latin Grammar3 and Anne Mahoney’s Overview of Latin Syntax4 identified different groups of problems. These general groups can be divided into differences within a single term and differences within phrases. The differences within a phrase can be divided into omission, insertion and substitution of terms.
2 http://www.perseus.tufts.edu/hopper/text.jsp?doc=Perseus:text:1999.04.0059:entry=a%5Eb
3 http://www.perseus.tufts.edu/cgi-bin/ptext?doc=Perseus%3Atext%3A1999.04.0001
4 http://www.perseus.tufts.edu/cgi-bin/ptext?doc=Perseus:text:1999.04.0022
The differences can also be classified into regular and irregular differences. The regular differences can be handled by general methods. In contrast the irregular differences must be solved by a more specialized (and less precise) approach.
Another distinction is the length of the citation. In one of the sources that we examined, the Lewis and Short Latin Dictionary, there are 223,223 tagged quotations of Latin. Of these, 146,288 (65.5%) consist of three or more words, and 176,790 (79%) consist of two or more words. The remaining 46,433 (21%) are, however, single word quotations and thus inherently difficult to map uniquely. For single word quotations, we need to use the relative frequency of the term in the overall corpus to estimate the value of a particular match between a quoted word and the potential source text. In one set of commentaries, Greek and Latin, more than half of the quotations contain a single word (12,499 of 23,211, 54%), 24% (5,685) contain two, and only 22% (5,027) contain more than two words. In contexts such as these, however, the quoting document is focusing upon a particular text and we can use this information to help match single words to their source.
3.1 Regular text differences
For the regular text differences we found three types: case differences, accented characters and changing punctuation. Case differences often occur because the quote is embedded in a different sentence structure. Accented characters often change to the corresponding unaccented characters. The punctuation sometimes changes (e.g. from a dot to a comma) and sometimes is simply left out.
3.2 Irregular text differences
We discovered four different classes of irregular text differences: spelling errors, data entry errors, inflections and merging/splitting of terms. The spelling errors (e.g. conlocuti → collocuti) can be divided into normal spelling mistakes and the transfer of a term from a historic spelling into a contemporary spelling. Since the paper versions of a reference work and the corresponding primary work are often available in different qualities and may pass through different digitization processes, we might get incompatible quotes due to OCR errors (e.g. eam → earn). The quotes are often integrated into the text flow of the reference work and thus are grammatically adapted to the sentence structure of the secondary text. This can lead to changing inflections (e.g. accusative fugam → nominative fuga). We also find two terms that become one in the quote, with small differences based on different spelling conventions (e.g. enim vero → enimvero).
3.3 Omission
Another variation between the quotations in secondary works and the text in original documents is the omission of a term or terms. While three dots will often signal the fact that a quotation omits text (e.g. Stratum, Naupactum ... fateris ab hostibus esse captas → Stratum, Naupactum, ut modo tute indicasti, nobilis urbis atque plenas, fateris ab hostibus esse captas), we also found the unmarked omission of complete subordinate clauses and the omission of a few terms in general (e.g. caesar suas copias subducit → Caesar suas copias in proximum collem subducit). Another type of omission that we encountered was longer omissions after names, particularly personal names, place names and names of ethnic groups. In the quotes found in secondary works, the name typically appears immediately in front of the direct quote, whereas it appears much earlier in the original text (e.g. Caesar maturat ab urbe proficisci → Caesari cum id nuntiatum esset, eos per provinciam nostram iter facere conari, maturat ab urbe proficisci). The name can sometimes even be found in a different paragraph.
3.4 Insertion
First, we find text inserted between the reference to a citation or to an author and a quotation: “consequently abs te appears but rarely in later authors, as in Liv. 10, 19, 8; 26, 15, 12; and who, perhaps, also used abs conscendentibus,”
The next type is the insertion of explanatory material in the quote. These can often be found within brackets (e.g. gallia est omnis divisa in partes tres, quarum [= partium] unam incolunt belgae, ... → Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae,). The last kind of insertion is the naming of the author or at least an abbreviation for the author’s name at the end of the quote and a cross reference to the source text.
3.5 Substitution
We find three categories of substitution. A changed word order is often used in order to integrate the quote into the text flow. Secondly, numbers can appear in different forms in quotation and source: the word is frequently used in the quote instead of the digit in the original text (e.g. quinque → v). Finally, at times the quantity differs between the citation and the primary material (e.g. duo mila → mille).
4 Automatic Citation Identification
As a starting point for automatic citation identification we have, on the one hand, the documents with the citations and, on the other hand, the corresponding original documents. We assume that both kinds of documents have no XML markup and that we know the language of the cited documents, so that we can use the correct term frequencies for the documents. Additionally, we have lists of names for people, ethnic groups, places and authors, as well as a list mapping different expressions for numbers onto each other.
For the automatic citation identification we use a four step approach: The first step normalizes both documents. In the next step we identify possible candidates for a quote. In the following step we search in the neighborhood of these candidates, to see if we can find further indicators for citations. In the final step, the unnecessary candidates are sorted out during a pruning process.
4.1 Normalization
Before we start with the text comparison we normalize both texts. In order to take the case insensitivity and differences in punctuation into account we ignore the case as well as punctuation marks. The accented characters are transformed into the corresponding unaccented character. These text modifications produce normalized texts that ignore the regular text differences.
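As a minimal sketch, these normalization rules can be implemented with the Python standard library. The function name normalize and the use of string.punctuation as the punctuation set are our own assumptions; the paper does not specify an implementation:

```python
import string
import unicodedata

def normalize(text: str) -> str:
    """Apply the regular-difference rules: accent stripping,
    punctuation removal, and case folding (hypothetical helper)."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove punctuation, then fold case.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.lower()

print(normalize("Gallia est omnis dīvīsa, in partēs trēs."))
# → "gallia est omnis divisa in partes tres"
```

Applying the same function to both the secondary and the primary text ensures that the regular text differences of Section 3.1 no longer prevent a match.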
4.2 Finding candidates for citations
In order to find citation candidates we use overlapping windows for searching the terms. We search for each term from the secondary text in the primary text. For each position where a certain term is found in the primary text, the following five terms from the secondary text are also searched within a radius of five terms around the position of the first term found in the primary text. If at least one other term is found, we assume that the passage could possibly be a citation and thus search for more terms using the fuzzy search techniques described in the following section. Since we search for the terms in regions around the first term of the possible candidate, we have already taken into account that the terms may not be in the correct order. It therefore does not matter if the term position is changed through an omission, an insertion or a substitution of terms.
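The candidate step can be sketched as follows. The inverted position index and the (start, position, matched terms) triples are illustrative choices of ours, not the authors' implementation; here only exact matches are checked, with fuzzy search left to the next step:

```python
def find_candidates(secondary, primary, lookahead=5, radius=5):
    """For every secondary-text term found in the primary text, check
    whether any of the next `lookahead` secondary terms also occurs
    within `radius` positions around the hit. Terms are matched as an
    unordered set, so changed word order does not hurt."""
    # Index primary-term positions for fast lookup (assumed optimization).
    positions = {}
    for i, term in enumerate(primary):
        positions.setdefault(term, []).append(i)
    candidates = []
    for j, term in enumerate(secondary):
        for p in positions.get(term, []):
            window = set(primary[max(0, p - radius): p + radius + 1])
            following = secondary[j + 1: j + 1 + lookahead]
            matched = [t for t in following if t in window]
            if matched:  # at least one further term found
                candidates.append((j, p, 1 + len(matched)))
    return candidates
```

For example, against the omission example of Section 3.3 (caesar suas copias subducit vs. Caesar suas copias in proximum collem subducit, after normalization), the first term already yields a candidate with three matched terms.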
4.3 Fuzzy search for further indicators for citations
The fuzzy search is used to find more candidates for a citation. Here we consider the results of the earlier manual analysis of the citations. For each citation candidate we apply fuzzy search to the following five words within our overlapping window.
For the irregular term differences we use a combination of the Levenshtein distance and the Dice coefficient. The omission of terms is handled in different ways. For the omissions after names we use the lists with the different types of names. If the searched term is included in one of the lists, we increase the search region to 50 terms.
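One possible reading of this combination, sketched in Python, is to accept a fuzzy match only when both measures agree. The thresholds (max_dist, min_dice) and the use of distinct character bigrams for the Dice coefficient are assumptions of ours; the paper does not specify them:

```python
def levenshtein(a: str, b: str) -> int:
    """Standard edit distance via dynamic programming (one row kept)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def dice(a: str, b: str) -> float:
    """Dice coefficient over distinct character bigrams."""
    ba = {a[i:i + 2] for i in range(len(a) - 1)}
    bb = {b[i:i + 2] for i in range(len(b) - 1)}
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

def fuzzy_match(a: str, b: str, max_dist=2, min_dice=0.5) -> bool:
    """Hypothetical combination: both measures must accept the pair."""
    return levenshtein(a, b) <= max_dist and dice(a, b) >= min_dice
```

Under these assumed thresholds, the spelling variant conlocuti → collocuti from Section 3.2 is accepted (edit distance 1, high bigram overlap), while unrelated terms are rejected.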
For the different number representations we work with a table mapping different versions of a number onto the corresponding digit (e.g. Table 1). If the searched term is found in this table we additionally check if we can find other versions of this number in the search region.
Table 1. Mapping table for numbers in Latin
I    unus   primus    semel   singuli  → 1
II   duo    secundus  bis     bini     → 2
III  tres   tertius   ter     terni    → 3
IV   …
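A fragment of such a mapping table might look as follows in Python. The dictionary layout and the lookup helper are hypothetical and cover only the rows shown in Table 1:

```python
# Hypothetical fragment of the number-mapping table (cf. Table 1):
# every representation of a number maps to the same digit.
NUMBER_TABLE = {
    "i": 1, "unus": 1, "primus": 1, "semel": 1, "singuli": 1,
    "ii": 2, "duo": 2, "secundus": 2, "bis": 2, "bini": 2,
    "iii": 3, "tres": 3, "tertius": 3, "ter": 3, "terni": 3,
}

def numbers_match(term_a: str, term_b: str) -> bool:
    """Two terms count as a match if they map to the same digit."""
    a = NUMBER_TABLE.get(term_a.lower())
    b = NUMBER_TABLE.get(term_b.lower())
    return a is not None and a == b
```

This lets the substitution quinque → v from Section 3.5 be handled the same way as the table rows above: the quote's word and the source's numeral resolve to the same digit.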
Since authors are often mentioned within a citation, we search in the list of author names to see if we can find an author or an abbreviation for one (at least three characters of an author name followed by a dot). Thus “Verg.” matches “Vergil”.
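This abbreviation rule can be sketched directly; the list of author names here is illustrative only:

```python
AUTHOR_NAMES = ["Vergil", "Caesar", "Cicero", "Livy"]  # illustrative list

def matches_author(token: str, names=AUTHOR_NAMES) -> bool:
    """A token is an author abbreviation if it consists of at least
    three characters of a listed name followed by a dot."""
    if not token.endswith("."):
        return False
    prefix = token[:-1]
    return len(prefix) >= 3 and any(
        name.lower().startswith(prefix.lower()) for name in names
    )
```

With this rule, “Verg.” matches “Vergil”, while “Ve.” is too short and a prefix without a dot is not treated as an abbreviation.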
4.4 Pruning
After the fuzzy search for further terms we prune the candidate set of possible citations. We distinguish between an internal and external pruning. Internal pruning sorts out candidates starting at the same position in the secondary text, while the external pruning compares the different search windows.
Internal pruning
Since we assume that a quotation should match a unique chunk in the primary text, at most one of the candidates starting at the same position can be a citation. During the internal pruning all candidates within one search window that have fewer matched terms than the citation candidate with the maximum found terms are deleted.
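A minimal sketch of internal pruning, assuming each candidate is represented as a (start position in the secondary text, position in the primary text, number of matched terms) triple; this representation is our own illustrative choice:

```python
def internal_prune(candidates):
    """For each start position in the secondary text, keep only the
    candidate(s) with the maximum number of matched terms."""
    groups = {}
    for cand in candidates:
        start, _pos, _matched = cand
        groups.setdefault(start, []).append(cand)
    pruned = []
    for group in groups.values():
        top = max(matched for _, _, matched in group)
        pruned.extend(c for c in group if c[2] == top)
    return pruned
```

For example, two candidates starting at the same secondary-text position with 3 and 1 matched terms are reduced to the one with 3.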
External pruning
The external pruning process checks whether different windows overlap each other and could be merged. Thus the length of the detectable citations is adapted automatically and is not restricted to the original search region of the citation candidates. If we merge two candidates we insert an additional internal pruning step; if more than one citation candidate is left at the same position, we can delete those whose window is not increased, since they now have fewer indicators for a citation in comparison with the merged candidates.
5 Evaluation
The Perseus Digital Library contains hundreds of thousands of textual quotations with their citations tagged in a machine-actionable form (the TEI tag CIT, which contains a QUOTE and a BIBL). We could thus use these already tagged quotations as a test set, comparing the results of our automatic quote identification with the manual tagging. For the evaluation of our approach we took the dictionary entry for the Latin preposition ab, “from,” in Lewis and Short, A Latin Dictionary5 and searched for citations of 5 terms or more within the documents from Caesar, Cicero and Livy in the Perseus collection. Since we had no term frequencies for Latin in general, we calculated term frequencies for the test collection.
Table 2. Recall and precision values for the test documents

          Recall   Precision   Number of quotations
Caesar    0.95     0.86        19
Cicero    0.91     0.39        22
Livy      0.94     0.57        18
5 http://www.perseus.tufts.edu/hopper/text.jsp?doc=Perseus:text:1999.04.0059:entry=a%5Eb
With a range from 0.91 to 0.95 for the recall (Table 2), we find more than nine out of ten quotes. For example, we could find the alternation ab → a and could match octo → VIII within the quote onerariae naves, quae ex eo loco ab milibus passuum octo vento tenebatur → onerariae naves, quae ex eo loco a milibus passuum VIII vento tenebantur. The position of half of the missed citations was also identified, but these candidates were not selected as citations because they had fewer found terms.
Most of these missed citations were not found because they started with a non-direct match and/or omissions in the citation and thus were too short to be identified with the parameters used (e.g. jam inde ab infelici pugnà ceciderant animi → iam inde ab infelici pugna castrisque amissis ceciderant animi). We could solve this problem by applying the fuzzy search not only to the terms following a found term but also to the preceding ones. A general fuzzy search without any direct match would also be possible, but here we would expect decreasing precision. An omission within a quote leads to a shorter quote if it occurs exactly at the position where we check whether the search windows should be merged; in that case the citation cannot be identified. An increased search window or an optimized merging of the search windows could enhance the approach accordingly.
The precision values have a much larger spread than the recall values. Regarding the false positives, we discovered that their terms were much more disarranged than those of the real citations. Even if some irregularities can occur in a quote, the order should not be too confused (e.g. Fig. 1).
Fig. 1. Disordered false positive quote
Another reason for the low precision was the overlap of English and Latin terms (e.g. a and in). The problem is further complicated by the fact that English is not a highly inflected language and thus contains a lot of short terms. Short terms more easily match short Latin words, especially when we are using the Levenshtein distance (e.g. the → te).
The evaluation shows that the choice of citation candidates (overlapping search windows) and the pruning criteria (based on comparison among candidates at the same position, of following candidates and the term frequency) work well, since most of the citations reach the wider area for the final decision. At the moment the final decision is simply based on the number of found terms. This decision criterion needs to be improved in order to also find shorter citations. Additionally, the order of terms in the original texts should be taken into account, since high term frequencies for terms within the false positives often correspond to a highly disarranged word order. For primary and secondary documents written in different languages (except for the citations), we should identify the frequent overlapping terms and similar terms, based on an advanced similarity measure for both languages respectively, in order to build a sort of bilingual stopword list. Found terms from this list should be ranked lower than other terms. If both documents are written in the same language, we could work with a standard stopword list. Furthermore, the type of each found term matters: each direct match increases the probability of a quote more than a term found by fuzzy search techniques.
We can therefore identify four main factors for a citation identification: number of terms, term order, term frequencies and stop words. A further development of the approach will likely involve the creation of a similarity measure that takes the above mentioned factors into account. This measure would increase the precision by enhancing the decision process in the pruning phase. Finally, with an optimization of the parameters (such as the size of the search windows), we could also achieve a further increase of the recall.
6 Conclusions and Future Work
In this paper we described an approach for identifying quotations in secondary literature. The evaluation shows that while we achieved a high recall, we still need to improve the precision values. This improvement can hopefully be attained with the development of the described similarity measure, taking the main factors for a quotation into account. Due to the different adjustable parameters, it will not be a problem to integrate further types of fuzzy search techniques into the algorithm (e.g. partial search in a different search window, or different similarity measures).
Since the word formation for numbers is quite regular in most languages, it could be an enhancement to include in the mapping table only the basic numbers of the word formation (e.g. the numbers from one to ten, afterwards every tenth number until 100, and so on). This list should then be combined with an algorithm that is able to figure out the different digits in the word.
It is still problematic to identify very short quotes, especially when the single terms have a high term frequency. Even though the style of cross-references varies, we can assume that there are certain similarities between them, at least for documents from one author. For example, a typical cross-reference from Lewis & Short is listed as Caes. B. C. 1, 35. This cross-reference contains an abbreviation for the author (Caesar), followed by an abbreviation for the title of a work (Bellum Civile), and is closed by numerical position descriptions for the quote. Thus we could use a three-step approach in order to find more citations. In the first step we could identify quotes with the approach described in this paper. We could then use the automatically identified citations as a training set for a rule-based approach based on [15], where we learn rules for the cross-references. With these rules we could then identify the position of the missing quotes. At these positions we could search again with the described approach, but with more flexible parameters, since we already know that there must be a quote around that position.
A further extension could be an analysis of whether the citation candidates contain very frequent word sequences like combinations of prepositions and nouns. Due to their high frequencies, however, these types of words might lead to a high percentage of misidentified citations.
The described approach could also support the authoring process. During typing, it could auto-complete and correct quotations. Furthermore, it could be possible to find the corresponding primary document and the related references for a quotation entered by the author.
Due to the flexibility of our approach with the automatically expanded search windows, an extension of the described algorithm could also be used for identifying multiple editions of the same work. We could treat one text as a citation of the other. The main difference from citations lies in identifying the notes integrated into the different editions.
References
1. Crane, G.: What Do You Do With A Million Books? D-Lib Magazine, 12 (2006), http://www.dlib.org/dlib/march06/crane/03crane.html
2. Stewart, G., Crane, G., Babeu, A.: A New Generation of Textual Corpora: Mining Corpora from Very Large Collections. In: JCDL 2007: Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 356-365. ACM Press, New York (2007)
3. Kinable, G.: Computerized Restoration of Historical Dictionaries: Uniformization and Date-assigning in Dictionary Quotations of the Woordenboek der Nederlandsche Taal. Literary & Linguistic Computing, 21, 295-310 (2006)
4. Pouliquen, B., Steinberger, R., Best, C.: Automatic Detection of Quotations in Multilingual News. In: Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP'2007)
5. Lukashenko, R., Graudina, V., Grundspenkis, J.: Computer-Based Plagiarism Detection Methods and Tools: An Overview. In: Rachev, B., Smrikarov, A., Dimov, D. (eds.): CompSysTech '07: Proceedings of the 2007 International Conference on Computer Systems and Technologies, Article no. 40. ACM Press, New York (2007)
6. Brin, S., Davis, J., García-Molina, H.: Copy Detection Mechanisms for Digital Documents. In: Carey, M., Schneider, D. (eds.): Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pp. 398-409. ACM Press, New York (1995)
7. Hoad, T. C., Zobel, J.: Methods for Identifying Versioned and Plagiarized Documents. Journal of the American Society for Information Science and Technology, 54, 203-215 (2003)
8. Zaslavsky, A., Bia, A., Monostori, K.: Using Copy-Detection and Text Comparison Algorithms for Cross-Referencing Multiple Editions of Literary Works. In: Research and Advanced Technology for Digital Libraries. 5th European Conference, ECDL 2001. LNCS, vol. 2163, pp. 103-114. Springer, Heidelberg (2001)
9. Stein, B., Meyer zu Eissen, S.: Near Similarity Search and Plagiarism Analysis. In: Spiliopoulou, M., Kruse, R., Borgelt, C., Nürnberger, A., Gaul, W. (eds.): From Data and Information Analysis to Knowledge Engineering, pp. 430-437. Springer, Berlin-Heidelberg (2005)
10. Metzler, D., Dumais, S., Meek, C.: Similarity Measures for Short Segments of Text. In: 29th European Conference on IR Research, ECIR 2007. LNCS, vol. 4425, pp. 16-27. Springer, Heidelberg (2007)
11. Metzler, D., Bernstein, Y., Croft, W.B., Moffat, A., Zobel, J.: Similarity Measures for Tracking Information Flow. In: CIKM '05: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pp. 517-524. ACM Press, New York (2005)
12. Lee, J.: A Computational Model of Text Reuse in Ancient Literary Texts. In: 45th Annual Meeting of the Association for Computational Linguistics, pp. 472-479. ACL (2007)
13. Takeda, M., Fukuda, T., Nanri, I., Yamasaki, M., Tamari, K.: Discovering Instances of Poetic Allusion from Anthologies of Classical Japanese Poems. Theoretical Computer Science, 292, 497-524 (2003)
14. Hori, H., Shimozono, S., Takeda, M., Shinohara, A.: Fragmentary Pattern Matching: Complexity, Algorithms and Applications for Analyzing Classic Literary Works. In: Eades, P., Takaoka, T. (eds.): ISAAC 2001. LNCS, vol. 2333, pp. 719-730. Springer, Heidelberg (2001)
15. Ernst-Gerlach, A., Fuhr, N.: Generating Search Term Variants for Text Collections with Historic Spellings. In: Lalmas, M., MacFarlane, A., Rueger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds.): 28th European Conference on IR Research, ECIR 2006. LNCS, vol. 3936, pp. 49-60. Springer, Heidelberg (2006)