DUBLIN CORE METADATA IN ELECTRONIC TEXT REPOSITORIES

The Use of Dublin Core Metadata in Two Electronic Text Literary Repositories: A Celebration of Women Writers and 19th Century British Novels

INFO 662: Metadata

Vickie Marre Karasic December 6, 2013

 

Table of Contents

Introduction
Dublin Core
Electronic Text Collections
Project Methods and Goals
Data and Results
    Table 1: Mapping Scheme for CWW Collection
    Table 2: Mapping Scheme for British Novels Collection
    Table 3: Percentage of Total Number of DC Metadata Elements Used by Two Collections
Discussion
    Completeness
    Accuracy
    Consistency
    Controlled Vocabulary
Conclusion
Appendix A
Appendix B
Appendix C
Appendix D
References

Introduction

Metadata is often described as “data about data,” or “structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource” (NISO, 2004). Metadata provides users with “adequate and correct information” about resources, giving them a clear picture of a resource’s content (Margaritopoulos, Margaritopoulos, Mavridis, & Manitsaris, 2008, p. 106). As many scholars have noted, the main benefits and drawbacks of working with metadata involve consistency and flexibility. For example, Park and Carpenter (2009) write, “One of the great difficulties surrounding the drive to foster greater metadata consistency is the pull between uniformity and consistency on the one hand and flexibility on the other” (p. 14). Moreover, locally added metadata can create barriers to interoperability across metadata collections, especially when local guidelines are unavailable to users. As digital collections continue to grow and promote resource discovery, quality metadata is essential to consistent and accurate records (Park, 2009). Metadata quality enables increased resource sharing, access and, ultimately, interoperability across digital collections (Bui & Park, 2006).

Thus, the purpose of this project is to examine the use of the Dublin Core (“DC”) Metadata Element Set as a basis for measuring interoperability, through an analysis of metadata quality across two electronic text literary repositories. The two collections chosen, A Celebration of Women Writers (“CWW”) and the University of Illinois’ 19th Century British Novels (“British Novels”),¹ represent diversity in electronic text collections: full-text transcriptions (CWW) and digital text/image collections (British Novels). These two repositories were chosen for their use of Dublin Core and for their differences in content, format, and quality, in order to provide a comprehensive analysis of metadata quality and interoperability.

Dublin Core

The Dublin Core Metadata Initiative (DCMI), founded in 1995 in Dublin, Ohio, provides a standard for and facilitates resource description within and across collections. With the advent of the Internet and large amounts of online content in the 1990s, the DC scheme developed in order to create community-driven, standardized terms to allow for easier indexing and discovery of scholarly resources (Park, 2013a). DCMI was developed to assist cross-domain resource description; its purpose is not to replace other metadata schemes, but rather to “co-exist…with metadata standards that offer other semantics” (NISO, 2001, p. v). In 2001, the National Information Standards Organization (“NISO”) approved the Dublin Core Metadata Element Set as NISO standard Z39.85. DCMI developed as a collaborative effort among librarians, computer scientists, and those in the information technology fields (Park, 2013a). Today, this collaborative tradition continues within the DCMI community, bringing together “people working in a domain with interests related to Dublin core metadata, the use of Dublin Core specifications, and in metadata best practices in the domain” (DCMI, 2013a). As NISO Z39.85 (2001) states, the DCMI community has grown since 1995 from those in the library and information science field to a wide range of global interests, including those in the “arts, sciences, education, business, and government sectors” (p. v). Many opportunities are available for community members to get involved in DCMI practices, including task groups, workshops, regional meetings, conferences, and webinars.

¹ A Celebration of Women Writers will be referred to as “CWW” and 19th Century British Novels will be referred to as “British Novels” throughout this report for ease of reference.

The Dublin Core Metadata Element Set consists of 15 simple/unqualified elements to describe the basic structure and characteristics of digital objects (NISO, 2004). The following unqualified elements are described in more detail on the DCMI Metadata Element Set site (http://dublincore.org/documents/dces/; DCMI, 2013b):

• Contributor: An entity responsible for making contributions to the resource.
• Coverage: The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant.
• Creator: An entity primarily responsible for making the resource.
• Date: A point or period of time associated with an event in the lifecycle of the resource.
• Description: An account of the resource.
• Format: The file format, physical medium, or dimensions of the resource.
• Identifier: An unambiguous reference to the resource within a given context.
• Language: A language of the resource.
• Publisher: An entity responsible for making the resource available.
• Relation: A related resource.
• Rights: Information about rights held in and over the resource.
• Source: A related resource from which the described resource is derived.
• Subject: The topic of the resource.
• Title: A name given to the resource.
• Type: The nature or genre of the resource.
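As an illustration of the element set listed above, a simple DC record can be modeled as a mapping from element name to a list of values. The sketch below, in Python, is a hypothetical example rather than the format used by either collection: the function name `to_dc_xml`, the wrapper element `record`, and the sample values (including the name form and the language code) are assumptions. Only the fifteen element names and the namespace URI http://purl.org/dc/elements/1.1/ come from the Dublin Core specification.

```python
import xml.etree.ElementTree as ET

# Namespace of the simple Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen simple/unqualified element names.
DC_ELEMENTS = (
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
)


def to_dc_xml(record):
    """Serialize {element: [values]} as a flat XML fragment of dc:* elements.

    Each element may be absent or may carry several values.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")  # hypothetical wrapper element
    for element, values in record.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a simple DC element: {element}")
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample record; values are for illustration only.
sample = {
    "title": ["David Copperfield"],
    "creator": ["Dickens, Charles"],
    "type": ["text"],
    "language": ["eng"],
    "subject": ["Orphans", "Boys"],
}
xml_text = to_dc_xml(sample)
```

Storing every value in a list keeps single and repeated elements uniform, so a record with two subjects needs no special handling.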

Dublin Core has been praised for its simplicity and flexibility in creating, describing, and maintaining metadata for electronic and web resources; this flexibility lies in its use of optional, repeatable elements (Park, 2013a). While the simple elements above provide basic metadata descriptions, DC has also developed “qualified” element guidelines, which refine an element so that it describes a resource more specifically. The DC qualifiers can be found on the DCMI site, http://dublincore.org/documents/usageguide/qualifiers.shtml, along with controlled encoding schemes for certain elements, such as LC Authorities for the Subject field. Examples of qualified elements include “Relation – IsPartOf” and “Coverage – Spatial/Temporal” (DCMI, 2013c). While DCMI guidelines provide collections with standards for creating metadata, DC also allows locally added elements in item records. In fact, NISO Z39.85 (2001) explicitly allows for local modifications, stating, “Local or community based requirements and policies may impose additional restrictions, rules, and interpretations” (p. 1).

Although many note DC’s uncomplicated structure, ambiguities exist in the form of semantic overlap among certain elements. Some elements are related but differ in purpose, including: source and relation; format and type; subject and description; and creator, contributor, and publisher (see definitions in the list above). Two element pairs in particular, format and type, and source and relation, are often used interchangeably and inconsistently within and across collections (Park, 2005).

Despite these challenges and barriers to interoperability, Dublin Core is a widely and internationally used standard. Many sites harvest current DC projects, including DCMI (http://dublincore.org/projects/), IMLS Digital Content Collections (http://imlsdcc.grainger.uiuc.edu/), ContentDM (http://www.oclc.org/contentdm.en.html), and OAIster (http://www.oaister.org/) (Park, 2013a).

Electronic Text Collections

Electronic text collections have become increasingly important in literary studies, as digital collections give scholars an opportunity to experiment with texts in novel ways. Electronic text collections also provide great opportunities for students’ interactive learning. Large-scale digitization projects, such as Project Gutenberg and HathiTrust Digital Library, provide scholars and the general public alike with access to numerous works of English literature (Day & Wortman, 2000; Perrault & Aversa, 2013). These repositories have greatly changed literary studies and the resources that information professionals relate to literary scholars.

As one such electronic text collection, A Celebration of Women Writers (http://digital.library.upenn.edu/women/writers.html) provides full-text versions of over 150 books by international female authors. Mary Mark Ockerbloom began the collection in 1994 during her time at the University of Pennsylvania; she continues to maintain the repository today (Ockerbloom, 2013). The collection strives to provide free, open access to women’s writings in all genres, with emphasis on fiction and poetry, children’s literature, Canadian authors, and personal accounts of important historical periods and figures (IMLS DCC, n.d.). Images are often used throughout the collection; however, these include book covers or relevant book illustrations, rather than images of text and book pages. The collection is intended for broad audiences, from general users (including young adults) to literary and/or cultural scholars. The collection lacks local guidelines but generally adheres to the IMLS Digital Collections and Content (“DCC”) Collection Description Metadata Scheme; LC Authority Files are used for controlled vocabularies.

As another electronic text collection, 19th Century British Novels, part of the University of Illinois Digitized Books Collection, Illinois Harvest (http://illinoisharvest.grainger.uiuc.edu/fulldisplay.asp?cid=81133), contains over 7,000 digitized versions of 19th century British novels published as serials in the periodical press. The collection provides access to the University of Illinois’ expansive collections, as part of the University’s large-scale digitization effort with the Open Content Alliance (OCA) since 2007 (Illinois Harvest, n.d.). While open to the general public, the collection largely attracts scholarly audiences. British Novels lacks local guidelines but adheres to the IMLS DCC Collection Description Metadata Scheme. The collection generally uses LC Authority Files for controlled vocabularies; however, other authority standards are used occasionally (e.g., “rbgenr” for RBMS Genre Terms).

Project Methods and Goals

The two electronic text literary repositories studied, CWW and British Novels, are housed within the IMLS DCC aggregation. While electronic texts are available directly from the CWW site, item metadata must be accessed through IMLS DCC. Metadata for the British Novels collection can be retrieved through Illinois Harvest, an aggregation site for the University of Illinois Library’s digital collections. To analyze a representative and random sample of each collection’s metadata, thirty records from each collection were chosen (n=60). Mappings were then made between locally added elements (from the IMLS DCC Collection Description Metadata Scheme, Appendix A) and the unqualified Dublin Core element set. Controlled vocabularies for each repository were determined and analyzed, and items were examined for completeness, accuracy, and consistency. When analyzing the collections, the following questions were asked:

• Which DC elements are used most and least frequently in each collection and across collections?
• Which DC elements are not used?
• Which elements cause incompleteness, inaccuracies, and inconsistencies?
• What hinders interoperability between these two collections?

In answering these questions, the overall goals of this project included examining the use of Dublin Core across two different electronic text repositories; determining controlled vocabularies; analyzing metadata quality in terms of completeness, accuracy, and consistency; and displaying and evaluating interoperability between CWW and British Novels.
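The sampling step in the methods (thirty records drawn from each collection) can be sketched as follows. The report does not say how the random sample was drawn, so this is only one plausible approach; the identifier lists, the function name `draw_sample`, and the seed values are hypothetical, and the list sizes are stand-ins based on the approximate collection sizes given earlier.

```python
import random


def draw_sample(record_ids, size=30, seed=None):
    """Draw `size` distinct record identifiers without replacement.

    Passing a seed makes the draw repeatable, so the same sample
    can be re-examined later.
    """
    rng = random.Random(seed)
    return rng.sample(list(record_ids), size)


# Hypothetical identifiers standing in for each collection's item records.
cww_ids = [f"cww-{i}" for i in range(150)]
british_novels_ids = [f"bn-{i}" for i in range(7000)]

cww_sample = draw_sample(cww_ids, seed=1)
bn_sample = draw_sample(british_novels_ids, seed=2)
combined = cww_sample + bn_sample  # 60 records in total
```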

Data and Results

Before examining the results of this project, it is important to provide mapping schemes for both CWW and British Novels so that it is clear which elements come from the Dublin Core Metadata Element Set and which have been locally added by each collection. In addition to standard metadata guidelines, local or “home-grown” metadata may be useful in providing “rich, detailed descriptions” for particular local communities (Park, Tosaka, & Lu, 2010, p. 1). The tables below show the mapping schemes, including local elements, for the two collections studied.

Table 1: Mapping Scheme for CWW Collection

Display Labels    DC Elements
Creator           DC: Creator
Type              DC: Type
Date              DC: Date
Publisher         DC: Publisher
Language          DC: Language
Format            DC: Format
Description       DC: Description
Subject           DC: Subject
Relation          DC: Relation
IsPartOf          DC: Relation:IsPartOf

In Table 1, it can be seen that all display labels in the CWW collection map directly to DC elements. The only “locally” added element in the CWW collection is the “IsPartOf” field, which denotes the parent item from which each record comes. Although it appears as a local element, it is in fact a DC qualification of the “relation” element (DCMI, 2013c) and will be treated as a DC element in the results.

Table 2: Mapping Scheme for British Novels Collection

Display Labels    DC Elements
Title             DC: Title
Identifier        DC: Identifier
Creator           DC: Creator
Type              DC: Type
Publisher         DC: Publisher
Date              DC: Date
Language          DC: Language
Subject           DC: Subject
Description       DC: Description
Relation          DC: Relation
From              DC: Source

As can be seen in Table 2, all metadata elements used in the British Novels collection correspond directly to DC Metadata Elements except for one, the locally added “from” field. For each item examined, this element describes the parent collection from which the record comes (listed as “19th Century British Novels / University of Illinois Digitized Books”), and has thus been mapped to the DC element “source,” which is defined as “a related resource from which the described resource is derived” (DCMI, 2013b).

Table 3 below summarizes the results for the 60 records examined across the CWW and British Novels collections. The table shows that the 30 CWW records contain 228 metadata elements, representing 7 of the unqualified Dublin Core elements and one qualified element, with no locally added items. The 30 British Novels records contain 225 elements representing 10 elements of the simple Dublin Core Element Set, plus 30 instances of the locally added “from” field.
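The mapping schemes in Tables 1 and 2 can be expressed as simple lookup tables. The sketch below is a minimal illustration, not the tooling used in this project: the function name `crosswalk` and the dotted notation `relation.isPartOf` for the qualified element are assumptions, while the label-to-element pairs are transcribed from the two tables and the sample values appear in the British Novels records discussed in this report.

```python
# Display label -> DC element, transcribed from Table 1.
CWW_MAP = {
    "Creator": "creator", "Type": "type", "Date": "date",
    "Publisher": "publisher", "Language": "language", "Format": "format",
    "Description": "description", "Subject": "subject",
    "Relation": "relation",
    "IsPartOf": "relation.isPartOf",  # qualified element; notation is assumed
}

# Display label -> DC element, transcribed from Table 2.
BRITISH_NOVELS_MAP = {
    "Title": "title", "Identifier": "identifier", "Creator": "creator",
    "Type": "type", "Publisher": "publisher", "Date": "date",
    "Language": "language", "Subject": "subject",
    "Description": "description", "Relation": "relation",
    "From": "source",  # local field, mapped to "source" in this study
}


def crosswalk(record, mapping):
    """Return (mapped, unmapped) for one record.

    mapped:   DC element -> list of values
    unmapped: display labels with no entry in the mapping
    """
    mapped, unmapped = {}, []
    for label, value in record.items():
        element = mapping.get(label)
        if element is None:
            unmapped.append(label)
        else:
            mapped.setdefault(element, []).append(value)
    return mapped, unmapped


record = {
    "Title": "The Woman in White / Vol. 1",
    "Type": "text",
    "From": "19th Century British Novels / University of Illinois Digitized Books",
}
mapped, unmapped = crosswalk(record, BRITISH_NOVELS_MAP)
```

Returning the unmapped labels separately makes it easy to spot display labels that one collection uses and the other does not, such as “Title” when a British Novels record is run through the CWW mapping.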

Table 3: Percentage of Total Number of DC Metadata Elements Used by Two Collections

Used = elements used; % elem. = % of total elements; % rec. = % of records.

Element        | CWW                    | British Novels         | Total
               | Used  % elem.  % rec.  | Used  % elem.  % rec.  | Used  % elem.  % rec.
               | n=30  n=228    n=30    | n=30  n=255    n=30    | n=60  n=483    n=60
Title          | 0     0.0%     0%      | 30    11.8%    100%    | 30    6.2%     50%
Creator        | 0     0.0%     0%      | 28    11.0%    93%     | 28    5.8%     47%
Subject        | 26    11.4%    87%     | 8     3.1%     27%     | 34    7.0%     57%
Description    | 25    11.0%    83%     | 8     3.1%     27%     | 33    6.8%     55%
Publisher      | 0     0.0%     0%      | 30    11.8%    100%    | 30    6.2%     50%
Contributor    | 0     0.0%     0%      | 0     0.0%     0%      | 0     0.0%     0%
Date           | 27    11.8%    90%     | 30    11.8%    100%    | 57    11.8%    95%
Type           | 30    13.2%    100%    | 30    11.8%    100%    | 60    12.4%    100%
Format         | 30    13.2%    100%    | 0     0.0%     0%      | 30    6.2%     50%
Identifier     | 0     0.0%     0%      | 30    11.8%    100%    | 30    6.2%     50%
Source         | 0     0.0%     0%      | 0     0.0%     0%      | 0     0.0%     0%
Language       | 30    13.2%    100%    | 30    11.8%    100%    | 60    12.4%    100%
Relation       | 30    13.2%    100%    | 1     0.4%     3%      | 31    6.4%     52%
IsPartOf       | 30    13.2%    100%    | 0     0.0%     0%      | 30    6.2%     50%
Coverage       | 0     0.0%     0%      | 0     0.0%     0%      | 0     0.0%     0%
Rights         | 0     0.0%     0%      | 0     0.0%     0%      | 0     0.0%     0%
DC totals      | 228   100.0%   --      | 225   88.2%    --      | 453   93.8%    --
Non-DC: from   | 0     0.0%     0%      | 30    11.8%    100%    | 30    6.2%     50%
Total          | 228   100.0%   --      | 255   100.0%   --      | 483   100.0%   --
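The two percentage columns in Table 3 follow from the raw counts: the share of total elements is a field's count divided by all populated elements in the collection, and the share of records is the same count divided by the number of records sampled. The sketch below recomputes both; the counts are taken from Table 3, while the function name `shares` and the rounding choices (one decimal place for elements, whole percent for records) are assumptions made to match the table's presentation.

```python
# Populated-field counts per collection, taken from Table 3
# (fields with a count of zero are omitted).
CWW_COUNTS = {
    "Subject": 26, "Description": 25, "Date": 27, "Type": 30,
    "Format": 30, "Language": 30, "Relation": 30, "IsPartOf": 30,
}
BRITISH_NOVELS_COUNTS = {
    "Title": 30, "Creator": 28, "Subject": 8, "Description": 8,
    "Publisher": 30, "Date": 30, "Type": 30, "Identifier": 30,
    "Language": 30, "Relation": 1, "from": 30,  # "from" is the local field
}
RECORDS_PER_COLLECTION = 30


def shares(counts, records):
    """Map each field to (% of total elements, % of records)."""
    total = sum(counts.values())
    return {
        field: (round(100 * n / total, 1), round(100 * n / records))
        for field, n in counts.items()
    }


cww_shares = shares(CWW_COUNTS, RECORDS_PER_COLLECTION)
bn_shares = shares(BRITISH_NOVELS_COUNTS, RECORDS_PER_COLLECTION)

# Combined totals across both collections.
cww_total = sum(CWW_COUNTS.values())
bn_total = sum(BRITISH_NOVELS_COUNTS.values())
grand_total = cww_total + bn_total
```

The “% of records” figure is the completeness rate used in the Discussion: for example, 27 populated date fields out of 30 CWW records gives 90%.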

Discussion

According to Park (2009), metadata quality can be assessed according to the functional requirements it meets for the systems it supports (p. 214), as well as commonly used criteria, including completeness, accuracy, and consistency (p. 219). The results in Table 3 will be examined in terms of these criteria, along with an analysis of controlled vocabulary. These criteria not only measure metadata quality in the two collections but also indicate the level of interoperability between the electronic text repositories.

Completeness

As Duval et al. (2002) discuss, metadata “completeness” depends on the resource type, its relation to the local collections, and the overall metadata collection guidelines (as cited in Park, 2013c, p. 15). Since neither the CWW nor the British Novels collection lists guidelines, completeness was judged against the elements that each collection’s 30 sampled records use consistently for resource description, even if these elements contain a “null” field or are present but have a blank entry. As Table 3 displays, the CWW collection uses the following Dublin Core elements most frequently: creator, type, date, publisher, language, format, description, subject, relation, and IsPartOf, which is a qualified element of “relation” (DCMI, 2013c; Appendix B). Every record examined in the CWW collection lists all of these fields and can be said to be “complete” in this way. Although no locally added elements are present in these records, the “IsPartOf” field, a DC qualifier, can be considered to correspond to the core “collection” element of the IMLS DCC Collection Description Metadata Schema (Appendix A). Each record examined contains “Celebration of Women Writers, A” in its “IsPartOf” field to identify its parent collection (Appendix C, Fig. 1). Even though the creator and publisher fields exist in each item record, records show “null” in these fields for all 30 items examined.
This is surprising, considering these elements would not be difficult to map from the CWW site. “Null” also occurred three times in the date field (yielding a completeness rate of 90% (27/30) for this element) and five times in the description field (yielding a completeness rate of 83% (25/30) for this element). Similarly, there were 4 occurrences of the subject field being present but blank (Appendix C, Fig. 1), giving this element an 87% (26/30) completeness rate among the 30 items. All other elements are populated 100% of the time (Appendix B).

The British Novels collection was found to have a similar level of completeness. Each record includes the following DC fields most frequently: title, identifier, creator, type, publisher, date, language, subject, and description, with “from” as a locally added field most similar to the DC element “source” (Appendix D, Fig. 1). These elements showed high levels of completeness, with from, title, identifier, type, publisher, date, and language all having 100% representation among the 30 items examined (Appendix B). The creator element showed slight incompleteness, with only 2 items missing this field, for a completeness rate of 93% (28/30). The subject and description fields were not as complete; when there was no information present for these fields, they were not listed in the item record (Appendix D, Fig. 1), unlike in the CWW collection. Both subject and description elements were present in 27% (8/30) of the items examined. Aside from these three fields, the records in the British Novels collection can be considered complete in reference to the local and standard (DCMI) guidelines.

Across the two collections, the most frequently used DC elements were date, language, type, subject, and description. The least populated were contributor, source, rights, and coverage; in fact, none of these four fields was populated in any of the 60 records (Appendix B). This may indicate that this information is either not relevant or not known, since many of the items in these electronic text collections are out of copyright and date from the 19th and early 20th centuries. All remaining elements (title, identifier, publisher, format, and relation) were used about half of the time; relation was used 52% of the time across collections, which reflects its use in all 30 CWW records and in a single record out of the 30 examined in the British Novels collection.

Accuracy

Park (2009) describes accuracy in metadata quality as it “concerns the accurate description and representation of data and resource content,” which also takes into account “accurate data input” (p. 220). A number of inaccuracies were found in the 30 records examined in the CWW collection. The most common occurred in the subject and description fields. The CWW collection uses Library of Congress Subject Headings as a standard vocabulary for subject description; in fact, the CWW website can be searched by LCSH for ease of discovery (Ockerbloom, 2013). Except for four records that contained a subject with no information or a “null” field, the other 26 records examined included standard LC subject headings, a call number, or a combination of the two. Specifically, only 3 of the 30 records (10%) contained LC subject headings alone; 50% (15/30) included LCSH and a call number; and 8 out of 30 (27%) included only a call number in the subject field (Appendix C, Fig. 2). In this case, a call number could be considered a “classification code,” as defined by the Metadata Element Set guidelines for “subject” (DCMI, 2013b). The local IMLS DCC guidelines do not have requirements for subject other than inclusion of one heading from GEM (Gateway to Educational Materials), which, ironically, is not included in the CWW records examined. In the description field, the most common inaccuracy consisted of publisher information in lieu of a description.
Aside from 5 entries that had “null” description fields, 16 of the 30 records examined (53%) included publisher information in place of a description (Appendix C, Fig. 1).

The British Novels collection revealed more accurate metadata. However, the description field included inaccuracies in the type of information presented and the controlled vocabulary used. As noted above, only 8 of the 30 records examined (27%) in British Novels contained a description field. Five of these fields contain LC subject headings as descriptions (or variations on LCSH, which will be discussed in the consistency section), while the other three merely contain information about other volumes of the work at hand. Another minor inaccuracy in this collection concerns the “type” element. In 28 of the 30 records examined (93%), type adheres to DC guidelines in being described as “text,” as each of these items represents a printed monograph. In two of the item records, however, type also includes other subject headings according to certain vocabularies. For example, one item indicates “Mystery and detective fiction Great Britain” in the “type” field, with “rbgenr” after it to indicate this as an LC RBMS controlled vocabulary (Appendix D, Fig. 2). These inaccuracies stand out in a collection that otherwise employs accurate metadata.

Consistency

As another significant indication of metadata quality, consistency can be measured at the conceptual/semantic level (e.g., controlled vocabulary) and at the structural level (e.g., date formats) (Park, 2009). Inconsistency begets confusion among elements, for example when the DC elements source and relation are used interchangeably (Park, 2013c). As mentioned previously, the CWW collection showed inconsistency in its description field, which, for 16 out of 30 records (53%), contained publisher information. Another inconsistency in the CWW collection occurs in the date field. According to W3C Date and Time Formats (http://www.w3.org/TR/NOTE-datetime) as recommended by DCMI, “complete” dates should be in the form YYYY-MM-DD, while other acceptable formats include YYYY-MM and YYYY. Of the CWW dates examined, 3 were “null,” 20 of 30 (67%) list the year in YYYY format, and 7 of 30 (23%) list the complete date, YYYY-MM-DD. Interestingly, 8 of the 30 dates are recent (1994 to present), presumably reflecting when these items were added to the collection; all other dates list the publication date of the item (Appendix C, Fig. 3).

In the British Novels collection, inconsistency mainly occurs in how controlled vocabulary is used. As noted above, only 8 of the 30 records examined (27%) contained a description field. In half of these records, the description field includes variations on LC subject headings. For example, names mostly appear in this collection in the format “last name, first name,” with initials and full names used interchangeably. The record for Wilkie Collins’ The Woman in White / Vol. 1 contains a description including “Hubin, A.J.” when the LC subject heading for this person should be “Hubin, Allen J.” (Appendix D, Fig. 2). Another item uses only an author’s familiar name, “Sadleir,” in the description field, when the LC authority name is “Stoney, F. Sadleir (Francis Sadleir), 1834-1927” (Appendix D, Fig. 3). Thus, these are matters of inconsistent ordering and linguistic forms of LC subject headings and of DC elements, as subject and description are used interchangeably.

Overall, the 60 records examined across the CWW and British Novels collections show varying levels of completeness, accuracy, and consistency. For example, CWW showed the most weakness in completeness, with two fields consistently displaying “null” values. British Novels showed strong levels of completeness and accuracy but, like elements in the CWW collection, revealed inconsistencies in formats and controlled vocabulary in the subject field.
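The structural check applied to the date field in the Consistency section can be sketched as a small classifier. This is a hedged illustration rather than the procedure used in this project: the function name `classify_date`, the category labels, and the sample values are hypothetical, and only the three date-only forms named above (YYYY, YYYY-MM, YYYY-MM-DD) are recognized. The regular expressions check the shape of the string, not whether the day exists in that month.

```python
import re
from collections import Counter

# Date-only forms named in the text, most specific first.
DATE_PATTERNS = (
    ("YYYY-MM-DD", re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")),
    ("YYYY-MM", re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")),
    ("YYYY", re.compile(r"^\d{4}$")),
)


def classify_date(value):
    """Label a date string by format, or as 'missing' or 'other'."""
    if value is None or value.strip().lower() in ("", "null"):
        return "missing"
    text = value.strip()
    for label, pattern in DATE_PATTERNS:
        if pattern.match(text):
            return label
    return "other"


# Hypothetical date values, for illustration only.
sample_dates = ["1868", "1994-05-17", "null", "1901", "May 1994", ""]
tally = Counter(classify_date(d) for d in sample_dates)
```

Tallying the labels over a collection's date fields gives the kind of breakdown reported above, with “missing” corresponding to “null” entries.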
Semantic interoperability was relatively straightforward between these two collections, as both used Dublin Core standards and both were available through the IMLS DCC site. Certain local collection elements were spotted early on (i.e. “from” in British Novels and “IsPartOf” in CWW) and were consistent throughout each collection. Controlled Vocabulary As a subset of natural language, controlled vocabulary allows users to employ synonyms and equivalent terms to better access resources classified in and across collections (Morville & Rosen, 2006). In the context of knowledge representation, controlled vocabularies provide “value-added quality” to information for better retrieval (Svenonius, 2000, p. 127). Similarly, in the metadata field, controlled vocabulary can be defined as “vocabularies tagged to resources and documents” and includes organization systems such as thesauri, name and authority files, and classification schemes (Park, 2013b, p. 3). Keeping these definitions in mind, various observations about controlled vocabularies are made below in relation to the data retrieved from the CWW and British Novels collections. For ease of searching throughout the collection, Mary Mark Ockerbloom has created various search fields, including: author, date, country of origin, author ethnicity, general category (i.e. fiction, biography, poetry), copyright status, and “LC subject” (Ockerbloom, 2013). The collection does not contain guidelines on metadata creation or use of controlled vocabulary, nor does the IMLS Digital Collections and Content (DCC) repository. In searching the various fields within the CWW collection, the controlled vocabulary stems mostly from Library of Congress Authorities. A search by “LC Subject” lists all of the subject headings under which texts in the collection are categorized (Appendix C, Fig. 4). Although CWW does list common authority

terms, such as names of countries and author ethnicities, such terms can be considered domain-specific, as they list authors or works for better organization within this collection. Similarly, searches by author groupings (by letter of last name), dates (by century), and copyright (by “copyright expired,” “copyright renewed,” or “reproduced with permission” (Ockerbloom, 2013)) are specific to this collection and do not seem to contain a widely shared controlled vocabulary scheme. Author names, however, do conform to those listed in the Union List of Artist Names; one example is a 19th-century British author, Kate Greenaway, whose name in CWW conforms to the preferred term and convention in the Union List, “Greenaway, Kate” (Appendix C, Fig. 5).

British Novels also follows controlled vocabulary schemes in line with the Library of Congress Authorities. For example, a record from Charles Dickens’s David Copperfield includes subjects such as Orphans, Boys, Young Men, and Child Labor, all “authorized references” according to the LC Authority scheme (Appendix D, Fig. 4). The British Novels collection is part of the University of Illinois’s large-scale effort to digitize scholarly resources (Illinois Harvest, n.d.); the collection does not provide guidelines on metadata creation or controlled vocabularies. The collection also provides a link to the MARC XML record for each item; however, the XML is merely a document tree without an identifiable scheme (Appendix D, Fig. 4). Because the British Novels collection contains images of books, including those that can be viewed directly on the site or in downloadable PDF or e-Pub form, the images contain item records that link to Open Library, where the digital book is hosted. In this way, any vocabulary associated with the book’s image is domain-specific, to ensure that the University of Illinois site links to the image on Open Library (for example, physical description metadata that includes the volume number, number of pages, and publisher name and location). Because the images in this collection mainly contain text, they could not easily be matched to controlled vocabulary schemes in thesauri for visual resources. Continuing with the David Copperfield example, even though images exist within the book (Appendix D, Fig. 5), the closest match to image vocabulary would include a search for Charles Dickens as a controlled author term in ARTstor, according to its Languages and Literature subject guide (ARTstor, 2012). Such a search yields only images of Dickens himself and of characters in David Copperfield, but nothing that would link to the record of the book-image in the Illinois Harvest collection (Appendix D, Fig. 6).

The example in the British Novels collection shows the importance of controlled vocabulary in metadata and of normalizing terms (Svenonius, 2000) for linking resources both within the same collection and to those in external collections. In each of the collections examined, item records exist in at least two different places: for CWW, the records are found in IMLS DCC while the actual text of works appears on the CWW website; for the British Novels collection, item records appear on both the Illinois Harvest site and the University of Illinois Library collection page. Because the British Novels collection also links to book images in Open Library, metadata such as the Dublin Core elements of format, publisher, and location create important links between these two resources; however, the terminology used is not standard between the two databases (for example, the Illinois Harvest record uses the “Physical Description” field, whereas Open Library uses a field called “The Physical Object” to provide the same information). Even though these two resources are clearly linked, this example displays the need for controlled vocabularies and standardized terms, as locally added metadata creates difficulty in developing interoperability and resource discovery across collections (Park et al., 2010).
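The field-label mismatch noted above is the kind of problem a small crosswalk can address. The sketch below is hypothetical: only the two labels “Physical Description” and “The Physical Object” and the local “from” field come from the records discussed here, while the source keys, the choice of the DC `format` element as the shared target, and the sample values are assumptions made for illustration.

```python
# Hypothetical crosswalk from local display labels to a shared element name.
# Mapping both labels to the DC "format" element is an assumption of this sketch.
CROSSWALK = {
    "illinois_harvest": {"Physical Description": "format"},
    "open_library": {"The Physical Object": "format"},
}


def normalize_record(source, record):
    """Rename known local labels to shared element names; report unmapped labels."""
    mapping = CROSSWALK.get(source, {})
    normalized = {}
    unmapped = []
    for label, value in record.items():
        if label in mapping:
            normalized[mapping[label]] = value
        else:
            unmapped.append(label)
    return normalized, unmapped


# Invented sample values; the field labels are the ones quoted in the text.
harvest_record = {"Physical Description": "v. 2, 600 p.", "from": "Open Library"}
open_library_record = {"The Physical Object": "v. 2, 600 p."}

print(normalize_record("illinois_harvest", harvest_record))
print(normalize_record("open_library", open_library_record))
```

After normalization both records expose the same `format` key, so they can be compared directly, and any locally added label without a mapping (here “from”) is reported rather than silently dropped.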

Conclusion

One of the original goals of this project was to measure metadata quality in order to assess interoperability across the two electronic text collections studied, CWW and British Novels. Difficulties in achieving interoperability between these two collections arise from the interchangeable use of the description and subject elements, in addition to name and date formatting issues in each collection. The five most frequently used elements across the two collections studied (date, language, type, description, and subject) are not always accurately or consistently populated. Although completeness of elements is high within each collection, greater completeness could be achieved across collections, especially since each repository uses similar Dublin Core unqualified elements.

Throughout this evaluation, an overall challenge has been the lack of local metadata guidelines for these collections, which made it difficult to determine how metadata was implemented and which elements were required in each record. The addition of local guidelines to these collections and/or a mapping of IMLS DCC guidelines to other schemes would greatly improve the navigation and interoperability of the collections. The IMLS DCC recognizes the need to create mapping guidelines and, according to the website, is in the process of creating such a crosswalk (Shreeves & Jackson, 2008). Overall, Dublin Core’s flexibility and simplicity provide a strong basis for interoperable records across electronic text repositories. However, ambiguities among certain DC elements, in addition to a lack of local and/or mapping guidelines among collections, hamper ease of navigation and interoperability across such repositories.

Appendix A IMLS DCC Collection Description Metadata Scheme

(For more complete guidelines, see http://imlsdcc.grainger.uiuc.edu/CDschema_elements)

Appendix B Percentages of Dublin Core Elements Used Across Collections

| Element Name | Element Description | CWW | British Novels | Percentage across collections | Collections using DC elements |
|---|---|---|---|---|---|
| Title | A name given to the resource. | 0% | 100% | 50% | 1 out of 2 |
| Creator | An entity primarily responsible for making the resource. | 0% | 93% | 47% | 1 out of 2 |
| Subject | The topic of the resource. | 87% | 27% | 57% | 2 out of 2 |
| Description | An account of the resource. | 83% | 27% | 55% | 2 out of 2 |
| Publisher | An entity responsible for making the resource available. | 0% | 100% | 50% | 1 out of 2 |
| Contributor | An entity responsible for making contributions to the resource. | 0% | 0% | 0% | 0 out of 2 |
| Date | A point or period of time associated with an event in the lifecycle of the resource. | 90% | 100% | 95% | 2 out of 2 |
| Type | The nature or genre of the resource. | 100% | 100% | 100% | 2 out of 2 |
| Format | The file format, physical medium, or dimensions of the resource. | 100% | 0% | 50% | 1 out of 2 |
| Identifier | An unambiguous reference to the resource within a given context. | 0% | 100% | 50% | 1 out of 2 |
| Source | A related resource from which the described resource is derived. | 0% | 0% | 0% | 0 out of 2 |
| Language | A language of the resource. | 100% | 100% | 100% | 2 out of 2 |
| Coverage | The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant. | 0% | 0% | 0% | 0 out of 2 |
| Relation | A related resource. | 100% | 3% | 52% | 2 out of 2 |
| Rights | Information about rights held in and over the resource. | 0% | 0% | 0% | 0 out of 2 |
| IsPartOf | A related resource in which the described resource is physically or logically included. | 100% | 0% | 50% | 1 out of 2 |
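The last two columns of the table appear to be derived from the two per-collection percentages. The following sketch reproduces them under two assumptions inferred from the table values rather than stated in the text: the cross-collection figure is the simple mean of the two percentages rounded half up, and a collection counts as using an element when its percentage is above zero. The function names are hypothetical.

```python
import math

# (CWW %, British Novels %) for a few elements, copied from the table above.
usage = {
    "Creator": (0, 93),
    "Subject": (87, 27),
    "Description": (83, 27),
    "Relation": (100, 3),
}


def round_half_up(x):
    """Round to the nearest integer with .5 going up (Python's round() rounds .5 to even)."""
    return math.floor(x + 0.5)


def summarize(cww, novels):
    """Return (mean percentage across collections, number of collections using the element)."""
    average = round_half_up((cww + novels) / 2)
    collections_using = sum(1 for pct in (cww, novels) if pct > 0)
    return average, collections_using


for element, (cww, novels) in usage.items():
    average, using = summarize(cww, novels)
    print(f"{element}: {average}% across collections, {using} out of 2")
```

Half-up rounding matters for Creator: the mean of 0 and 93 is 46.5, which the table reports as 47%, whereas Python's built-in `round` would give 46.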

Appendix C CWW Sample Records

Figure 1: Item record

This record shows a range of quality issues: the locally added “IsPartOf” element (qualifier to DC relation element); “null” creator and publisher fields; a blank subject field; and publisher information in the “description” field.

Figure 2: Item record

This record uses both an LC subject heading and a call number in the “subject” field.

Figure 3: Item record

This item record displays a complete date in terms of DCMI and W3C format (YYYY-MM-DD), but is otherwise inconsistent: it gives the date the resource was added to the collection rather than the item’s publication date.

Figure 4: CWW Organization by LC Subjects

Retrieved from http://digital.library.upenn.edu/women/wr-LCSUB.html

Figure 5: CWW and Union List of Artist Names Author Search

Retrieved from http://digital.library.upenn.edu/webbin/wnsearch?searchtype=containing&name=Greenaway&firstyear=&lastyear=&birthyear=&deathyear=&country=any&ethnicity=any

Retrieved from http://www.getty.edu/vow/ULANFullDisplay?find=Greenaway&role=&nation=&prev_page=1&subjectid=500030743

Appendix D British Novels Sample Records

Figure 1: Item record

This record displays the locally added “from” field. Neither a subject nor description element is present.

Figure 2: Item record

This record displays subject information in the “type” field and an inconsistently formatted name in the “description” field.

Figure 3: Item record

This record shows a subject term in the “description” field and an incorrectly formatted LC subject term for the name “Sadleir.”

Figure 4: Item record and XML file

Retrieved from http://libsysdigi.library.illinois.edu/OCA/Books2010-10/davidcopperfield1/davidcopperfield02dicke/

Figure 5: Image of David Copperfield, from Open Library (linked to British Novels)

Retrieved from http://archive.org/stream/davidcopperfield02dicke#page/574/mode/2up

Figure 6: Sample images for “Charles Dickens” in ARTstor

Retrieved from http://library.artstor.org.ezproxy2.library.drexel.edu/library/welcome.html#1

References

ARTstor. (2012). Languages and literature. Retrieved from http://www.artstor.org.ezproxy2.library.drexel.edu/using-artstor/u-pdf/sg_languages_lit.pdf

Bui, A., & Park, J-r. (2006). An assessment of metadata quality: A case study of the National Science Digital Library Metadata Repository. In CAIS/ACSI 2006 Information Science Revisited: Approaches to Innovation, Haidar Moukdad (ed.). Proceedings of the 2006 annual conference of the Canadian Association for Information Science (pp. 1-13). Retrieved from http://www.cais-acsi.ca/proceedings/2006/bui_2006.pdf

Day, B. H., & Wortman, W. A. (2000). Introduction: Collaborative partnerships. In B. H. Day & W. A. Wortman (Eds.), Literature in English: A guide for librarians in the digital age (pp. 1-19). Chicago: ACRL.

Dublin Core Metadata Initiative. (2013a). Community and events. Retrieved from http://dublincore.org/community-and-events/

Dublin Core Metadata Initiative. (2013b). Dublin Core Metadata Element Set, Version 1.1. Retrieved from http://dublincore.org/documents/dces/

Dublin Core Metadata Initiative. (2013c). Dublin Core Qualifiers. Retrieved from http://dublincore.org/documents/usageguide/qualifiers.shtml

Illinois Harvest. (n.d.). Collection development policy. Retrieved from http://illinoisharvest.grainger.uiuc.edu/collection_policy.asp

IMLS Digital Collections and Content (DCC). (n.d.). A Celebration of Women Writers: Collection description. Retrieved from http://imlsdcc.grainger.uiuc.edu/Detail/Collection/70283

Margaritopoulos, T., Margaritopoulos, M., Mavridis, I., & Manitsaris, A. (2008). A conceptual framework for metadata quality assessment. Proceedings of the International Conference on Dublin Core and Metadata Applications, 2004, (pp. 104-113). Retrieved from http://dcpapers.dublincore.org/pubs/article/view/923/919

Morville, P., & Rosenfeld, L. (2006). Thesauri, controlled vocabularies, and metadata. Information Architecture for the World Wide Web (3rd ed.). Retrieved from http://proquestcombo.safaribooksonline.com.ezproxy2.library.drexel.edu/0596527349?uicode=drexelu

National Information Standards Organization. (2001). The Dublin Core Metadata Element Set. Retrieved from http://www.niso.org/apps/group_public/download.php/6578/The%20Dublin%20Core%20Metadata%20Element%20Set.pdf

National Information Standards Organization. (2004). Understanding metadata. Retrieved from http://www.niso.org/publications/press/UnderstandingMetadata.pdf

Ockerbloom, M. M. (2013). A Celebration of Women Writers. Retrieved from http://digital.library.upenn.edu/women/

Park, J-r. (2005). CAIS/ACSI 2005 Data, Information, and Knowledge in a Networked World, Liwen Vaughan (ed.). Proceedings of the 2005 annual conference of the Canadian Association for Information Science. The University of Western Ontario, London, Ontario. June 2 - 4, 2005 (pp. 1-12). Retrieved from http://www.cais-acsi.ca/proceedings/2005/park_J_2005.pdf

Park, J-r. (2009). Metadata quality in digital repositories: A survey of the current state of the art. Special issue on Metadata and Open Access Repositories (Michael S. Babinec and Holly Mercer Eds.). Cataloging and Classification Quarterly 47(3/4).

Park, J-r. (2013a). INFO 662 Lecture 3: Dublin Core (DC) Metadata Scheme [Word document]. Retrieved from https://learn.dcollege.net/

Park, J-r. (2013b). INFO 662 Lecture 6: Controlled vocabulary [Word document]. Retrieved from https://learn.dcollege.net/

Park, J-r. (2013c). INFO 662 Lecture 8: Metadata quality [Word document]. Retrieved from https://learn.dcollege.net/

Park, J-r., & Carpenter, B. (2009). Encoded Archival Description (EAD) metadata scheme: An analysis of use of the EAD headers. Journal of Library Metadata (pp. 1-16). Retrieved from https://learn.dcollege.net/

Park, J-r., Tosaka, Y., & Lu, C. (2010, February 23-26). Locally added home-grown metadata semantics: Issues and implications. The 11th International Society for Knowledge Organization (ISKO) Conference. Rome, Italy. Retrieved from https://learn.dcollege.net/

Perrault, A. H., & Aversa, E. S. (2013). Information resources in the humanities and the arts. Santa Barbara: Libraries Unlimited.

Shreeves, S. L., & Jackson, A. S. (2008). IMLS DCC collection description metadata scheme. Retrieved from http://imlsdcc.grainger.uiuc.edu/CDschema_elements

Svenonius, E. (2000). Subject languages: Introduction, vocabulary selection, and classification. The Intellectual Foundation of Information Organization. Cambridge, MA: MIT.