Understanding Metadata and Its Purpose

MANAGING TECHNOLOGY

by Karen Coyle

Available online 1 February 2005

Karen Coyle, 2176 North Valley Street, Berkeley, CA 94702, USA. The Journal of Academic Librarianship, Volume 31, Number 2, pages 160–163.



“Metadata is cataloging done by men.”¹

The world of information technology is awash in talk of metadata. Everyone today seems to be creating a metadata format. There is a <meta> tag in HTML to carry metadata for Internet resources; scientists have developed metadata to describe genomes; publishers have a metadata format to facilitate the transfer of promotion and price data to retailers. What is happening in the world of technology that is leading everyone to believe that metadata is the answer? Alternatively, if metadata is the answer, what is the question, and what does it mean for libraries and library catalogs?

DEFINING METADATA

First we have to define what it is that we mean by metadata. The common definition is that metadata is “data about data.” This definition is catchy, but it does not help us understand what metadata is all about. What follows is much less catchy but it does provide us with a way to understand metadata. To begin with, metadata is constructed information, which means that it is of human invention and not found in nature. A good example of constructed information is the use of longitude and latitude to describe the earth and points thereon. The real planet obviously does not have lines going around it, although we are by now very accustomed to seeing maps and globes with them, but the invention of longitude and latitude allows us to talk about locations on the planet and to navigate precisely across vast expanses with no landmarks to guide us.

This leads us to a second necessary characteristic of metadata: metadata is developed by people for a purpose or a function. So a map of a subway system that is handed out to riders uses color coding of routes and symbols to guide the riders through the maze of routes and transfer points. This map is often only barely representative of the actual scale and geography of the city that is served by the subway, but it is useful precisely because it emphasizes a subway-centric view at the expense of geographic accuracy. A road map of the same area would be more true to geography, but if that map were designed by the tourist board it would highlight hotels, museums, points of interest, and parking opportunities. A map of an area used by the hiking club would emphasize topography and natural landmarks. Just as there is no single kind of map that serves all needs, there is no one kind of metadata for documents or other information objects. This is because it is not the object itself that determines the metadata but the needs and purposes of the people who create it and those whom it will serve. Without getting too metaphysical, metadata is not the world, it is how we see the world at some moment in time for some purpose.

Metadata is also often used as a surrogate for the real thing. In a library catalog, the entries are surrogates for the books on the shelves. While it would be hard for library users to look at each book to determine which one they want, at least the physical book is there. In the digital environment, the surrogate role of metadata is key because many resources are not easily browsable and others do not carry clear data about themselves. The rise in interest in metadata is part of the effort to organize our rather messy world of digital resources and to provide access and services where none existed before. It is also a way to exchange data between disparate stores of resources and to allow searching across digital warehouses.

XML AND RDF

Two acronyms that you will hear alongside any mention of metadata are XML and RDF. XML is the eXtensible Markup Language² and RDF is the Resource Description Framework.³ Some people speak of XML and RDF as if they are themselves metadata formats, but this is a confusion between form and content. Both XML and RDF are actually general data formats that can be used for any number of applications. In particular, XML is often used as a document format and is closely related to HTML: both descend from SGML, and XHTML is HTML recast as XML.

If you are unfamiliar with the record structure of XML it may seem fairly complex and mysterious. In fact in its basic form it is very simple, although it is possible to create complicated data records with it. If you think of the MARC record as having fields with tags, such as this use of “245” to mean “title”:

245 $a Hamlet, Prince of Denmark

then XML is just another way to tag a piece of data, although it insists on putting a beginning tag and an ending tag (with a “/” before the tag name) around each data element:

<title>Hamlet, Prince of Denmark</title>


The tags can be anything you would like them to be, as long as you predefine them in a data format definition structure. So if you prefer, your definition could have any of these for a title:

<245>Hamlet, Prince of Denmark</245>

<ti>Hamlet, Prince of Denmark</ti>
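One caveat on tag names is worth checking before settling on a format. The sketch below uses Python's standard `xml.etree.ElementTree` module; the choice of Python and the helper name `is_well_formed` are assumptions made for illustration, not part of the article. XML names may not begin with a digit, so a parser accepts `<title>` or `<ti>` but rejects a literal `<245>`. A prefixed name such as `<marc245>` (a hypothetical choice) avoids the problem.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


samples = [
    "<title>Hamlet, Prince of Denmark</title>",
    "<ti>Hamlet, Prince of Denmark</ti>",
    "<245>Hamlet, Prince of Denmark</245>",          # name starts with a digit
    "<marc245>Hamlet, Prince of Denmark</marc245>",  # hypothetical prefixed name
]

# Print the parser's verdict next to each sample.
for text in samples:
    print(is_well_formed(text), text)
```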

XML, like the MARC tags and subfields, is essentially hierarchical. Its advantage over MARC21 is that it can have as many hierarchical levels as are necessary, as opposed to MARC21’s two levels of tag and subfield. In XML the hierarchies are “nested” like Russian dolls to whatever level is needed.
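The nesting can be made concrete with a short sketch. The record below is hypothetical (its element names come from no real schema), and the `walk` helper is an illustrative name; the point is only that the hierarchy can go deeper than MARC21's two levels.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element names are illustrative, not from a real schema.
record = """
<record>
  <title>
    <main>Hamlet, Prince of Denmark</main>
    <part>
      <number>Act 1</number>
    </part>
  </title>
</record>
"""


def walk(element, depth=0):
    """Yield (depth, tag, text) for an element and all of its descendants."""
    text = (element.text or "").strip()
    yield depth, element.tag, text
    for child in element:
        yield from walk(child, depth + 1)


rows = list(walk(ET.fromstring(record)))

# Indent each tag by its nesting level to show the "Russian doll" structure.
for depth, tag, text in rows:
    print("  " * depth + tag + (": " + text if text else ""))
```

Here the deepest element sits three levels below the root, one more than a tag-and-subfield structure can express directly.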

The Resource Description Framework (RDF) is a step or two beyond XML. RDF emphasizes the relationships between data elements. A key relationship in RDF is “about,” where a Web resource is the thing being described by the RDF statement (its subject, in RDF terms), and other fields in the statement are about that resource. That is the simplest case. RDF can also make use of relationships such as:

subClassOf

subPropertyOf

member

isDefinedBy

and others. RDF is a necessary component of the effort called the “semantic Web,”⁴ an effort of the World Wide Web Consortium to add a semantic component to the sharing of data over the Internet. RDF is more complex than XML and less widely used, and it is not yet clear whether it will succeed as a general language to describe the world of the Web. It definitely seems to require a deeper understanding of certain philosophical concepts than does XML, and the number of people who find it inherently puzzling (and I am in that group) is much greater than the number who see it as a solution. (The example below of a Creative Commons record uses a simple RDF format.)
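A rough way to see what relationships such as subClassOf buy you is to model statements as plain (subject, predicate, object) tuples. This is a sketch in ordinary Python, not an RDF library; the `ex:` names and both helper functions are hypothetical, while `rdfs:subClassOf` and `rdf:type` are the usual prefixed spellings of the RDF relationships.

```python
# RDF statements modelled as plain (subject, predicate, object) tuples.
# The "ex:" names are hypothetical examples.
triples = {
    ("ex:Play", "rdfs:subClassOf", "ex:Book"),
    ("ex:Book", "rdfs:subClassOf", "ex:Document"),
    ("ex:hamlet", "rdf:type", "ex:Play"),
    ("ex:hamlet", "dc:title", "Hamlet, Prince of Denmark"),
}


def superclasses(cls, statements):
    """Return every class reachable from cls by following subClassOf."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in statements:
            if s == current and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def types_of(resource, statements):
    """Return the declared types of a resource plus their superclasses."""
    declared = {o for s, p, o in statements if s == resource and p == "rdf:type"}
    inferred = set()
    for cls in declared:
        inferred |= superclasses(cls, statements)
    return declared | inferred


print(sorted(types_of("ex:hamlet", triples)))
```

Only one type is stated for the resource, yet following the class relationships shows it is also a book and a document. That kind of inference is what the relationship vocabulary is meant to enable.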

METADATA FOR DOCUMENT-LIKE OBJECTS

As librarians, we will primarily work with metadata for documents and document-like objects, although given our line of work we could find ourselves storing, organizing, and providing services around other metadata types such as scientific metadata. But for this article, I will concentrate on metadata that describes documents, with the main question being how is this metadata different from library cataloging? Note that the metadata formats introduced in this article (Dublin Core, MODS, and METS) are only three of many that are in use today, but they are the three most commonly used in digital libraries.

Library cataloging is undoubtedly the sine qua non of document metadata. It can trace its origins back to the mid-1800s with Jewett’s and Panizzi’s rules. It is familiar to just about every moderately educated person in the Anglo-American world. In sheer numbers, instances of library cataloging greatly overwhelm any other metadata scheme being used for books (although possibly not for journal articles). And yet, when developers in Internet applications needed metadata for online documents, they did not adopt the library standard. In fact, the document metadata standard most often found in nonlibrary applications is Dublin Core. To understand why, we need to look at purposes.

Dublin Core

Because the “Dublin” in Dublin Core refers to Dublin, Ohio, the home of OCLC, and because OCLC has been the supporting organization for Dublin Core, you might assume that DC comes from the library tradition. In fact, there was a great deal of effort to separate Dublin Core from the library tradition, and that effort has succeeded for the most part. The purpose of Dublin Core was to provide a simple set of data elements for describing documents and other objects on the Internet. It was to be so simple that anyone could create a record for their own documents. Dublin Core has fifteen “core” elements,⁵ which can be given further detail using some qualifiers. The core elements are very broad, so instead of author, the core has creator, but creator can be further refined to author or composer, etc. I can easily make a Dublin Core record for anything, including this very article which I have not finished yet:

creator = Karen Coyle

title = Understanding Metadata and its Purpose

date = December, 2004

description = The first draft of an article for Journal of Academic Librarianship

subject = metadata

type = text
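A record like this one can travel inside a Web page using HTML's meta element. The sketch below renders the pairs above as meta tags; the `DC.` name prefix follows a commonly used convention for embedding Dublin Core in HTML, but the exact attribute layout and the helper name `to_meta_tags` should be treated as assumptions for illustration.

```python
from html import escape

# The record from the text above, as (element, value) pairs.
record = [
    ("creator", "Karen Coyle"),
    ("title", "Understanding Metadata and its Purpose"),
    ("date", "December, 2004"),
    ("description", "The first draft of an article for Journal of Academic Librarianship"),
    ("subject", "metadata"),
    ("type", "text"),
]


def to_meta_tags(pairs):
    """Render Dublin Core pairs as HTML <meta> elements, one per line."""
    lines = []
    for element, value in pairs:
        lines.append(
            '<meta name="DC.{}" content="{}">'.format(element, escape(value, quote=True))
        )
    return "\n".join(lines)


html_head = to_meta_tags(record)
print(html_head)
```

Values are escaped before being placed in the attribute, so a title containing quotation marks or ampersands would not break the page.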

The hope of Dublin Core was that documents on the Internet would carry their own bibliographic descriptions and therefore would have coded data elements for information such as author, title, and date. In a sense, this represents a very librarian-like point of view, which is that it should be possible to find a document by its author or its title. On the Internet today Dublin Core is indeed heavily used, although it has not resulted in the creation of a catalog of Internet resources. Instead, Dublin Core has become the document description metadata for a variety of Web-based applications. An example of this is the Creative Commons license.

Creative Commons⁶ is both a Web service and a social movement. It was developed by Larry Lessig, a Stanford law professor known for his criticism of the strengthening of copyright law at the expense of the public’s rights to use and re-use the ideas of their predecessors.⁷ In the interest of making it possible for creators to give permission for the use of their works, a small set of licenses was developed that can be easily attached to files on the Internet. These licenses state what uses and reuses are granted by the creator of the work. In addition to the license, the Creative Commons software allows the creator to add a small amount of what librarians would call “descriptive” metadata: creator, title, date, and a short description of the item. These use the Dublin Core data elements creator, title, date, description (coded in the record as “dc:creator,” “dc:title,” etc.) (Fig. 1).

To make use of the Creative Commons license requires no understanding of copyright law or of contracts, and the descriptive elements are ones that nearly anyone can easily understand. In this sense, Dublin Core has achieved its purpose by providing a core of descriptive elements that can be embedded in a variety of Web applications.

March 2005 161


Figure 1. Creative Commons License with Dublin Core Metadata Highlighted

One of the things that makes Dublin Core easy and usable by anyone is that there are no cataloging rules involved. This is something that goes against the grain of library cataloging and it definitely reduces the re-usability of the contents of Dublin Core records. There are descriptions of each data element in the Dublin Core standard so the meaning of the data element is generally defined, but it is equally valid to say “creator = Karen Coyle” as to say “creator = Coyle, Karen.” The advantage to this is that Dublin Core is likely to be useful to a number of different communities and cultures; the obvious disadvantage is that the content of the fields is not uniform across applications, making interoperability a problem.
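The interoperability problem is easy to reproduce. In the sketch below two records describe the same creator in the two equally valid forms, and a plain string comparison fails to connect them. The `name_key` function is a hypothetical and deliberately naive workaround, not a substitute for cataloging rules or authority control.

```python
def name_key(name: str) -> tuple:
    """Reduce a personal name to a sorted tuple of lowercase words.

    A deliberately naive heuristic: it ignores word order and the comma,
    so inverted and direct forms compare equal. It will also wrongly
    merge distinct people whose names share the same words.
    """
    words = name.replace(",", " ").lower().split()
    return tuple(sorted(words))


records = [
    {"creator": "Karen Coyle"},
    {"creator": "Coyle, Karen"},
]

# Exact comparison treats the two forms as different creators.
exact_match = records[0]["creator"] == records[1]["creator"]

# The heuristic key treats them as the same creator.
loose_match = name_key(records[0]["creator"]) == name_key(records[1]["creator"])

print(exact_match, loose_match)
```

Every application that consumes such records has to invent its own version of this guesswork, which is the cost of having no shared rules for field content.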

MODS: The Kinder, Gentler MARC

The MARC format is a highly sophisticated record for encoding bibliographic information. It is well known in the library world and supported by library systems in the United States, Canada, and other countries, especially in the English-speaking world. In the networked environment where descriptive metadata can be transferred across systems and can be included in or with other kinds of metadata, it would seem to be ideal to use MARC records for this purpose. The problem for MARC, however, is that this embedding generally requires the use of the XML data structure, and MARC is not an XML record. The Library of Congress has created a way to translate the MARC record to XML, but it has not gained many enthusiasts, and probably for good reason: the MARC record is larger and more detailed than most systems need, and its use of numeric tags and subfield codes makes it hard to understand without considerable training. What was needed was a kinder, gentler version of MARC that could accept the key data elements from a MARC record and transmit them in an easy-to-understand XML format. So the Metadata Object Description Standard (MODS) was born.

MODS uses human-understandable tags in place of the three-digit tags and subfield codes of MARC (e.g., “title” instead of “245”). It ignores most of the fixed field data elements, with the exception of the physical format codes (from the 007) and the many codes for genre (from the 008). It also introduces some efficiencies and some innovations. MODS defines a structure called “name” that represents the fields and subfields for personal and corporate names and for conferences. This structure can be used anywhere that names would appear, either as main entries, added entries, or subjects. So a name field like:

<name type="personal">
  <namePart>Shakespeare, William</namePart>
  <namePart type="date">1564–1616</namePart>
</name>

can be used as an author field, or it can become part of a subject heading:

<subject authority="lcsh">
  <name type="personal">
    <namePart>Shakespeare, William</namePart>
    <namePart type="date">1564–1616</namePart>
  </name>
  <topic>Bibliography</topic>
  <topic>Periodicals</topic>
</subject>
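The reuse of the name structure can be sketched with Python's standard `xml.etree.ElementTree`. This is a simplified illustration under stated assumptions: it omits the MODS namespace and everything else a schema-valid record would need, and it reproduces only the elements shown in the two examples above.

```python
import copy
import xml.etree.ElementTree as ET

# Build the <name> structure once.
name = ET.Element("name", {"type": "personal"})
ET.SubElement(name, "namePart").text = "Shakespeare, William"
ET.SubElement(name, "namePart", {"type": "date"}).text = "1564–1616"

# Use it as an author field ...
mods = ET.Element("mods")
mods.append(name)

# ... and reuse a copy of it inside a subject heading.
subject = ET.SubElement(mods, "subject", {"authority": "lcsh"})
subject.append(copy.deepcopy(name))
ET.SubElement(subject, "topic").text = "Bibliography"
ET.SubElement(subject, "topic").text = "Periodicals"

xml_text = ET.tostring(mods, encoding="unicode")
print(xml_text)
```

The same structure appears twice in the output, once directly under the record and once inside the subject, without any separate definition for "name as subject."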

Although it is derived from MARC21 and is much more detailed than Dublin Core, MODS has far fewer rules than MARC21. As in Dublin Core, there are no required fields and all fields are repeatable. MODS carries over many values from MARC, but it also makes radical departures from MARC21: there are no “main entry” or “added entry” concepts, all authors are simply authors; and a record can have multiple titles without a single “main title.” When MARC21 records are translated to MODS, you get a record in XML that is a kind of “MARC-lite.” MODS records can also be created from bibliographic metadata that did not originate as library cataloging, such as article citations, and MODS is often used in databases that will have a mixture of library cataloging and other bibliographic data.

METS—Metadata as Structure

There is document metadata whose purpose is not “description” in the cataloging sense of that term. One example is a metadata format that is being used by digital libraries and archives called Metadata Encoding and Transmission Standard (METS).⁸ METS refers to its role as that of a “wrapper” and it serves to hold together the files that make up a digital object. Unlike a bound book, digital documents are often made up of a number of separate files representing pages or other units. And unlike a physical book, there is no visible cover or title page, nor can one thumb through the pages to find a particular place in the book. Think of METS as the binding, cover and navigation for a group of digital files. It also includes technical information that will be needed to manage and understand those files, such as the file formats, the technology used in scanning if the item began its life on paper, and the digital transformations and compression that have been used on the files. What METS does not define is the descriptive metadata. Instead, it allows those creating the METS records to embed whatever descriptive metadata they wish to use for those materials. This illustrates an important characteristic of the world of metadata, which we also saw in the Creative Commons example: metadata can be reused rather than reinvented. METS records routinely carry descriptive metadata in Dublin Core or in MODS.
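The wrapper idea can be sketched as follows. The element and attribute names are taken from the METS schema as I understand it, but this is a heavily simplified illustration, not a valid METS document: namespaces are omitted (so the file location is a plain href attribute), and the page image URLs and identifiers are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page images for one scanned item.
pages = [
    "https://example.org/scans/page-001.tif",
    "https://example.org/scans/page-002.tif",
    "https://example.org/scans/page-003.tif",
]

mets = ET.Element("mets")

# Descriptive metadata is embedded, not defined by METS itself.
dmd = ET.SubElement(mets, "dmdSec", {"ID": "DMD1"})
wrap = ET.SubElement(dmd, "mdWrap", {"MDTYPE": "DC"})
xml_data = ET.SubElement(wrap, "xmlData")
ET.SubElement(xml_data, "title").text = "Hamlet, Prince of Denmark"

# The file section lists the files and their formats.
file_grp = ET.SubElement(ET.SubElement(mets, "fileSec"), "fileGrp")
for i, url in enumerate(pages, start=1):
    f = ET.SubElement(file_grp, "file", {"ID": f"FILE{i}", "MIMETYPE": "image/tiff"})
    ET.SubElement(f, "FLocat", {"LOCTYPE": "URL", "href": url})

# The structural map supplies the "binding": page order and navigation.
struct = ET.SubElement(mets, "structMap")
book = ET.SubElement(struct, "div", {"TYPE": "book"})
for i in range(1, len(pages) + 1):
    page = ET.SubElement(book, "div", {"TYPE": "page", "ORDER": str(i)})
    ET.SubElement(page, "fptr", {"FILEID": f"FILE{i}"})

# Every page in the structural map should point at a listed file.
file_ids = [f.get("ID") for f in mets.iter("file")]
pointers = [p.get("FILEID") for p in mets.iter("fptr")]
print(file_ids == pointers)
```

The three parts correspond to the roles described above: the embedded description is the cover, the file list carries the technical information, and the structural map is the binding that puts the pages in order.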

METADATA AND LIBRARY CATALOGING

So what does all of this have to do with library cataloging, and most importantly, will metadata replace cataloging? Above I said that one of the main problems with the Dublin Core record is that it lacks cataloging rules and therefore there is little predictability between communities or projects in terms of the content of the fields. What library cataloging and catalogs provide is a high degree of conformity in the data captured in the records. This conformity is a service to users, who can move from one library to another comfortably. But the main value of the conformity is our ability to catalog cooperatively and exchange cataloging records between libraries and library systems. It also allows library systems vendors to create a product that can be used in any library, just as the standard sized catalog card could fit into any card catalog drawer.

The efficiencies that result from this conformity are enormous and the library community depends on this for the cataloging of its primary materials. But as libraries move into the organization of less traditional materials, neither the cataloging rules nor the library systems provide workable solutions. Imagine that you have an archive that has photographs of your city from the early 20th century, and you’d like to make these available on the Web. And let’s say that you have about one thousand of them. For most of them, you have no idea of the photographer, and often no date. Someone in the past has penciled on the back what the photograph represents, e.g., “Main Street, circa 1910.” To catalog and produce MARC21 records of these photographs would be very time consuming, and the resulting records would have little information. Instead, you can create a Dublin Core record that simply has the following:

date = circa 1910

description = Main Street

This record cannot be entered into your online catalog, although records like this can be targets of metasearch technologies that allow a single search to go against multiple databases with different metadata formats. The main advantage is that these records could be quickly and easily created by library staff with a minimum of training, and therefore some metadata could be created for resources that otherwise would get none.
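A metasearch of this kind can be sketched in a few lines. Everything here is a hypothetical simplification: the two in-memory stores, the flattened MARC-like field names (“245a” for title, “260c” for date), the shared field names “text” and “date,” and the crosswalk that maps one onto the other. Real metasearch products work against remote databases and are considerably more involved.

```python
# Two hypothetical stores with different metadata formats.
dublin_core_store = [
    {"date": "circa 1910", "description": "Main Street"},
    {"date": "circa 1925", "description": "City Hall"},
]
marc_like_store = [
    {"245a": "Main Street and its merchants", "260c": "1987"},
]

# A crosswalk maps each format's fields onto shared search fields.
# The shared names ("text", "date") are assumptions for this sketch.
crosswalks = {
    "dc": {"description": "text", "date": "date"},
    "marc": {"245a": "text", "260c": "date"},
}


def normalize(record, fmt):
    """Rename a record's fields to the shared names; drop unmapped fields."""
    mapping = crosswalks[fmt]
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def metasearch(term, stores):
    """Search every store for term (case-insensitive) in the shared 'text' field."""
    hits = []
    for fmt, records in stores:
        for record in records:
            common = normalize(record, fmt)
            if term.lower() in common.get("text", "").lower():
                hits.append((fmt, common))
    return hits


results = metasearch(
    "main street",
    [("dc", dublin_core_store), ("marc", marc_like_store)],
)
for fmt, common in results:
    print(fmt, common)
```

The two-field photograph record is found alongside the fuller catalog record, which is the practical payoff of creating even minimal metadata.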

Metadata like Dublin Core lacks the level of predictability that would allow for a broad systematic re-use of the records. In fact, these metadata formats, and those other data formats that use them, are often used in ad hoc and stand-alone systems. As these ad hoc systems begin to exchange data, much as libraries began to in the late 19th century, developers may indeed come to the conclusion that it is the content of the metadata records, not their record structure, that makes the difference between a single-system solution and a coherent bibliographic universe. In other words, we may see that when metadata grows up, it becomes cataloging.

NOTES AND REFERENCES

1. This quip is alternately attributed to Tom Delsey, of the National Library of Canada (“Metadata: Cataloguing for Men”), and Michael Gorman (“. . . metadata is cataloging done by men.”).

2. The XML standard is defined by the World Wide Web Consortium (at http://www.w3.org/XML/), but many XML applications standards are defined by other groups, such as the e-business standards group, OASIS (http://www.oasis-open.org/).

3. http://www.w3.org/RDF/.

4. http://www.w3.org/2001/sw/.

5. The 15 Dublin Core elements are as follows: Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, Type. For more information, see http://dublincore.org.

6. http://www.creativecommons.org.

7. Lawrence Lessig is the author of Code and Other Laws of Cyberspace (New York: Basic Books, c1999); The Future Of Ideas: The Fate Of The Commons In A Connected World (New York: Random House, 2001); Free Culture: How Big Media Uses Technology And The Law To Lock Down Culture And Control Creativity (New York: Penguin Press, 2004).

8. http://www.loc.gov/standards/mets/.
