metadata harvesting

Metadata Harvesting Interoperable digital collections

Upload: gareth-kaufman

Post on 03-Jan-2016


TRANSCRIPT

Page 1: Metadata Harvesting

Metadata Harvesting

Interoperable digital collections

Page 2: Metadata Harvesting

Distributed libraries

• The reality in most digital libraries is that no one location has all the materials that may be of interest.

• It is often more efficient to allow a number of sites each to retain some of the materials.

• How can we assure clients that they will see all relevant resources, regardless of which library they search?

Page 3: Metadata Harvesting

Two basic approaches

• One service provider with access to resources stored in multiple locations
  – Information about all the resources is located at the service provider.
  – Services (DL scenarios) use the information to provide connections to resources at multiple locations

• Distributed services
  – Information kept with the resources
  – Services, local to each collection, interact with other collection sites

Page 4: Metadata Harvesting

Two protocols

• Z39.50
  – Developed before the web
  – Protocol for communicating with collection holders in order to provide services.

• Open Archives Initiative
  – Recent innovation
  – Central service provider gathers information from collection holders

Page 5: Metadata Harvesting

Z39.50 - briefly

• Information Retrieval Service Definition and Protocol Specifications for Library Applications

• Initially developed over the OSI network standards

• Protocol for information exchange
  – Free the information seeker from the need to know the details of the target database configuration

• Each site provides services
  – Each service queries remote sites for needed information

• Information requests mapped to database queries at the collection site.

• Some inconsistency in the interpretation of queries.

Page 6: Metadata Harvesting

Distributed Resources, Multiple Services

[Diagram: one service provider (search, browse, compare, etc.) connected to five data providers]

Approach 1: One service provider gathers information about the data and uses it to provide services.

Page 7: Metadata Harvesting

Distributed data and services

Approach 2: Each system is both a data repository and a service provider. Services query other data providers as needed.

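[Diagram: peer systems, each holding data and offering its own services, for example search and browse at one site and search, browse, and compare at another]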

Page 8: Metadata Harvesting

Hybrid systems

[Diagram: one service provider (search, browse, compare, etc.) connected to five data providers]

Each server is likely to have its own clients. The difference is whether the information exchange is periodic or ad hoc.

Page 9: Metadata Harvesting

Open Archives Initiative (OAI)

• Web-based
  – Uses HTTP to communicate between sites

• Centralized server
  – Services provided from a site that has already gathered the information it needs for those services from a distributed collection of sites.

Page 10: Metadata Harvesting

Z39.50

• Special purpose protocol (machine to machine, not web interface)

• Gathers information when it is requested, not on a scheduled basis.

Page 11: Metadata Harvesting

OAI Compared to Z39.50

                        Z39.50            OAI
Content (Objects)       Distributed       Distributed
World View              Bibliographic     Bibliographic
Object Presentation     Data provider     Data provider
Searching is            Distributed       Centralized
Search done by          Data provider     Service provider
Metadata searched is    Up to date        Stale
Semantic Mapping        When searching    Metadata delivery

Source: oai.grainger.uiuc.edu/FinalReport/JCDL_2003_OAI_Intro.ppt

Page 12: Metadata Harvesting

Open Archives Initiative Protocol for Metadata Harvesting -- OAI-PMH

[Diagram: the Harvester (Service Provider) sends an HTTP request (OAI verb) to the Repository (Metadata Provider); the Repository returns an HTTP response (XML)]

• OAI-PMH defines an interface between the Harvester and any number of Repositories

• Implemented as CGI, ASP, PHP, or other

• Any system may serve as a harvester, repository, or both

Page 13: Metadata Harvesting

OAI components

• Service Providers and Data Providers

• Requests and Responses

http://www.oaforum.org/tutorial/english/page3.htm#section3

Page 14: Metadata Harvesting

Records

• Metadata of a resource.

• Three parts
  – Header (required)
    • Identifier (required: 1 only)
    • Datestamp (required: 1 only)
    • setSpec elements (optional: 0, 1, or more)
    • Status attribute for deleted item
  – Metadata (required)
    • XML encoded metadata with root tag, namespace
    • Repositories must support Dublin Core, other formats optional
  – “About” statement (optional)
    • Rights statements
    • Provenance statements

Page 15: Metadata Harvesting

Identifiers

• Globally unique identifier

• Valid URI
  – Examples
    • oai:<archiveId>:<recordId>
    • oai:etd.vt.edu:etd-1234567890
  – Must resolve to one item
    • No duplicates
    • No reuse of previously used identifiers
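The identifier rules above can be sketched in a few lines of Python. This is only an illustration, assuming a repository follows the oai:<archiveId>:<recordId> pattern shown in the examples; the names parse_oai_identifier and find_duplicates are hypothetical helpers, not part of any OAI library.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class OaiIdentifier(NamedTuple):
    archive_id: str
    record_id: str


def parse_oai_identifier(identifier: str) -> OaiIdentifier:
    """Split 'oai:<archiveId>:<recordId>' into its two parts."""
    scheme, sep, rest = identifier.partition(":")
    if scheme != "oai" or not sep:
        raise ValueError(f"not an oai: identifier: {identifier!r}")
    # The record part may itself contain ':'; only the first one delimits.
    archive_id, sep, record_id = rest.partition(":")
    if not sep or not archive_id or not record_id:
        raise ValueError(f"expected oai:<archiveId>:<recordId>: {identifier!r}")
    return OaiIdentifier(archive_id, record_id)


def find_duplicates(identifiers: Iterable[str]) -> List[str]:
    """Return identifiers that occur more than once (the 'no duplicates' rule)."""
    counts = Counter(identifiers)
    return sorted(name for name, seen in counts.items() if seen > 1)


parsed = parse_oai_identifier("oai:etd.vt.edu:etd-1234567890")
print(parsed.archive_id, parsed.record_id)
```

A repository would typically run a check like find_duplicates before exposing its records, since a harvester relies on each identifier resolving to exactly one item.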

Page 16: Metadata Harvesting

Datestamps

• Date of last modification of a record
  – Used only for harvesting (meta metadata?)

• Mandatory for each item in the repository

• Two levels of granularity possible
  – YYYY-MM-DD
  – YYYY-MM-DDThh:mm:ssZ
    • T separates the date from the time; Z marks the time zone, which must be UTC (GMT)

• Allows harvesting incrementally -- get only what is new since last visit
  – Accessed by arguments from and until
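As a rough sketch of how a harvester might produce the two datestamp granularities and turn them into from and until arguments, consider the following Python. The function names are hypothetical, and the choice of ListRecords with the oai_dc prefix is an assumption made only for the example.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

DAY = "YYYY-MM-DD"
SECONDS = "YYYY-MM-DDThh:mm:ssZ"


def format_datestamp(moment: datetime, granularity: str = DAY) -> str:
    """Render a timezone-aware datetime as a UTC datestamp."""
    if moment.tzinfo is None:
        raise ValueError("a datestamp needs a timezone-aware datetime")
    utc = moment.astimezone(timezone.utc)  # datestamps are always in UTC
    if granularity == DAY:
        return utc.strftime("%Y-%m-%d")
    if granularity == SECONDS:
        return utc.strftime("%Y-%m-%dT%H:%M:%SZ")
    raise ValueError(f"unknown granularity: {granularity!r}")


def incremental_query(last_visit: datetime, now: datetime,
                      granularity: str = DAY) -> str:
    """Query string asking only for records changed since the last visit."""
    return urlencode({
        "verb": "ListRecords",
        "metadataPrefix": "oai_dc",  # assumed format for this example
        "from": format_datestamp(last_visit, granularity),
        "until": format_datestamp(now, granularity),
    })


# A local time two hours ahead of UTC is converted before formatting.
local = datetime(2006, 10, 17, 3, 37, 44, tzinfo=timezone(timedelta(hours=2)))
print(format_datestamp(local, SECONDS))
```

A harvester should use the granularity the repository advertises in its Identify response; sending the finer form to a day-granularity repository is likely to be rejected as a bad argument.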

Page 17: Metadata Harvesting

The OAI-PMH verbs

• Each requests a specific response from a data repository
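The six verbs covered on the following slides all travel as ordinary HTTP query parameters, so a request can be assembled with a small table of allowed arguments. The sketch below is one possible way to do that in Python; build_request and the argument table are hypothetical, and the base URL is borrowed from the slide examples.

```python
from urllib.parse import urlencode

# verb -> (required arguments, optional arguments), as listed on the slides
VERB_ARGUMENTS = {
    "Identify": ((), ()),
    "ListMetadataFormats": ((), ("identifier",)),
    "ListSets": ((), ()),
    "ListIdentifiers": (("metadataPrefix",), ("from", "until", "set")),
    "ListRecords": (("metadataPrefix",), ("from", "until", "set")),
    "GetRecord": (("identifier", "metadataPrefix"), ()),
}
# Verbs whose responses may be split into pages with a resumptionToken
RESUMABLE = {"ListSets", "ListIdentifiers", "ListRecords"}


def build_request(base_url: str, verb: str, **arguments: str) -> str:
    """Return the request URL for a verb, checking its arguments first."""
    if verb not in VERB_ARGUMENTS:
        raise ValueError(f"unknown verb: {verb}")
    if "resumptionToken" in arguments:
        # "Exclusive" means the token may not be combined with anything else.
        if verb not in RESUMABLE or len(arguments) != 1:
            raise ValueError("resumptionToken must be the only argument")
    else:
        required, optional = VERB_ARGUMENTS[verb]
        missing = [name for name in required if name not in arguments]
        unknown = [name for name in arguments if name not in required + optional]
        if missing or unknown:
            raise ValueError(f"missing={missing} unknown={unknown}")
    return base_url + "?" + urlencode({"verb": verb, **arguments})


# 'from' is a Python keyword, so it has to be passed through a dict.
print(build_request("http://archive.org/oai-script", "ListIdentifiers",
                    metadataPrefix="oai_dc", **{"from": "2002-12-01"}))
```

Checking arguments on the client side is optional, since a repository reports a badArgument error anyway, but it catches mistakes before a network round trip.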

Page 18: Metadata Harvesting

Identify

• Function: Description of the archive

• Example: http://www.language-archives.org/cgi-bin/olaca3.pl?verb=Identify

• Parameters: none

• Errors/exceptions:
  – badArgument (the request should not carry any arguments)

• Response format:

Element              Example                               Ordinality ‡
repositoryName       My Archive                            1
baseURL              http://archive.org/oai                1
protocolVersion      2.0                                   1
earliestDatestamp    1999-01-01                            1
deletedRecord        no, transient, persistent             1
granularity          YYYY-MM-DD, YYYY-MM-DDThh:mm:ssZ      1
adminEmail           [email protected]                     +
compression          deflate, compress                     *
description          oai-identifier, eprints, friends, …   *

‡ Ordinality: 1 = mandatory, 1 only; + = mandatory, 1 or more; * = optional, 0 or more

Page 19: Metadata Harvesting

<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2006-10-17T01:37:44Z</responseDate>
  <request verb="Identify">http://www.language-archives.org/cgi-bin/olaca3.pl</request>
  <Identify>
    <repositoryName>OLAC Aggregator</repositoryName>
    <baseURL>http://www.language-archives.org/cgi-bin/olaca3.pl</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>mailto:[email protected]</adminEmail>
    <earliestDatestamp>2002-12-14</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
    <!-- maybe later <compression>identity</compression> -->

Actual response from http://www.language-archives.org/cgi-bin/olaca3.pl?verb=Identify

Continued

Page 20: Metadata Harvesting

    <description>
      <oai-identifier xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
        <scheme>oai</scheme>
        <repositoryIdentifier>OLACA.language-archives.org</repositoryIdentifier>
        <delimiter>:</delimiter>
        <sampleIdentifier>oai:ethnologue.com:aaa</sampleIdentifier>
      </oai-identifier>
    </description>

Continued

Page 21: Metadata Harvesting

    <description>
      <olac-archive type="institutional" xsi:schemaLocation="http://www.language-archives.org/OLAC/1.0/olac-archive http://www.language-archives.org/OLAC/1.0/olac-archive.xsd">
        <archiveURL>http://www.language-archives.org:8082/dp9/</archiveURL>
        <curator>Steven Bird & Gary Simons</curator>
        <curatorTitle>Coordinators</curatorTitle>
        <curatorEmail>mailto:[email protected]</curatorEmail>
        <institution>Open Language Archives Community</institution>
        <institutionURL>http://www.language-archives.org/</institutionURL>
        <shortLocation>Philadelphia, U.S.A.</shortLocation>
        <location/>
        <synopsis>This repository contains all records from OLAC-registered archives. It is intended to be used by services which do not want to harvest individual OLAC archives.</synopsis>
        <access>Metadata may be used only subject to the access permissions given by the individual archives.</access>
      </olac-archive>
    </description>
  </Identify>
</OAI-PMH>
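A harvester usually reads the Identify response before anything else, mainly to learn the repository's granularity and deleted-record policy. The following Python sketch shows one way to pull those fields out with the standard library. It assumes the response declares the OAI-PMH 2.0 namespace (the browser rendering above hides it), uses a shortened copy of the response as sample data, and leaves out adminEmail because the address is not recoverable from the slide; parse_identify is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Shortened sample, with the namespace declaration restored (an assumption).
SAMPLE_IDENTIFY = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2006-10-17T01:37:44Z</responseDate>
  <request verb="Identify">http://www.language-archives.org/cgi-bin/olaca3.pl</request>
  <Identify>
    <repositoryName>OLAC Aggregator</repositoryName>
    <baseURL>http://www.language-archives.org/cgi-bin/olaca3.pl</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <earliestDatestamp>2002-12-14</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>
""".strip()

SINGLE = ("repositoryName", "baseURL", "protocolVersion",
          "earliestDatestamp", "deletedRecord", "granularity")
REPEATABLE = ("adminEmail", "compression")


def parse_identify(xml_text: str) -> dict:
    """Collect the top-level Identify fields into a plain dictionary."""
    root = ET.fromstring(xml_text)
    identify = root.find("oai:Identify", OAI_NS)
    if identify is None:
        raise ValueError("response has no Identify element")
    info = {}
    for name in SINGLE:
        # None when the element is absent from the response
        info[name] = identify.findtext(f"oai:{name}", namespaces=OAI_NS)
    for name in REPEATABLE:
        # Elements that may repeat are returned as lists
        info[name] = [e.text for e in identify.findall(f"oai:{name}", OAI_NS)]
    return info


info = parse_identify(SAMPLE_IDENTIFY)
print(info["repositoryName"], info["granularity"], info["deletedRecord"])
```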

Page 22: Metadata Harvesting

ListMetadataFormats

• Function: retrieve available metadata formats from archive

• Example: archive.org/oai-script?verb=ListMetadataFormats&identifier=oai:HUBerlin.de:3000218

• Parameters: identifier (optional)

• Errors/exceptions:
  – badArgument
  – idDoesNotExist
  – noMetadataFormats

Page 23: Metadata Harvesting

<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2006-10-17T01:58:06Z</responseDate>
  <request verb="ListMetadataFormats">http://www.language-archives.org/cgi-bin/olaca3.pl</request>
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>olac</metadataPrefix>
      <schema>http://www.language-archives.org/OLAC/1.0/olac.xsd</schema>
      <metadataNamespace>http://www.language-archives.org/OLAC/1.0/</metadataNamespace>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>olac_display</metadataPrefix>
      <schema>http://www.language-archives.org/OLAC/1.0/olac.xsd</schema>
      <metadataNamespace>http://www.language-archives.org/OLAC/1.0/</metadataNamespace>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>

Response to http://www.language-archives.org/cgi-bin/olaca3.pl?verb=ListMetadataFormats
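One practical use of this response is picking a metadataPrefix that both sides understand. The Python below is a small sketch of that step, assuming the response carries the OAI-PMH 2.0 namespace; the sample is trimmed to the prefixes only, and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Trimmed sample: only the metadataPrefix of each format is kept.
SAMPLE_FORMATS = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat><metadataPrefix>olac</metadataPrefix></metadataFormat>
    <metadataFormat><metadataPrefix>olac_display</metadataPrefix></metadataFormat>
    <metadataFormat><metadataPrefix>oai_dc</metadataPrefix></metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>
""".strip()


def metadata_prefixes(xml_text: str) -> list:
    """List the metadataPrefix values a repository advertises, in order."""
    root = ET.fromstring(xml_text)
    path = f"{OAI}ListMetadataFormats/{OAI}metadataFormat/{OAI}metadataPrefix"
    return [element.text for element in root.findall(path)]


def choose_prefix(available: list, preferred=("olac", "oai_dc")) -> str:
    """Pick the first format the harvester prefers that the repository offers."""
    for prefix in preferred:
        if prefix in available:
            return prefix
    raise LookupError("no common metadata format")


print(choose_prefix(metadata_prefixes(SAMPLE_FORMATS)))
```

Because every repository must support Dublin Core, keeping oai_dc last in the preference list gives the harvester a fallback that should always be available.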

Page 24: Metadata Harvesting

ListSets

• Function: retrieve set structure of a repository

• Example: archive.org/oai-script?verb=ListSets

• Parameters: resumptionToken (exclusive)

• Errors/exceptions:
  – badArgument
  – badResumptionToken
  – noSetHierarchy

Sets are optional and are used to divide a repository into separate units that will be of interest to different harvesters.

Page 25: Metadata Harvesting

ListIdentifiers

• Function: abbreviated form of ListRecords, retrieves only headers

• Example: archive.org/oai-script?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2002-12-01

• Parameters:
  – from (optional)
  – until (optional)
  – metadataPrefix (required)
  – set (optional)
  – resumptionToken (exclusive)

• Errors/exceptions:
  – badArgument
  – badResumptionToken
  – cannotDisseminateFormat
  – noRecordsMatch
  – noSetHierarchy

Page 26: Metadata Harvesting

ListRecords

• Function: harvest records from a repository

• Example: archive.org/oai-script?verb=ListRecords&metadataPrefix=oai_dc&set=biology

• Parameters:
  – from (optional)
  – until (optional)
  – metadataPrefix (required)
  – set (optional)
  – resumptionToken (exclusive)

• Errors/exceptions:
  – badArgument
  – badResumptionToken
  – cannotDisseminateFormat
  – noRecordsMatch
  – noSetHierarchy
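The "exclusive" resumptionToken is what lets a repository return a large result in pages. A harvesting loop might look like the Python sketch below. It is an illustration under stated assumptions: responses use the OAI-PMH 2.0 namespace, an absent or empty resumptionToken element marks the last page, and the fetch callable (any function that takes the request arguments and returns the response XML as text) is a hypothetical seam that keeps the loop independent of the network.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def http_fetcher(base_url: str) -> Callable[[Dict[str, str]], str]:
    """Build a fetch function that sends the arguments to base_url over HTTP."""
    def fetch(args: Dict[str, str]) -> str:
        with urlopen(base_url + "?" + urlencode(args)) as response:
            return response.read().decode("utf-8")
    return fetch


def harvest_records(fetch: Callable[[Dict[str, str]], str],
                    metadata_prefix: str = "oai_dc",
                    **selectors: str) -> Iterator[ET.Element]:
    """Yield every <record> element, following resumption tokens page by page."""
    args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix, **selectors}
    while True:
        root = ET.fromstring(fetch(args))
        listing = root.find(OAI + "ListRecords")
        if listing is None:
            return  # for example an error response such as noRecordsMatch
        yield from listing.findall(OAI + "record")
        token = (listing.findtext(OAI + "resumptionToken") or "").strip()
        if not token:
            return  # no token, or an empty one: the list is complete
        # Exclusive: the follow-up request carries the verb and the token only.
        args = {"verb": "ListRecords", "resumptionToken": token}
```

Usage would be along the lines of `for record in harvest_records(http_fetcher(base_url), set="biology"): ...`, where base_url is the repository's OAI endpoint. Passing the selectors only on the first request matters, because a repository rejects a token combined with other arguments.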

Page 27: Metadata Harvesting

GetRecord

• Function: retrieve an individual metadata record from a repository

• Example: archive.org/oai-script?verb=GetRecord&identifier=oai:HUBerlin.de:3000218&metadataPrefix=oai_dc

• Parameters:
  – identifier (required)
  – metadataPrefix (required)

• Errors/exceptions:
  – badArgument
  – cannotDisseminateFormat
  – idDoesNotExist
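The error conditions listed for each verb arrive inside a normal XML response, as an error element with a code attribute, rather than as a separate kind of message. The Python sketch below shows one way a harvester might surface them for GetRecord; OaiError and extract_record are hypothetical names, and the sample error text is invented for the example.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


class OaiError(Exception):
    """Raised when a repository answers with an <error> element."""

    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code


def extract_record(xml_text: str) -> ET.Element:
    """Return the <record> of a GetRecord response, or raise OaiError."""
    root = ET.fromstring(xml_text)
    error = root.find(OAI + "error")
    if error is not None:
        raise OaiError(error.get("code", "unknown"), (error.text or "").strip())
    record = root.find(OAI + "GetRecord/" + OAI + "record")
    if record is None:
        raise ValueError("response contains neither a record nor an error")
    return record


# Invented sample of an error response, for illustration only.
SAMPLE_ERROR = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    '<error code="idDoesNotExist">No matching identifier</error>'
    '</OAI-PMH>'
)

try:
    extract_record(SAMPLE_ERROR)
except OaiError as problem:
    print(problem.code)
```

Keeping the code on the exception lets the caller react differently to, say, idDoesNotExist (skip the item) and cannotDisseminateFormat (retry with another metadataPrefix).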

Page 28: Metadata Harvesting
Page 29: Metadata Harvesting
Page 30: Metadata Harvesting

Interoperability

• The goal: communication, without human intervention, between information sources
  – Books that “talk to each other”

• Live links for references

• Knowledge of how to find relevant resources when needed

• Ability to query other information locations

Page 31: Metadata Harvesting

Protocols

• Precise rules for interactions between independent processes
  – Format of the messages
    • Both structure and content
  – Specified behavior in response to specific messages

• Many ways to accomplish the same result, but both sides must have the same understanding of the rules of engagement.

Page 32: Metadata Harvesting

Protocol Types

• RPC model
  – Point to point
  – Completely open to definition by developer
    • Verbs (methods)
    • Nouns (objects, resources)
  – Useful to a closed community or group who know about the availability of the resource.

Page 33: Metadata Harvesting

SOAP

• Originally stood for Simple Object Access Protocol; the expansion of the acronym has since been dropped.

• Initially developed as part of the Microsoft .NET paradigm
  – Now in W3C committee

• Stateless, one-way message exchange paradigm

• XML encoded

• Flexibility of RPC, but more constrained in the way communication is formatted.

Page 34: Metadata Harvesting

REST

• REpresentational State Transfer

• An after-the-fact definition of the architecture of the World Wide Web

• The model is
  – Client/server
  – Stateless
  – Cacheable
  – Layered

• Resource interface constrained
  – Restricted verbs
  – Restricted content types

Page 35: Metadata Harvesting

REST and RPC

• RPC provides flexibility for any type of interaction between any type of resources

• REST provides consistency to allow interaction among resources without prior discovery of accepted actions and responses.

Page 36: Metadata Harvesting

SOAP and REST

• Debate in the Web community about which is the better paradigm for application development

• REST -- restricted, but simple extension of existing Web processes

• SOAP -- added flexibility with cost in terms of bandwidth, security, complexity for development

Page 37: Metadata Harvesting

References

• Giving SOAP a REST: http://www.devx.com/DevX/Article/8155

• SOAP Version 1.2 Part 0: Primer: http://www.w3.org/TR/2003/REC-soap12-part0-20030624/#L1153

• OAI For Beginners - The Open Archives Forum online tutorial: http://www.oaforum.org/tutorial/index.php

• Z39.50 Resource Page: http://www.niso.org/standards/resources/Z3950_Resources.html

• Z39.50 An Overview of Development and the Future (1995): http://www.cqs.washington.edu/~camel/z/z.html