advanced oai-pmh michael l. nelson [email protected] mln/ several slides from herbert van de sompel,...

118
Advanced OAI-PMH Michael L. Nelson [email protected] http://www.cs.odu.edu/~mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu University of Southern California 6/15/04

Upload: bertha-walton

Post on 25-Dec-2015

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Advanced OAI-PMH

Michael L. [email protected]

http://www.cs.odu.edu/~mln/

Several Slides from Herbert Van de Sompel, Simeon Warner and

Xiaoming Liu

University of Southern California6/15/04

Page 2: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Outline• Guidelines, recommendations, best practices

for 2.0 implementations– harvesters, repositories, aggregators, optional

containers

• Novel applications of OAI-PMH• New developments in the OAI-PMH community

– resource harvesting• mod_oai

– static repositories– oai-rights– ERRoLs

Page 3: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Repository Implementation

(see also Repository Implementation Guidelines:http://www.openarchives.org/OAI/2.0/guidelines-repository.htm)

Page 4: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Minimal Repository• 2.0 provides many expressive, but optional

features– but still low barrier!

• if you are writing your own repository software, the quickest path to implementation can involve initially:– only supporting DC– skipping: <about>, sets, compression– skip flow control (resumptionTokens) if < 1000 items

• add optional features as requirements and familiarity allows

Page 5: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Be Honest with datestamp!

• a change in the process of dynamic generation of a metadata format really does mean all records have been updated!– harvester caveat: an incremental harvest could yield

an entire repository dump if all the date stamps change (for example, if the metadata mapping rules change)

if (internalItemDatestamp > disseminationInterfaceDatestamp) { datestamp = internalItemDatestamp} else { datestamp = disseminationInterfaceDatestamp}

Page 6: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Not Hiding Updates

• OAI-PMH is designed to allow incremental harvesting

• Updates must be available by the end of the period of the datestamp assigned, i.e. – Day granularity => during same day– Seconds granularity => during same second

• Reason: harvesters need to overlap requests by just one datestamp interval (one day or one second) – in 1.x, 2 intervals were required (in many

circumstances)

Page 7: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

State in resumptionTokens

• HTTP is stateless• resumptionTokens allow state information to

be passed back to the repository to create a complete list from sequence of incomplete lists

• EITHER – all state in resumptionToken• OR – cache result set in repository

Page 8: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Caching the Result Set

• Repository caches results of initial request, returns only incomplete list

• resumptionToken does not contain all state information, it includes:

– a session id– offset information, necessary for idempotency

• resumptionToken allows repository to return next incomplete list

• increased complexity due to cache management

– but a potential performance win

Page 9: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

All State in the resumptionToken

• Arrange that remaining items/headers in complete list response can be specified with a new query and encode that in resumptionToken

• One simple approach is to return items/headers in id order and make the new query specify the same parameters and the last id return (or by date)– simple to implement, but possibly longer execution times

• Can encode parameters very simply:<resumptionToken>metadataPrefix=oai_dc

from=1999-02-03&until=2002-04-01&

lastid=fghy:45:123</resumptionToken>

Page 10: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

resumptionToken attributes (1)

• expirationDate – likely to be useful when cache clean-up schedule is known– Do not specify expirationDate if all state in

resumptionToken• badResumptionToken error to be used if

resumptionToken expired– May also be used if request cannot be completed for some

other reason• e.g.: if repository changes cause the incomplete list to have no

records– issue badRT’s judiciously; it can invalidate a lot of effort by a

lot of harvesters

Page 11: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

resumptionToken attributes (2)

• completeListSize and cursor optionally provide information about size of complete list and number of records so far disseminated– not (currently) widely used– use consistently if used– designed for status monitoring– caveat harvester: completeListSize may

be approximate and may be revised

Page 12: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

resumptionTokenThe only defined use of resumptionToken is as follows:

•a repository must include a resumptionToken element as part of each response that includes an incomplete list;

•in order to retrieve the next portion of the complete list,  the next request must use the value of that resumptionToken element as the value of the resumptionToken argument of the request;

•the response containing the incomplete list that completes the list must include an empty resumptionToken element;

Page 13: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Flow Control & Load Balancing

How to respond to a “bad” harvester:

1. HTTP status code 200; response to OAI-PMH request with a resumptionToken.

2. HTTP status code 503; with the Retry-After header set to an appropriate value if subsequent request follows too quickly or if the server is heavily loaded.

3. HTTP status code 403; with an appropriate reason specified if subsequent requests do not adhere to Retry-After delays.

Page 14: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

302 Load Balancing• Interactive users on main DL machine should not be

impacted by metadata harvesting– don’t take deliveries through the front door– not part of the protocol; defined outside the protocol

OAIServer

naca.larc.nasa.gov/oai/

if load > 0.50redirect request

OAIServer

buckets.dsi.internet2.edu/naca/oai/

harvesterhttp://blah/oai/?verb=ListIdentifiers&metadataPrefix=oai_dc

HTTP Status Code 302

http://blah/oai/?verb=ListIdentifiers&metadataPrefix=oai_dc

<?xml version=“1.0” encoding=“UTF-8”?>…<ListIdentifiers>…</ListIdentifiers>

Page 15: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

DNS Load Balancing

• using a DNS rotor, establish – a.foo.org, b.foo.org, c.foo.org– each with a synchronized copy of the

repository– let DNS & chance distribute the load– implication: if resumptionTokens could

issued to loosely synchronized servers, it is likely that the rTs will be stateful

Page 16: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Load Balancing Caveats• Copies of the repository must be synchronized

– (cf. Pande, et al. JCDL 02)– depending on synchronization method, be careful not

to hide updates• see Aggregator Guidelines…

• Complex hierarchies are possible– programmer must insure no cycles in redirection

graphs!

• The baseURL in the reply must always point to the original repository, not the repository that eventually answered the request

Page 17: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Error Handling: Verbosity

More is better… <error code="badArgument">Illegal argument ‘foo’</error> <error code="badArgument">Illegal argument ‘bar’</error>

is preferred over:

<error code="badArgument">Illegal arguments ‘foo’, ‘bar’</error>

which is preferred over:

<error code="badArgument">Illegal arguments</error>

Page 18: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Error Handling: Levels

• the OAI-PMH error / exception conditions are for OAI-PMH semantic events

• they are not for situations when:– the database is down– a record is malformed

• remember: record = id + datestamp + metadataPrefix• if you’re missing one of those, you don’t have an OAI record!

– and other conditions that occur outside the OAI scope• use http codes 500, 503 or other appropriate values to

indicate non-OAI problems

Page 19: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Error Handling: Extensions• Arguments that are not 'required', 'optional' or

'exclusive’ are 'illegal' and should generate badArgument errors.

• If you want to extend the OAI-PMH:– stop and consider: do you really need to?

• maybe you should have different OAI-PMH interfaces, or creative metadata formats

– if you really, really want to, tunnel your extensions through the “set” feature

• see http://www.dlib.org/dlib/december01/suleman/12suleman.html for examples

Page 20: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Idempotency of “List” Requests (1)

• Purpose is to allow harvesters to recover from lost responses or crashes without starting a large harvest from scratch

• Recover by re-issuing request using resumptionToken from previous request

• IMPLICATION: harvester must accept both the most recent resumptionToken issued and the previous one

Page 21: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Idempotency of “List” Requests (2)

• response to a re-issued request must contain all unchanged records

• any changed records will get new datestamps after time of initial request

• changes will be picked up by subsequent harvest if not included

[no experience yet with incomplete responses to ListSets or ListMetadataFormats requests]

Page 22: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Case Study: “bucket” based repositories• Buckets: see Nelson & Maly, CACM 44(5)• 2.0

– NTRS - ntrs.nasa.gov/ (MySQL, DC)– LTRS - techreports.larc.nasa.gov/ltrs/oai2.0/ (file system, refer)– NACA - naca.larc.nasa.gov/oai2.0/ (file system, refer)

• 1.1– LTRS - techreports.larc.nasa.gov/ltrs/oai/– NACA - naca.larc.nasa.gov/ltrs/oai/– Open Video - www.open-video.org/oai/ (MySQL, local)– JTRS - ston.jsc.nasa.gov/collections/TRS/oai (MS Access dump, local)– GLTRS (filesystem, HTML scraping)

• Characteristics:– resumptionToken support initially skipped; added later (all)

• highly encoded rT’s: [2001-01-01!!!!301!600] – sets initially skipped, added later (LTRS)– initially had load balancing with 2 NACA repositories…

Page 23: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Case Study: “bucket” based repositories

• in bucket terminology:– 6 OAI verbs (methods) added to the existing list

of methods• http://ntrs.nasa.gov/?method=list_methods• http://ntrs.nasa.gov/?method=list_source&target=ListIdentifiers

– a data element is added to the bucket that contains the specifics of the particular repository and its metadata format

• http://ntrs.nasa.gov/?method=display&pkg_name=oai&element_name=oai.pl

Page 24: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Harvester Implementation and Use

(see also Harvester Implementation Guidelines:http://www.openarchives.org/OAI/2.0/guidelines-repository.htm

)

Page 25: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Be a Polite OAI Neighbor• Re-use existing free harvester software/libraries:

http://www.openarchives.org/tools/index.html

• If you insist on writing your own harvester, read http://www.robotstxt.org/wc/robots.html

• Provide meaningful User-Agent & From headers

– Should be present in HTTP headers of all robot requests

– Should be configured even if using someone else’s harvester

Page 26: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Harvesting Sequence• Issue Identify request

– Check OAI-PMH version– Check baseURL, granularity, compression

• Issue ListMetadataFormats request– Get information regarding selected metadataPrefix

• Issue ListSets request if using sets– Check set structure matches expectation

• Issue ListIdentifier or ListRecords request– Continue until end of complete list

Page 27: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Listen to the Repository• Check Identify’s <granularity> element if you wish to use

finer than YYYY-MM-DD• If you harvest with sets, remember that “:” indicates hierarchy

– harvesting “a” will recursively harvest “a:b”, “a:b:c”, and “a:d”

• Check for and handle non-200 HTTP status codes, 503, 302 and 4xx in particular

• Empty resumptionToken => end of complete list• Ask for compressed responses if the repository supports them

Page 28: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Harvesting Everything• Issue an Identify request to find protocol version, finest

datestamp granularity supported, if compression is supported…• Issue a ListMetadataFormats request to obtain a list of all

metadataPrefixes supported.• Harvest using a ListRecords request for each metadataPrefix

supported. Knowledge of the datestamp granularity allows for less overlap in incremental harvesting if granularities finer than a day are supported.

• Set structure can be inferred from the setSpec elements in the header blocks of each record returned (consistency checks are possible).

• Items may be reconstructed from the constituent records. • Provenance and other information in <about> blocks may be re-

assembled at the item level if it is the same for all metadata formats harvested. However, this information may be supplied differently for different metadata formats and may thus need to be store separately for each metadata format.

Page 29: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Harvesting v1.1 and v2.0

• Not difficult to handle both cases, test Identify response:– v1.1: <Identify>

<protocolVersion>

– v2.0 <OAI-PMH> <Identify> <protocolVersion>

• Different error and exception handling

• Many similarities, harvesters can share lots of code

Page 30: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Harvesting logs• Alan Kent’s v2.0 harvester logs:

http://www.inquirion.com:8123/public/collList;collListCmd=list

• Alan Kent’s summary of v1.1 harvesting results http://www.mds.rmit.edu.au/~ajk/oai/interop/summary.htm

• Celestial harvesting logs http://celestial.eprints.org/cgi-bin/status

• DP9 gateway using Arc harvested informationhttp://arc.cs.odu.edu:8080/dp9/index.jsp

Page 31: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

<friends> example (1)A light-weight, data-provider driven way to

communicate the existence of “others”, e.g. http://ntrs.nasa.gov/?verb=Identify

<description>

<friends …namespace stuff… >

<baseURL>http://naca.larc.nasa.gov/oai2.0</baseURL>

<baseURL>http://ntrs.nasa.gov/oai2.0</baseURL>

<baseURL>http://eprints.riacs.edu/perl/oai/</baseURL>

<baseURL>http://ston.jsc.nasa.gov/collections/TRS/oai/</baseURL>

</friends>

</description>…

Page 32: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

http://techreports.larc.nasa.gov/ltrs/oai2.0/ http://naca.larc.nasa.gov/oai2.0/

http://ntrs.nasa.gov/oai2.0/

http://ston.jsc.nasa.gov/collections/TRS/oai/

http://eprints.riacs.edu/perl/oai/

harvester

Identify

<friends> example (2)

<friends>

Page 33: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Use of <friends>

Page 34: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Aggregator / Cache / Proxy Implementation

(see also Aggregator Implementation Guidelines:

http://www.openarchives.org/OAI/2.0/guidelines-aggregator.htm

)

Page 35: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

<provenance> & datestamps

• Reminder: datestamps are local to the repository, a re-exporting service must use new local datestamps

• Such services should use the <provenance> container to preserve the original datestamps and other information

Page 36: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Identifiers are Local

• Identifiers are local to the repository• Unless you absolutely did not change the

metadata and the identifier corresponds to a recognized URI scheme, use a new identifier upon re-exporting– use the <provenance> container to preserve the

harvesting history

Page 37: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

oai-identifier

• Just one option for identifiers in OAI-PMH• The v2.0 oai-identifier scheme is not

compatible with v1.1:– repositoryName now domain name based– not reliant upon OAI centralized registration

• One-to-one mapping for escaping characters: %3F allowed, %3f not– allows simple comparison

Page 38: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Derived from the same item?

3 different ways to determine if records share provenance from the same item:

1. both records have the same identifier and the baseURL in the request elements of the OAI-PMH reponses which include the record are the same;

2. both records have the same identifier and that identifier belongs to some recognized URI scheme;

3. the provenance containers of both records have the same entries for both the identifier and baseURL;

Page 39: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

<provenance> example (1)

<responseDate>2002-02-08T08:55:46.1</responseDate>

<request verb="GetRecord" metadataPrefix="odd_fmt"

identifier="oai:odd.oa.org:z1x2y3">http://odd.oa.org</request>

<GetRecord ...namespace stuff… <record>

<header>

<identifier>oai:odd.oa.org:z1x2y3</identifier>

<datestamp>1999-08-07T06:05:04Z</datestamp>

</header>

<metadata> …metadata record in odd_fmt… </metadata>

</record>

</GetRecord>

Consider a request from crosswalker.oa.org:http://odd.oa.org?verb=GetRecord &identifier=oai:odd.oa.org:z1x2y3&metadataPrefix=odd_fmt

and the following response from odd.oa.org:

Page 40: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Imagine that crosswalker.oa.org cross-walks harvested metadata from odd_fmt into oai_marc and then re-exposes the metadata with new identifiers.

A request from getmarc.oa.org:

http://crosswalker.oa.org?verb=GetRecord &identifier=oai:cw.oa.org:z1x2y3 &metadataPrefix=oai_marc

might then yield the following response from crosswalker.oa.org:

<provenance> example (2)

Page 41: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

<provenance> example (3)<record> <header> <identifier>oai:cw.oa.org:z1x2y3</identifier> <datestamp>2002-02-09T01:15:43Z</datestamp> </header> <metadata> ...metadata record in oai_marc... </metadata> <about> <provenance …namespace stuff… > <originDescription harvestDate="2002-02-08T08:55:46Z“ altered="true"> <baseURL>http://odd.oa.org</baseURL> <identifier>oai:odd.oa.org:z1x2y3</identifier> <datestamp>1999-08-07T06:05:04Z</datestamp> <metadataNamespace>http://odd.oa.org/odd_fmt</..> </originDescription> </provenance> </about></record>

Page 42: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

<provenance> example (4)

This oai_marc record is then re-exposed by getmarc.oa.org with the same identifier oai:cw.oa.og:z1x2y3 (because the record has not been altered).

The associated <provenance> container might be:

Page 43: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

<provenance> example (5)<record>

<header>

<identifier>oai:cw.oa.org:z1x2y3</identifier>

<datestamp>2002-03-01T01:46:11Z</datestamp>

</header>

<metadata> ...metadata record in oai_marc... </metadata>

<about>

<provenance …namespace stuff…>

<originDescription harvestDate=“2002-03-01T01:23:45” altered=“false”>

<baseURL>http://crosswalker.oa.org/<baseURL>

<identifier>oai:cw.oa.org:z1x2y3</identifier>

<datestamp>2002-02-09T01:15:43Z</datestamp>

<metadataNamespace>http://../oai_marc</metadataNamespace>

<originDescription harvestDate="2002-02-08T08:55:46Z” altered="true">

<baseURL>http://odd.oa.org</baseURL>

<identifier>oai:odd.oa.org:z1x2y3</identifier>

<datestamp>1999-08-07T06:05:04Z</datestamp>

<metadataNamespace>http://odd.oa.org/odd_fmt</metadateNamespace>

</originDescription>

</originDescription>

</provenance>

</about>

</record>

Page 44: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

A Deduping Experiment

• “Initial Experiences Re-Exporting Duplicate and Similarity Computation with an OAI-PMH aggregator”– http://www.arxiv.org/abs/cs.DL/0401001

• Summary:– harvest favorite repositories– compute similarities with VSM– re-export metadata with top 10 similarities

for a record in an <about> container• dups, recommendations

Page 45: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

looking ahead: novel uses of OAI-PMH

Page 46: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

• “Using OAI-PMH…Differently” Young, Van de Sompel, Hickey, D-Lib Magazine, 9(7/8), 2003– http://www.dlib.org/dlib/july03/young/07young.html

• DL Usage logs ~ LANL• Registry of metadata formats for OpenURL ~ OCLC & LANL

– http://www.openurl.info/registry/– http://lib-www.lanl.gov/~herbertv/papers/icpp02-draft.pdf

• GSFAD Thesaurus ~ OCLC

• “The multi-faceted use of the OAI-PMH in the LANL Repository”

– http://lib-www.lanl.gov/~herbertv/papers/jcdl2004-submitted-draft.pdf

Page 47: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu
Page 48: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

OAI-PMH access to DL usage logs

• usage logs filtered and stored in MySQL

db

• accessible as 2 OAI-PMH repositories:• document oriented• agent oriented (user-proxy)• interlinked

• recommender system:• harvests logs• interpretes logs• exposes relationships (OpenURL access)

Page 49: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Phase 1: creating recommender system

Documentlogs

local

Agent logs

local

document log agent log

Log processing

Logbased recom.system

About local and remote data

Page 50: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

ISI citations

local

Inspec bibliographic

local

PubMed bibliographic

remote

Biosis bibliographic

local

biblio or

citation

id

recommendedidentifiers

Phase 2: requesting recommendations

Logbased recom.system

rela

teOpenURL

see: http://www.dlib.org/dlib/june02/bollen/06bollen.htmlfor a possible methodology for computing log-based recommendations

Page 51: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

agent

alog:IP:128.1.22.13

Repository 1

docs accessedby agentabout

agent

alog

Page 52: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

document

dlog:ori:pmid:258471

Repository 2

agents accessingthe documentabout

document

dlog

Page 53: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

metadata planeresource1

resource2 resource3

default links

herbert van de sompel

default links:• restricted in nature• action-radius restricted by business agreements• not context-sensitive

Page 54: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

metadata plane

extended services plane

resource1

servicecomponent1

servicecomponent2

default links

appropriate linksOpe

nURL

resource2 resource3

herbert van de sompel

Page 55: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

OAI-PMH-conformant OpenURL Registry

• NISO OpenURL Framework builds on

Registry

• Registry entry:

• unique identifier

• always DC record

• sometimes XHTML or XML Schema

definition

Page 56: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

OAI-PMH-conformant OpenURL Registry

• Collaboration between LANL and OCLC

Office of Research:

• Registry is OAI-PMH harvestable

• Registry is browseable through

overlaying of PURL and XSLT

Page 57: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

registered item

ori:enc:UTF-8

OpenURL Registry

about registered

item

characterencoding

Page 58: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

XML Schemadefining the

metadata format

xsd

registered item

ori:fmt:xml:xsd:book

about registered

item

OpenURL Registrymetadataformat

Page 59: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

<?xml version="1.0" encoding="UTF-8" ?>

<?xml-stylesheet type="text/xsl" href="gui.xsl" ?>

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"

xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">

<responseDate>2002-02-08T12:00:01Z</responseDate>

...

</OAI-PMH>

OAI-PMH-conformant OpenURL Registry

• Insert XSLT stylesheet reference in OAI-

PMH response => make repository

browseable

Page 60: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

OAI-PMH-conformant OpenURL Registry

• Use PURL partial redirects to obtain

publishable URLs

baseURL? verb=GetRecord & metadataPrefix=xsd & identifier=ori:fmt:xml:xsd:book

http://www.openurl.info/registry/xsd/ori:fmt:xml:xsd:book

Page 61: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

OAI-PMH-conformant GSFAD Thesaurus

• OCLC Office of Research:

• GSFAD Thesaurus is OAI-PMH

harvestable

• Thesaurus is user-browseable through

overlaying of PURL and XSLT

• Thesaurus is accessible by machines via

OAI-PMH-based web services

Page 62: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Native GSFAD record

MARCXML

concept

Adventure+films

Xwalked GSFADrecord

GSFAD Thesaurus

Page 63: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Other Uses For the OAI-PMH• Assumptions:

– Traditional DLs / SPs will continue on their present path of increasing sophistication

• citation indexing, search results viz, personalization, recommendations, subject-based filtering, etc.

– growth rates remain the same (~5x DPs as SPs)

• Premise: OAI-PMH is applicable to any scenario that needs to update / synchronize distributed state– Future opportunities are possible by creatively

interpreting the OAI-PMH data model

Page 64: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Typical Values• repository

– collection of publications• resource

– scholarly publication• item

– all metadata (DC + MARC)• record

– a single metadata format• datestamp

– last update / addition of a record• metadata format

– bibliographic metadata format• set

– originating institution or subject categories

Page 65: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Repositories…

• Stretching the idea of a repository a bit:– contextually sensitive repositories

• “personalization for harvesters”• communication between strangers, or communication

between friends?

– OAI-PMH for individual complex objects?• OAI-PMH without MySQL?!

– Fedora, Multi-valent documents, buckets– tar, jar, zip, etc. files

Page 66: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Resource

• What if resource were:– computer system status

• uptime, who, w, df, ps, etc.

– or generalized “system” status• e.g., sports league standings

– people• personnel databases• authority files for authors

Page 67: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Item

• What if item were:– software

• union of versions + formats – all forms of metadata

• administrative + structural• citations, annotations, reviews, etc.

– data • e.g., newsfeeds and other XML expressible content

– metadataPrefixes or sets could be defined to be different versions

Page 68: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Record

• What if record were:– specific software instantiations / updates– access / retrieval logs for DLs (or computer systems)– push / pull model inversion

• put a harvester on the client behind a firewall, the client contacts a DP and receives “instructions” on how to submit the desired document (e.g., send email to a specified address)

Page 69: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Datestamp

• semantics of datestamp are strongly influenced by the choice of resource / item / record / metadataPrefix, but it could be used to:– signify change of set membership (e.g., workflow:

item moves from “submitted” to “approved”)– change datestamp to reflect access to the DP

• e.g., in conjunction with metadataPrefixes of “accessed” or “mirrored”

Page 70: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

metadataPrefix

• what if metadataPrefix were:– instructions for extracting / archiving / scraping the

resource• verb=ListRecords&metadataPrefix=extract_TIFFs

– code fragments to run locally• (harvested from a trusted source!)

– XSLT for other metadataPrefixes• branding container is at the repository-level, this could

be record- or item-level

Page 71: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Set• sets are already used for tunneling OAI-PMH

extensions (see Suleman & Fox, D-Lib 7(12))• other uses:

– in aggregators, automatically create 1 set per baseURL– have “hidden” sets (or metadataPrefix) that have

administrative or community-specific values (or triggers)

• set=accessed>1000&from=2001-01-01• set=harvestMeWithTheseARGS&until=2002-05-

05&metadataPrefix=oai_marc

Page 72: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

OAI-PMH & The Deep Web

Page 73: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Race for This New Market…

• Yahoo! & University of Michigan– http://www.umich.edu/news/index.ht

ml?Releases/2004/Mar04/r031004

• Google & CrossRef– http://www.nature.com/nature/focus/a

ccessdebate/17.html

Page 74: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Exposing Repository Contents

• DP9: Webcrawler access to OAI-PMH repositories

• http://dlib.cs.odu.edu/dp9/• JCDL 02 http://www.cs.odu.edu/~liu_x/dp9/dp9.pdf

• An Apache module for OAI-PMH– http://www.modoai.org/

• Extensible Repository Resource Locators (ERRoLs) for OAI Identifiers – http://www.oclc.org/research/projects/oairesolver/def

ault.htm

Page 75: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

DP9: Indexing Repositories with Web Crawlers

• Convert repository to a series of hyperlinks.– Each record has a persistent URL.

• Begin by dynamically creating a starting HTML page for an OAI-PMH repository– Starting page for a data provider would be

constructed by issuing a ListIdentifier request and translating the response into a HTML format containing a series of links to the records

Page 76: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Index OAI Repository by Crawler

• A link on this HTML page, when invoked, would result in another GetRecord request for a specific identifier.

• HTTP 503 and OAI-PMH resumptionToken are also observed.

Page 77: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Architecture

Page 78: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Better Fit for Crawlers

• Format of URLs– http://arc.cs.odu.edu:8080/dp9/getrecord.jsp?

identifier=oai:NACA:1917:naca-report-10 &prefix=oai_dc – http://arc.cs.odu.edu:8080/dp9/getrecord/oai_dc/

oai:NACA:1917:naca-report-10

• HTML Meta tags– Some crawlers (such as Inktomi) use the HTML meta

tags to index a Web pages; DP9 also maps Dublin Core metadata to corresponding HTML meta tags.

– For pages that are designed exclusively for robots navigation, a noindex robots meta tag is used

• X-FORWARDED-FOR header to distinguishbetween different users coming in via his proxy

Page 79: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Results

• 101 repositories with millions of records are open to web crawlers.

• Tens of thousands of documents has been indexed by web crawlers through DP9.

• Web logs show that more than 1000 queries are issued from popular web search engines each day.

Page 80: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Problems

• DP9 does not cache the records and only forwards requests to corresponding data providers.

• Dp9’s records are always up-to-date.

• Quality of service is highly dependent on the availability of data providers.

Page 81: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Problems

• Aggressive crawling.– An aggressive crawler using DP9 can rapidly send

requests without regard for the load they are placing on the data providers.

• Solution.– OAI mirror/caching mechanism such as OAI

aggregator (http://celestial.eprints.org/) in Southampton university.

– HTTP throttle software to relieve the overhead on data providers.

Page 82: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Problems

• PageRank?– resumptionToken may create very deep

links that web crawlers don’t follow.– resumptionToken may be invalid after a

period of time.

• Possible solution– Create many small bins based on the

selective harvesting features of OAI-PMH

Page 83: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Problems

• Long harvesting time.– If a crawler harvests one record per second, it will

take at least 277 hours to harvest a 1M records database.

• Solution.– DP9 is distributed as an open source tool so any OAI-

PMH compliant repository can install it• http://dlib.cs.odu.edu/dp9/

– Current installation.• Southampton University.• University of Pennsylvania.

Page 84: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

mod_oai

• Ongoing project– funded by the Mellon Foundation– www.modoai.org

• An Apache module (written in C) to automatically answer OAI-PMH requests for an http server

Page 85: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

www.getty.edu

doc1; last mod2003-03-12

doc2; last mod2002-07-19

doc3; last mod2003-11-29

doc4; last mod2002-10-03

doc100; last mod2003-09-113…

what documents have beenmodified since 2003-11-15?

Inefficient Web Crawlers

Page 86: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

www.getty.edu with OAI-PMH

doc1; last mod2003-03-12

doc2; last mod2002-07-19

doc3; last mod2003-11-29

doc4; last mod2002-10-03

doc100; last mod2003-09-113…

what documents have beenmodified since 2003-11-15?

A More Efficient Way…

Initial Harvest Weekly UpdatesGoogle Robot 100 connections 100 connectionsOAI-PMH Harvester 10 connections 1 connection

web site = 100 pages; 10 pages per resumptionToken, 5 page updates a week

Page 87: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

ERRoLs

• OCLC to project to assign “Cool URLs” to OAI-PMH records– cf.

http://www.w3.org/Provider/Style/URI.html– a “thin layer” of a service provider to bring

repositories closer to human interaction• cf. the Repository Explorer

Page 88: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Steps

• repository registers with ERRoLs• access OAI-PMH records with:• http://errol.oclc.org/ + <oai-identifier>• ERRoLs also available for

– sites that don’t user <oai-identifier> – repository overview pages

Page 89: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

ERRoLs Examples

• http://errol.oclc.org/oai:xmlregistry.oclc.org:demo/ISBN/0521555132• http://errol.oclc.org/oai:xmlregistry.oclc.org:demo/ISBN/0521555132.html• http://errol.oclc.org/oai:xmlregistry.oclc.org:demo/ISBN/

0521555132.ListMetadataFormats• http://errol.oclc.org/oai:xmlregistry.oclc.org:demo/ISBN/0521555132.resource• http://errol.oclc.org/oai:xmlregistry.oclc.org:demo/ISBN/

0521555132.ListERRoLs

Page 90: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

“Inverted Archives”

• Unit of discourse is no longer an archive or service, but a DOI which has services linked from it– cf.:

• UPS demonstration prototype• “Smart Objects, Dumb Archives” (SODA)

model

Page 91: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Example

http://dx.doi.org/10.1145/374308.374342

Page 92: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Naming

• Fundamental to other technologies (OAI-PMH, OpenURL, etc.)

• Options– URNs– Persistent URLs (PURLs)

• http://purl.org/– Handles

• http://www.handle.net/– Digital Object Identifiers

• http://www.doi.org/– ARK

• http://www.cdlib.org/inside/diglib/ark/

Page 93: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Object Models

Page 94: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Popular Object Models

• METS– used in DSpace, Fedora– http://www.loc.gov/standards/mets/

• MPEG-21 DIDL– http://xml.coverpages.org/mpeg21-didl.html– used in LANL DLs

• http://www.dlib.org/dlib/november03/bekaert/11bekaert.html• http://www.dlib.org/dlib/february04/bekaert/02bekaert.html• http://lib-www.lanl.gov/~herbertv/papers/jcdl2004-submitted-draft.pdf

Page 95: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Object Models & OAI-PMH

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

resource

item oai:foo.edu:1234

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

METS

Move from simple metadata files“pointing” to resources…

…to records as “modeled representations” of resources

records

Page 96: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

OAI-PRH?

• using OAI-PMH for resource extraction / exchange– yes, OAI-PMH is for metadata not resources, but its

going to happen anyway…• mirroring• preservation (archive “zipping”)• convergence with OAIS

– assumptions• a digital resource• rsync et al. neither appropriate nor possible• defer metadata vs. data discussion

Page 97: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Possible Approaches

1. Exploit knowledge outside the scope of the OAI-PMH to extract the resource

2. Base64 encode the resource and transmit via OAI-PMH as a separate “metadata” prefix?

3. Separate metadata prefix with instructions on how to extract / scrape the resource

4. Separate metadata format with XML encoded metadata, along with XSLT to decode it

Page 98: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Out of Band Knowledge

direct pdf

1. take url in dc:identifier2. parse report number 3. append

“reportnumber.pdf” to url

Page 99: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Out of Band Knowledge

• pros: tailored, no “accidental” harvesting• cons: not scalable wrt # of repositories &

harvesters, false negatives

no metadata change

metadata change

no data change

ok unnecessarydownload

data change

missed update!

okassumption: change in metadata means a change in data -- not always true!

Page 100: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Base64 Encoding

• define separate metadata formats – base64:application/pdf– base64:application/powerpoint

• pros: describable with OAI-PMH semantics, accomplished with standard OAI-PMH tools

• cons: heavyweight (could use compression), suitable for simple objects only, accidental harvesting would produce high loads for repositories and harvesters– use complex objects, “modeled representation”,

instead!

Page 101: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Metadata as Instructions

cf. http://genomebiology.com/2003/4/6/R40

Page 102: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Metadata as Instructions

• the resource described in <dc:identifier> could be a complex object– may not be appropriate to:

• “tar” the object into a single file• expose all constituent objects through OAI-PMH

– define a metadata prefix that provides machine readable instructions on how extract the complex object

• METS• MPEG-21 DIDL• SCORM• …

Page 103: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Metadata as Instructions

Page 104: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

XSLT

• if the resource is already XML encoded, include an XSLT to transform into the desired format– use separate metadata formats or even sets for the

harvester to express their transformation preferences?

• pros: elegant, limited work for repository• cons: assumes client-side transformation

capability, applicable only for XML-encodable resources

Page 105: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Static Repositories

Page 106: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Why a Static Repository?

• Not all sites can install CGI or servlets– they can have web access, but either use:

• a restrictive ISP• be firewalled off

• Not all sites need the heavyweight of OAI-PMH software– static repos are not limited by size, but are

generally intended for “smaller” repositories (<5000 records)

• http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm

Page 107: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Static Repository Architecture

gateways can described themselves with the new gateway container:http://www.openarchives.org/OAI/2.0/guidelines-gateway.htm

Page 108: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

OAI-Rights

Page 109: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Not Just for Eprints Anymore…

• OAI-Rights is an ongoing effort:– http://www.openarchives.org/documents/

OAIRightsWhitePaper.html– http://www.openarchives.org/news/oairightspress030929.html

• Issues:– entity association

• metadata vs. resource

– aggregation association• records, repository, sets

– binding• Record’s <about>, Identify’s <description>

Page 110: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Download and Go!

Page 111: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Where Do You Want to Build?

user

. . .dataprovider

dataprovider

dataprovider

dataprovider

serviceprovider

local context-sensitive services

EPrints.org

dataprovider

CDSware

CDSware

Page 112: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

Fedora

• joint project between Cornell & UVa – funded by the Mellon Foundation

• a repository management system– focuses on complex digital objects and their

behaviors

• more info:– http://www.fedora.info/– D-Lib Magazine, 9(4)

• http://www.dlib.org/dlib/april03/staples/04staples.html

Page 113: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

• MIT + HP Labs• constructed to capture all the output of MIT’s

faculty• now generalized to the DSpace Federation

– 8 top universities in the US & Canada

• More info:– http://www.dspace.org/– http://sourceforge.net/projects/dspace/– D-Lib Magazine 9(1)

• http://www.dlib.org/dlib/january03/smith/01smith.html

Page 114: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

EPrints.org

• developed at Southampton University– part of larger suite of institutional/author self-

archiving tools and services• e.g.: citebase; paracite

• widely adopted -- 100+ sites– http://software.eprints.org/#ep2

• more info– http://www.eprints.org/– http://www.arl.org/sparc/core/index.asp?

page=g20#6

Page 115: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

CDSware

• developed at CERN• data provider & service provider• large-scale use @ CERN (> 600k

records)– in use at a few non-CERN sites

• free & paid support models• more info

– http://cdsware.cern.ch/

Page 116: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

• P2P publishing for academia– community servers for coordination, management– archivelets for individual laptops, PCs

• more info:– http://kepler.cs.odu.edu/– D-Lib Magazine 7(4)

• http://www.dlib.org/dlib/april01/maly/04maly.html

Page 117: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

• developed by UKOLN– open source

• OpenURL 0.1 format resolver– NISO 1.0 format???

• more info:– Ariadne, 28

• http://www.ariadne.ac.uk/issue28/resolver/• ftp://ftp.ukoln.ac.uk/metadata/tools/openresolver/• http://www.ukoln.ac.uk/distributed-systems/openurl/

Page 118: Advanced OAI-PMH Michael L. Nelson mln@cs.odu.edu mln/ Several Slides from Herbert Van de Sompel, Simeon Warner and Xiaoming Liu

The Future: Community Building

• Ultimately, protocols and metadata formats are not what makes a difference

• Rather, the critical mass afforded by a common set of utilities (cf. http, Dublin Core, XML)

• The best current example: The Open Language Archives Community – http://www.language-archives.org/

• OAI-PMH provides the basis for communication between strangers, but allows even richer communication between friends