
Harvesting Metadata Using OAI-PMH
Roy Tennant, California Digital Library
Internet Librarian 2005, Monterey, CA

Outline

The Open Archives Initiative
OAI-PMH
The Harvesting Process
Harvesting Problems
Steps to a Fruitful Harvest
A Harvesting Service Model
The OAI Future

Open Archives Initiative

Aimed at making the large and growing number of repositories of freely available digital content interoperable
Protocol for Metadata Harvesting (OAI-PMH) specifies how repositories can expose their metadata for others to harvest
Over 800 repositories world-wide support the protocol
OAIster.org has indexed nearly 6 million items from over 500 of those repositories

www.oaforum.org/tutorial/

OAI-PMH

Data providers (DP): those with the stuff
Service providers (SP): those who harvest metadata and provide aggregation and search services
Software for both DPs and SPs readily available
OAI-PMH verbs:
    Identify
    ListIdentifiers
    ListMetadataFormats
    ListSets
    ListRecords
    GetRecord

OAI Architecture

Source: Open Archives Forum Tutorial


Identify

Provides basic information about a repository

ListMetadataFormats

Lists available metadata formats

ListIdentifiers

Lists all identifiers (or only those of the optionally specified set)
Must include the metadataPrefix argument

ListSets

Lists available sets

Library of Congress ListSets response


ListRecords

Lists all records (or only those of the optionally specified set)
Must include the metadataPrefix argument

GetRecord

Retrieves a specific record
Must include the metadataPrefix and identifier arguments
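The six verbs are sent as ordinary HTTP GET parameters, so a request is just a base URL plus a query string. The following is a minimal sketch of building such requests and checking the required arguments listed above. The base URL, the set name, and the helper name `build_request` are illustrative assumptions, and the check ignores the protocol's resumptionToken case.

```python
from urllib.parse import urlencode

# Required arguments per verb, as listed on the slides above.
# (The resumptionToken case, where it is the only argument, is ignored here.)
REQUIRED_ARGS = {
    "Identify": set(),
    "ListMetadataFormats": set(),
    "ListSets": set(),
    "ListIdentifiers": {"metadataPrefix"},
    "ListRecords": {"metadataPrefix"},
    "GetRecord": {"metadataPrefix", "identifier"},
}


def build_request(base_url, verb, **args):
    """Return the GET URL for one OAI-PMH request, checking required arguments."""
    if verb not in REQUIRED_ARGS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    missing = REQUIRED_ARGS[verb] - set(args)
    if missing:
        raise ValueError(f"{verb} requires: {', '.join(sorted(missing))}")
    return f"{base_url}?{urlencode({'verb': verb, **args})}"


# Hypothetical repository and set name, for illustration only.
print(build_request("https://repository.example.org/oai", "ListRecords",
                    metadataPrefix="oai_dc", set="maps"))
```

Keeping the required arguments in one table means a malformed request fails locally instead of coming back from the repository as a protocol error.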

The Harvesting Process

Identifying Sources
Selecting Sets
Harvesting
Metadata Processing
Indexing
Interface

A Harvesting Service Model

gita.grainger.uiuc.edu/registry/

errol.oclc.org


Selecting Sets

Review the response to the ListSets verb
May be instructive to search the collection in the native interface, if possible
Look for descriptive pages on the site being harvested
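Reviewing a ListSets response is easier once it is reduced to a plain list of set identifiers and names. This sketch parses a response with Python's standard XML library. The sample response is made up and trimmed to the elements the parser reads; real responses carry more elements, and sets may include an optional description.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# A made-up ListSets response, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>maps</setSpec><setName>Historical Maps</setName></set>
    <set><setSpec>photos</setSpec><setName>Photograph Collection</setName></set>
  </ListSets>
</OAI-PMH>"""


def list_sets(xml_text):
    """Return (setSpec, setName) pairs from a ListSets response."""
    root = ET.fromstring(xml_text)
    return [
        (s.findtext("oai:setSpec", namespaces=OAI_NS),
         s.findtext("oai:setName", namespaces=OAI_NS))
        for s in root.iterfind(".//oai:set", OAI_NS)
    ]


for spec, name in list_sets(SAMPLE):
    print(f"{spec}\t{name}")
```

The setSpec value is what goes into the set argument of a later ListIdentifiers or ListRecords request.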

Harvesting

Many harvesting applications are available; I will focus on:
    Public Knowledge Project (PKP) Harvester
    Virginia Tech Perl Harvester
Library software vendors increasingly offer harvesting products (e.g., Ex Libris’ MetaIndex)

+-----------------------------------------+
| Harvester Sample Configurator           |
+-----------------------------------------+
| Version 1.1 :: July 2002                |
| Hussein Suleman <[email protected]>     |
| Digital Library Research Laboratory     |
| www.dlib.vt.edu :: Virginia Tech        |
+-----------------------------------------+

Defaults/previous values are in brackets - press <enter> to accept those
enter "&delete" to erase a default value
enter "&continue" to skip further questions and use all defaults
press <ctrl>-c to escape at any time (new values will be lost)

Press <enter> to continue

[ARCHIVES]
Add all the archives that should be harvested

Current list of archives:
No archives currently defined !

Select from: [A]dd [D]one
Enter your choice [D] : a{return}

[ARCHIVE IDENTIFIER]
You need a unique name by which to refer to the archive you
will harvest metadata from
Examples: nsdl-380602, VTETD

Archive identifier [] : nsdl-380602{return}

Virginia Tech Perl Harvester
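For readers who want to see what a harvester does underneath, here is a minimal Python sketch of a ListRecords loop. It is not the code of either harvester named above. It relies on the protocol's resumptionToken mechanism, by which a repository hands out a large result in pieces. The function names and the injectable `fetch` parameter are assumptions made so the loop can be tried without a network. Error responses and retries are left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def http_get(url):
    """Fetch one response body over HTTP."""
    with urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix="oai_dc", set_spec=None, fetch=http_get):
    """Yield every <record> element from a complete ListRecords harvest."""
    args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        args["set"] = set_spec
    while True:
        root = ET.fromstring(fetch(f"{base_url}?{urlencode(args)}"))
        yield from root.iter(f"{OAI}record")
        token = root.find(f".//{OAI}resumptionToken")
        # A missing or empty token means the list is complete.
        if token is None or not (token.text or "").strip():
            return
        # The token replaces all other arguments on the follow-up request.
        args = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

Calling `harvest("https://repository.example.org/oai", set_spec="maps")` (a hypothetical repository) returns a generator, so records can be processed or written to disk as each response arrives.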


Let’s Harvest!

Indexing

Pick your favorite database/indexing software:
    MySQL
    SWISH-E
    Whatever is lying around…
May need to specifically set up a method to search across the entire record
May need different fields for indexing than for display
Will need to deal with element collision

Interface

Software interface (API) for other applications:
    SRU/SRW?
    MXG?
    Arbitrary Web Services schema?
User interface:
    What functions do you want your users to be able to perform?
    What kinds of displays do you want?

Harvesting Problems

Sets
Metadata Formats
Metadata Artifacts
Granularity
Metadata Variances

Sets

Records are harvested in clumps, called “sets”, created by DPs
No guidelines exist for defining sets
Examples:
    Collection
    Organizational structure
    Format (but is a page image an image? See example)

Metadata Formats

Only required format is simple Dublin Core, although any format can be made available in addition
Few DPs surface richer metadata
Simple DC is simply too simple!
Example (artifact vs. surrogate dates)


Metadata Artifacts

“unintended, unwanted aberrations”
Sample causes:
    Idiosyncratic local practices
    Anachronisms
    HTML code
Examples:
    Circa = string of dates for searching purposes
    [electronic resource]
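Two of the artifacts named above, stray HTML code and the "[electronic resource]" qualifier, can be stripped mechanically. The following is a rough sketch using regular expressions. The sample string is invented, and a regex is only a crude way to remove markup. Whether to drop the qualifier at all is a local decision.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")
QUALIFIER = re.compile(r"\s*\[electronic resource\]", re.IGNORECASE)


def clean_value(value):
    """Strip HTML markup and the '[electronic resource]' qualifier from one value."""
    value = TAG.sub("", value)        # drop HTML tags
    value = html.unescape(value)      # decode entities such as &amp;
    value = QUALIFIER.sub("", value)  # drop the cataloging qualifier
    return re.sub(r"\s+", " ", value).strip()


# Invented example value, for illustration only.
print(clean_value("<i>Harvest &amp; Home</i> [electronic resource]"))
```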

Granularity

Record Granularity: what is an “object”?
    A book, or each individual page?
    Examples: CDL, Univ. of Michigan
Metadata Granularity:
    Multiple values in one field
    Example: Univ. of Washington
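When a provider packs multiple values into one field, a service provider can split them back apart before indexing. This is a small sketch that assumes the values are separated by semicolons. The slides do not say which separator any particular repository uses, so the default here and the sample value are assumptions.

```python
import re


def split_values(value, separators=";"):
    """Split one field into separate values on any of the separator characters."""
    parts = re.split(f"[{re.escape(separators)}]", value)
    return [p.strip() for p in parts if p.strip()]


# Invented subject field, for illustration only.
print(split_values("Logging; Railroads; Lumber camps"))
```

Splitting on commas is riskier, since names and headings such as "Tennant, Roy" contain commas of their own.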

Metadata Variances

Subject terminology differences
Disparities in recording the same metadata
    Example: date variances
Mapping oddities or mistakes
    Examples: 1) format into description, 2) description into subject

Steps to a Fruitful Harvest

Needs Assessment (it’s the user, stupid)
DP Identification and Communication
Metadata Capture
Metadata Analysis
Metadata Subsetting
Metadata Normalization
Metadata Enrichment
Indexing & Display
Interface (it’s still the user, stupid)

Needs Assessment

What are you trying to accomplish?
What will your users want to be able to do?
What metadata will you need, and what procedures will you need to set up to enable these activities?
Which repositories have what you want?
Is what they have (e.g., sets, metadata) usable as is, or ?

DP Identification & Communication

Identification:
    Use UIUC directory of DPs to identify potential sources
Communication:
    Not required to tell them you are harvesting, but may help establish a good relationship
    May want to request that they surface a richer metadata format and/or provide a different set


Metadata Capture

Sample questions to answer:
    Individual sets, or all?
    Richer metadata formats available?
    How frequently to reharvest?
    Start from scratch each time or update?

Many software options
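The "start from scratch or update" question maps onto the protocol's optional `from` argument, which limits ListRecords to records added or changed since a given date. The following is a minimal sketch of building either kind of request. The function name and base URL are assumptions. Whether an update harvest also reports deletions depends on how the repository handles deleted records, so that needs checking per provider.

```python
from datetime import date
from urllib.parse import urlencode


def reharvest_url(base_url, last_harvest=None, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords URL: a full harvest, or an update since last_harvest."""
    args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        args["set"] = set_spec
    if last_harvest is not None:
        # Day-level dates (YYYY-MM-DD) are used here.
        args["from"] = last_harvest.isoformat()
    return f"{base_url}?{urlencode(args)}"


# Hypothetical repository, for illustration only.
print(reharvest_url("https://repository.example.org/oai", date(2005, 10, 1)))
```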

Metadata Analysis

Finding out what you have (and don’t have)
    Encoding practices
    Gap analysis (e.g., missing fields, etc.)
    Mistakes (e.g., mapping errors)
Software can help
    Commercial software like Spotfire
    In-house or open source software tools

Source: 2002 Master’s Thesis, Jewel Hope Ward, UNC Chapel Hill

Five elements are used 71% of the time
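An in-house analysis tool can start very small. This sketch computes each element's share of all element occurrences and lists records missing a given field, which covers the simplest form of gap analysis. The record structure and sample data are assumptions and are not related to the thesis figures cited above.

```python
from collections import Counter


def element_usage(records):
    """Return each element's share of all element occurrences, most used first."""
    counts = Counter()
    for fields in records.values():
        for element, values in fields.items():
            counts[element] += len(values)
    total = sum(counts.values())
    if not total:
        return {}
    return {element: n / total for element, n in counts.most_common()}


def missing(records, element):
    """Return identifiers of records that lack the element or have only empty values."""
    return [rid for rid, fields in records.items()
            if not any(v.strip() for v in fields.get(element, []))]


# Hypothetical harvested records: identifier -> {element: [values]}.
records = {
    "oai:example:1": {"title": ["Logging camp"], "date": ["1863."]},
    "oai:example:2": {"title": ["Railroad trestle"], "subject": ["Railroads", "Bridges"]},
}
print(element_usage(records))
print(missing(records, "date"))
```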


Metadata Subsetting

DP sets are unlikely to serve all SP uses well
SPs will need the ability to subset harvested metadata

Metadata Normalization

Normalizing: to reduce to a standard or normal state
Prototype date normalization service screen
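To give a feel for what date normalization involves, here is a deliberately simple sketch that reduces free-text dates to the four-digit years they contain. It is not the prototype service mentioned above. The year range (1000 to 2099) is an assumption, and a production service would need to handle circa dates, ranges, and other calendars more carefully.

```python
import re

# Matches standalone four-digit years from 1000 to 2099 (an assumed range).
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def normalize_date(raw):
    """Return the sorted, distinct four-digit years found in a raw date string."""
    return sorted({int(y) for y in YEAR.findall(raw)})


for raw in ["1863.", "[2001 or 2002.]", "ca. 1900-1910", "n.d."]:
    print(repr(raw), "->", normalize_date(raw))
```

Returning a list rather than a single value keeps uncertain dates such as "[2001 or 2002.]" searchable under either year.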

Metadata Enrichment

Adding fields and/or qualifiers may be useful or required, for example:
    Metadata provider information
    Geographic coverage
    Subject terms mapped to a different thesaurus
    Authority control record
The enrichment process may be the same tool as the subsetting tool (i.e., find a cluster of records and perform an action)
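The "find a cluster of records and perform an action" idea can be sketched as one function that takes a selection test and a set of fields to add. Everything here is illustrative: the record structure, the `provider` field name, and the sample values are assumptions.

```python
def enrich(records, matches, additions):
    """Add fields to every record for which matches(record) is true.

    Returns the number of records in the selected cluster.
    """
    selected = 0
    for record in records:
        if matches(record):
            for element, values in additions.items():
                existing = record.setdefault(element, [])
                # Skip values the record already has, so reruns do not duplicate.
                existing.extend(v for v in values if v not in existing)
            selected += 1
    return selected


# Hypothetical harvested records, for illustration only.
records = [
    {"title": ["Cannery Row"], "subject": ["Monterey (Calif.)"]},
    {"title": ["Logging camp"], "subject": ["Logging"]},
]
changed = enrich(
    records,
    matches=lambda r: any("Monterey" in s for s in r.get("subject", [])),
    additions={"coverage": ["Monterey County (Calif.)"],
               "provider": ["Example Digital Library"]},
)
print(changed, records[0])
```

The same `matches` test, used on its own as a filter, serves as the subsetting step.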

Indexing & Display

Selected fields may need to be mapped to specific indexing and display elements
Particularly required if harvesting different metadata formats
But also needs to be done with multiple, conflicting fields:

<date>1863.</date>
<date>[2001 or 2002.]</date>

<identifier>SHS 1,679</identifier>
<identifier>http://content.lib.washington.edu/cgi-bin/htmlview.exe?CISOROOT=/loc&CISOPTR=58</identifier>
<identifier>http://content.lib.washington.edu/loc/image/1679.jpg</identifier>
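Repeated identifier fields like those above have to be sorted into roles before display: one is a local number, one a link to the item page, one a link to the image file. This sketch does that with simple heuristics. The rules (URL prefix, image file extensions) and the output keys are assumptions, and other repositories would need different rules.

```python
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif")


def map_identifiers(identifiers):
    """Sort repeated <identifier> values into link, image, and local-number roles."""
    mapped = {"link": None, "image": None, "local": []}
    for value in identifiers:
        lowered = value.lower()
        if lowered.startswith(("http://", "https://")):
            # Keep the first URL seen for each role.
            role = "image" if lowered.endswith(IMAGE_SUFFIXES) else "link"
            mapped[role] = mapped[role] or value
        else:
            mapped["local"].append(value)
    return mapped


print(map_identifiers([
    "SHS 1,679",
    "http://content.lib.washington.edu/cgi-bin/htmlview.exe?CISOROOT=/loc&CISOPTR=58",
    "http://content.lib.washington.edu/loc/image/1679.jpg",
]))
```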

A Harvesting Service Model


The OAI Future

Further protocol development
Services layered on top of OAI-PMH
Shared software tools
Best practices for both DPs and SPs

oai-best.comm.nsdl.org