dspace 4.2 transmission: import/export

DSpace 4.2 Advanced Training – Content Transmission

DSpace 4.2 Advanced Training by James Creel is licensed under a Creative Commons Attribution 4.0 International License. Special thanks to the DuraSpace Foundation and the Texas Digital Library for making this course possible.

Upload: duraspace

Post on 17-Jul-2015



Module Outline

• Harvesting and Disseminating with OAI-PMH

• Reading content with REST

• Export and Import with SIPs

• Depositing content with SWORD

• Importing content with the Simple Archive Format (SAF)

Introduction to Harvesting

• Open Archives Initiative

• Protocol for Metadata Harvesting

• Object Reuse and Exchange

• Harvesting with DSpace XMLUI

• Choice of collection source

• Replicate metadata (OAI-PMH) or metadata + data (OAI-PMH + OAI-ORE)

• What an excellent way to rapidly populate one’s repository!

Introduction to Harvesting

• Go ahead and create a new collection wherever you please.

• We will be harvesting content from remote DSpace repositories.

• After you create the collection, you are taken to its edit view. Click the Content Source tab.

Configure the content source

How do we learn about the harvest source?

• Point your browser to http://repository.tamu.edu/dspace-oai/request?verb=ListSets to see the list of sets (communities and collections) at TAMU.

• An OAI server answers requests for several other useful verbs as well.

• Point your browser to http://www.openarchives.org/OAI/openarchivesprotocol.html for details.

• In the 1.8.x days, one would need to keep that page open when trying to craft queries to OAI. Under 3.x and higher, there is a lovely stylesheet courtesy of Lyncode that makes typical queries easy and automatic.

Configuring the Content Source

• A sample OAI Provider – OAK Trust: The Texas A&M Digital Repository: http://repository.tamu.edu/dspace-oai/request

• OAI Set spec: com_1969.1_5670

• Test the settings to make sure things are copasetic, then save.

Save your pointers to the external OAI service. Select Import Now…

Voila!

Your oai webapp provides a machine-readable dissemination service.

• Try some requests:

• http://localhost:8080/oai/request?verb=Identify

• http://localhost:8080/oai/request?verb=ListMetadataFormats

• http://localhost:8080/oai/request?verb=ListSets

• http://localhost:8080/oai/request?verb=ListRecords&metadataPrefix=ore
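
The requests above all follow one pattern: a base URL, a verb, and optional arguments. A minimal Python sketch of that pattern follows. The helper name oai_request_url is hypothetical, the oai_dc prefix in the last call is the standard Dublin Core prefix from the OAI-PMH specification rather than something shown on the slide, and nothing here contacts a server.

```python
from urllib.parse import urlencode

# Verbs defined by the OAI-PMH 2.0 specification.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def oai_request_url(base_url, verb, **arguments):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    # The verb goes first, followed by arguments such as metadataPrefix or set.
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


if __name__ == "__main__":
    base = "http://localhost:8080/oai/request"  # local webapp from the slide
    print(oai_request_url(base, "Identify"))
    print(oai_request_url(base, "ListRecords", metadataPrefix="ore"))
    # Restricting a harvest to one set, using the set spec from the TAMU example.
    print(oai_request_url(base, "ListRecords",
                          metadataPrefix="oai_dc", set="com_1969.1_5670"))
```

Paste any printed URL into a browser to see the styled response.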

We can experiment with harvesting from each other’s repositories

• From your command line, run ipconfig

• Your IP address will be listed as the IPv4 address

• You can craft an OAI request URL for your server using the IP address as the host name.

• If you like, invite a neighbor to harvest one of your collections.

Automating Harvesting (1/3)

• Requests to harvest large collections can easily time out.

• This calls for a scheduler that runs independently of the browser.

• Find it in the XMLUI under the control panel.

Automating Harvesting (2/3)

• When automated, the harvester will conduct its activity on all collections that are configured to harvest.

• Once started, the harvester will operate at regular intervals as specified by harvester.harvestFrequency in modules/oai.cfg.

Automating Harvesting (3/3)

• Start – initiate the periodic process

• Pause – wait for the current operation to complete, then suspend further operations

• Stop – wait for the currently harvested item to complete, then suspend further operations (which will likely break further harvests of the containing collection)

• Reset Harvest Status – clears the status of each harvested collection so that they may be initiated anew

Which formats are available to your harvester?

• This is configurable in [dspace-install-dir]\config\modules\oai.cfg under the harvester.oai.metadataformats.[declared-metadata-format-name] values

• Where [declared-metadata-format-name] is declared in your xoai.xml

• Let’s add “rdf” to that list and try harvesting with it.

Dissemination – Metadata Crosswalks

• Metadata in DSpace exist in key-value pairs with field names given by the metadata registry.

• Fields may be exported in the formats that the OAI server reports in response to the ListMetadataFormats verb.

• Dissemination crosswalks are encoded as XSL files inside the [dspace-install-dir]\config\crosswalks directory

• The .properties files appear to be no longer used for OAI dissemination as of DSpace 3.x

• The crosswalks are active in specific contexts that can be configured.

Configuring Metadata Crosswalks – XOAI Configuration Entities

• Open up the C:\dspace\config\crosswalks\oai\xoai.xml file with jEdit.

• The top level Configuration element contains <Contexts>, <Formats>, <Transformers>, <Filters>, and <Sets>.

• Each of these contains, in turn, what you would expect: <Context> elements, <Format> elements, <Transformer> elements, <Filter> elements, and <Set> elements.

• Each of these does its own thing.

Configuring Metadata Crosswalks – XOAI Configuration – Setting up Contexts

• The <Context> element refers to instances of all the other elements.

• The baseurl attribute determines how to address the context in your url path

• The <Format> elements name the crosswalks to be available

• The <Transformer> element names a stylesheet to apply to the final XML output

• The <Filter> elements name Java classes that will eliminate results unacceptable to the context

• The <Set> element appears simply to alias the set of all records in the context.

Configuring Metadata Crosswalks – XOAI Configuration – Setting up Formats

• The <Format> elements have an id attribute which allows them to be referenced in the <Context>

• They also contain, minimally, a

• <Prefix> by which they are addressed in OAI requests

• <XSLT> designating the xsl file doing the crosswalk

• And should include

• <Namespace> designating the namespace of XML output

• <SchemaLocation> designating the schema specification of that XML

Configuring Metadata Crosswalks – XOAI Configuration – Setting up Transformers

• The <Transformer> element contains an id attribute by which it is referenced in the <Context> and an <XSLT> element designating its XSL file.

Configuring Metadata Crosswalks – XOAI Configuration – Setting up Filters

• The <Filter> elements contain an id attribute by which they are referenced in the <Context> and

• <Class> which names the java class doing the filtering

• <Parameter> with a key attribute and one or more <Value> elements that are used to parameterize the filtering method.

Configuring Metadata Crosswalks – XOAI Configuration – Setting up Sets

• The <Set> element has the usual id attribute and

• <Pattern> which renders as the set spec in the OAI response

• <Name> which renders as the set’s name
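
To see how the pieces fit together, here is a rough Python sketch that holds an illustrative configuration fragment and checks that every reference inside a <Context> resolves to a declared id. The element names come from the slides. Everything else is an assumption for illustration: the refid attribute name, the ids, the XSL file names, the org.example filter class and its fields parameter. The real xoai.xml may also declare an XML namespace, which this fragment omits. Compare against your own file before copying anything.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only. Element names follow the slides; the "refid"
# attribute, ids, file names, and the filter class are hypothetical.
XOAI_SKETCH = """
<Configuration>
  <Contexts>
    <Context baseurl="vendor">
      <Format refid="oaidc"/>
      <Transformer refid="vendorTransformer"/>
      <Filter refid="requiredFieldsFilter"/>
      <Set refid="allRecords"/>
    </Context>
  </Contexts>
  <Formats>
    <Format id="oaidc">
      <Prefix>oai_dc</Prefix>
      <XSLT>metadataFormats/oai_dc.xsl</XSLT>
      <Namespace>http://www.openarchives.org/OAI/2.0/oai_dc/</Namespace>
      <SchemaLocation>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</SchemaLocation>
    </Format>
  </Formats>
  <Transformers>
    <Transformer id="vendorTransformer">
      <XSLT>transformers/vendor.xsl</XSLT>
    </Transformer>
  </Transformers>
  <Filters>
    <Filter id="requiredFieldsFilter">
      <Class>org.example.RequiredFieldsFilter</Class>
      <Parameter key="fields">
        <Value>dc.title</Value>
        <Value>dc.contributor.author</Value>
        <Value>dc.type</Value>
      </Parameter>
    </Filter>
  </Filters>
  <Sets>
    <Set id="allRecords">
      <Pattern>all</Pattern>
      <Name>All records</Name>
    </Set>
  </Sets>
</Configuration>
"""


def unresolved_references(xml_text, ref_attr="refid"):
    """List (context baseurl, element name, id) for references with no declaration."""
    root = ET.fromstring(xml_text)
    declared = {}
    for container in ("Formats", "Transformers", "Filters", "Sets"):
        kind = container[:-1]  # "Formats" -> "Format", and so on
        declared[kind] = {el.get("id") for el in root.findall(f"{container}/{kind}")}
    problems = []
    for context in root.findall("Contexts/Context"):
        for ref in context:
            target = ref.get(ref_attr)
            if target not in declared.get(ref.tag, set()):
                problems.append((context.get("baseurl"), ref.tag, target))
    return problems


if __name__ == "__main__":
    print(unresolved_references(XOAI_SKETCH))
```

A check like this is handy after hand-editing the file, since a mistyped id is easy to miss in jEdit.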

Exercise – A Custom Context

• Let’s imagine a use case where there is a requirement to be harvested by a vendor or partner.

• Only items with certain fields are suitable for their index (for example, those with a title, author, and type)

• Create a new context with an appropriate filter.

Configuring Metadata Crosswalks – Styling for Human Readability

• The webapps\oai\static\style.xsl stylesheet is used to render the OAI responses in a nice readable format with the links of interest also provided.

• One may also switch the stylesheet used by OAI by editing the stylesheet attribute of the <Configuration> root element of xoai.xml.

• Let’s experiment with some changes to the style –

• New branding

• Links to each of the contexts

The REST Webapp (1/4)

• Representational State Transfer – A scalable, simple approach to web services.

• Stateless on the server side – client maintains any session data

• Cacheable – responses should indicate whether the client can save them in a web cache

• Layerable – Client need not know or care whether the server is behind a proxy

• Simple, Uniform Requests – resources identifiable by URI, responses report their format and their cacheability

The REST Webapp (2/4)

• Read Only in 4.x

• JSON or XML, depending on the HTTP Accept header

• Possible values are application/xml and application/json

• Your browser may default to one or the other, but your application code (or developer’s browser) can specify.

• Communities, Collections, Items and Bitstreams are queryable resources

• The ?expand query parameter followed by a comma-delimited list will provide more detail than the default queries

The REST Webapp (3/4)

• Communities

• /rest/communities lists all

• /rest/communities/:id gets one

• ?expand possibilities: parentCommunity, collections, subCommunities, logo, all

• Collections

• /rest/collections lists all

• /rest/collections/:id gets one

• ?expand possibilities: parentCommunityList, parentCommunity, items, license, logo, all

The REST Webapp (4/4)

• Items

• /rest/items/:id gets one

• ?expand possibilities: metadata, parentCollection, parentCollectionList, parentCommunityList, bitstreams, all

• Bitstreams

• /rest/bitstreams/:bitstreamID gets one

• /rest/bitstreams/:bitstreamID/retrieve to download

• ?expand possibilities: parent, all
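
A short Python sketch of how application code might address these endpoints, choosing JSON or XML through the Accept header and asking for extra detail with ?expand. The helper name rest_request, the base URL http://localhost:8080/rest, and the id 5 are assumptions for illustration. The request object is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def rest_request(base_url, resource, resource_id=None, expand=(),
                 accept="application/json"):
    """Build (but do not send) a GET request for a REST resource."""
    url = f"{base_url}/{resource}"
    if resource_id is not None:
        url += f"/{resource_id}"
    if expand:
        # expand takes a comma-delimited list, so keep the commas unescaped.
        url += "?" + urlencode({"expand": ",".join(expand)}, safe=",")
    # The Accept header selects application/json or application/xml.
    return Request(url, headers={"Accept": accept})


if __name__ == "__main__":
    base = "http://localhost:8080/rest"  # assumed location of the REST webapp
    req = rest_request(base, "collections", 5, expand=("items", "logo"))
    print(req.full_url, dict(req.header_items()))
    # To actually send it against a running server:
    #   from urllib.request import urlopen
    #   body = urlopen(req).read()
```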

The DSpace Packager

• Utilized with the dspace packager command-line script

• Submission Information Packages

• Dissemination Information Packages

Submission Packages (SIPs)

• Four package formats supported by default:

• DSpace Archival Information Package (AIP) – used for backing up and restoring DSpace repository content

• DSPACE-ROLES – used for backing up and restoring DSpace groups and epersons

• METS – A zipfile containing MODS descriptive metadata and designating content bitstreams and their disposition

• PDF – A single PDF file can be considered a package (supposing its embedded metadata are suitable)

Submission Packages (SIPs)

• An example – importing a PDF as a package

• Track down a pdf on the interwebs – here’s one!

• http://hdl.handle.net/1969.1/2313

• Copy it to [dspace-install-dir] i.e. C:\dspace

• Learn about the packager with the C:\dspace\bin\dspace packager --help --type PDF command

• Can you craft the command to make the submission?

Submission Packages (SIPs) – PDF example

• We need a -t for type, -p for parent collection, -e for eperson email, and finally the name of the “package”

• Once this succeeds, however, the quality of the metadata is likely to be very poor indeed! Embedded metadata are seldom well populated.

Submission Packages (SIPs)

• An example – importing a METS package

• Of interest as this is also the package used by default for SWORD deposits

• Find the file mets-sip-example.zip in the W:\Development\resources directory.

• Copy it to [dspace-install-dir] i.e. C:\dspace

• Learn about the packager with the C:\dspace\bin\dspace packager --help --type METS command

• Can you craft the command to make the submission?

Submission Packages (SIPs) – METS example

• We need at least the -t flag for type, -p for parent collection, -e for eperson, and finally the filename of the package.

• C:\dspace\bin\dspace packager -t METS -p [collection-handle] -e [email protected] mets-sip-example.zip

Dissemination Packages (DIPs)

• DSpace Archival Information Package

• DSPACE-ROLES

• METS

• No need to export PDFs, we might suppose.

• As a final packaging exercise, use the packager to disseminate an item. This will require the additional -i (identifier, i.e. handle of the object) and -d (disseminate instead of the default, submit)

• Can you craft the command?
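
If you want to check your answer, the following Python sketch assembles the argument list for either direction from the flags named on these slides (-t, -p, -e, -i, -d). The helper name packager_command and the handles and email in the usage lines are placeholders, and the exact flag order is an assumption. Confirm the options against the packager's own --help output.

```python
def packager_command(package_type, eperson, package_file,
                     parent=None, identifier=None,
                     dspace_bin=r"C:\dspace\bin\dspace"):
    """Return the packager argument list for a submission or a dissemination.

    Pass `parent` (a collection handle) to submit a package, or
    `identifier` (an object handle) to disseminate one.
    """
    if (parent is None) == (identifier is None):
        raise ValueError("give exactly one of parent or identifier")
    cmd = [dspace_bin, "packager"]
    if identifier is not None:
        cmd += ["-d", "-i", identifier]  # -d switches from submit to disseminate
    else:
        cmd += ["-p", parent]            # submit is the default mode
    cmd += ["-t", package_type, "-e", eperson, package_file]
    return cmd


if __name__ == "__main__":
    # Placeholder handles and email; substitute values from your repository.
    print(" ".join(packager_command("METS", "admin@example.org",
                                    "mets-sip-example.zip",
                                    parent="123456789/2")))
    print(" ".join(packager_command("METS", "admin@example.org",
                                    "exported-item.zip",
                                    identifier="123456789/30")))
    # To run one for real: subprocess.run(cmd, check=True)
```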

Dissemination Packages (DIPs)

• A successful dissemination:

• Let’s complete the circle by submitting this package to another (or even the same) collection.

SWORD

• Simple Web Service Offering Repository Deposit

• DSpace comes with servers for v1 and v2

• Big innovation of v2 is ability to update items, but client support is currently limited

• Accessible via a client or (e.g.) a cURL command.

• Accepts deposits via METS packages by default

• Requires an administrative eperson account

SWORD – accessing via cURL command

• A cURL executable is provided at W:\Development\curl-7.37.0-win32\

• Copy that directory to your own C:\Development\.

• cURL is a very capable tool for transferring data over many protocols, with and without encryption. Today we are interested only in HTTP.

SWORD – accessing via cURL command – getting the service document

• Clues to the meaning may be found at http://curl.haxx.se/docs/manpage.html

SWORD – accessing via cURL command – Making a deposit

• A long, long command indeed…

• curl

• -i

• --data-binary "@mets-sip-example.zip"

• -H "Content-Disposition: filename=mets-sip-example.zip"

• -H "Content-Type: application/zip"

• -H "X-Packaging: http://purl.org/net/sword-types/METSDSpaceSIP"

• -H "X-No-Op: false"

• -H "X-Verbose: true"

• --user "[email protected]:admin" http://localhost:8080/sword/deposit/123456789/26
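
For those who would rather script the deposit, here is a rough Python equivalent of the cURL command above. It builds the same POST request with the same headers and HTTP Basic credentials, without sending it. The function name sword_deposit_request and the username and password in the usage lines are hypothetical. The header values are taken from the command on the slide.

```python
import base64
from urllib.request import Request


def sword_deposit_request(deposit_url, package_bytes, filename,
                          username, password):
    """Build (but do not send) a SWORD v1 deposit request for a METS package."""
    # curl's --user option sends HTTP Basic credentials; do the same by hand.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    headers = {
        "Content-Disposition": f"filename={filename}",
        "Content-Type": "application/zip",
        "X-Packaging": "http://purl.org/net/sword-types/METSDSpaceSIP",
        "X-No-Op": "false",
        "X-Verbose": "true",
        "Authorization": "Basic " + token.decode("ascii"),
    }
    # --data-binary posts the file body unmodified, so pass raw bytes.
    return Request(deposit_url, data=package_bytes, headers=headers,
                   method="POST")


if __name__ == "__main__":
    # Stand-in bytes; in practice read them from mets-sip-example.zip.
    package = b"PK placeholder"
    req = sword_deposit_request(
        "http://localhost:8080/sword/deposit/123456789/26",
        package, "mets-sip-example.zip",
        "depositor@example.org", "secret")  # hypothetical credentials
    print(req.get_method(), req.full_url)
    # To send it: urllib.request.urlopen(req)
```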

SWORD – accessing via cURL command – Making a deposit

• Find that text in the W:\Development\resources\curl-deposit-notes file.

• In an amusing turn of events, this deposit will fail from most of our localhost machines: behind the scenes, the SWORD server attempts to write a temporary file named after your IP address, which contains colon characters that are illegal in Windows filenames.

• This can be gleaned from the C:\Development\tomcat\logs\localhost.[today].log

• Instead, let’s experiment with deposits to other servers in the room.

SWORD – Bringing up the DSpace Client

• Activate the aspect in xmlui.xconf

• Target repositories are configured in the [dspace-install-dir]\config\modules\sword-client.cfg file

SWORD – Utilizing the DSpace SWORD Client

• At this time, the client serves only to copy existing items to another SWORD-enabled repository.

• To utilize, navigate to the item’s page while logged in as an administrator.

• Let’s try some deposits to localhost and our neighbors.

SWORD – Looking Forward to SWORD v2 in Practice

• SWORD v2 offers the capability to change the content and metadata of previously deposited items

• Java libraries for the client are available, but I have not seen an implemented GUI.

• Using cURL is possible in principle, but looks like a fair amount of heavy lifting.

Batch Imports

• DSpace Simple Archive Format (SAF)

• The DSpace import script

• Adding items

• Replacing items

• Deleting items

• Importing from real sources

• Example: CSV

• Example: MARC XML

DSpace SAF (1/3) - Overview

• The top level directory contains one directory for each item in the batch.

• Each item directory must contain:

• The bitstream files

• A contents manifest, in a file named contents

• A metadata file dublin_core.xml

• Optionally, other metadata files with names like metadata_[schema].xml where [schema] is the schema’s name.

Scott Phillips provides a fine guide at http://www.scottphillips.com/2009/05/howto-dspace-batch-ingest/

DSpace SAF (2/3) – Contents Manifest

• The contents manifest, a file named contents, names each bitstream that will be in the item as well as its disposition:

• Bundle

• Permissions

• Primacy

DSpace SAF (3/3) – Metadata

• The SAF uses a specific XML format for the encoding of Dublin Core style metadata.

• dublin_core.xml

• metadata_[schema].xml where [schema] is another metadata schema in your repository’s registry

• The containing element is dublin_core with a schema attribute.

• The field elements are dcvalue with element and qualifier attributes (and optionally language).
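
As a concrete picture of one item directory, here is a minimal Python sketch that writes the metadata files and the contents manifest. The schema attribute sits on the containing dublin_core element. The helper names, the item directory name, and the sample field values are hypothetical, and the tab-separated bundle:NAME suffix in the manifest is how I understand the SAF convention, so check it against the DSpace documentation.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def write_metadata(item_dir, fields, schema="dc"):
    """Write dublin_core.xml (or metadata_[schema].xml) for one SAF item."""
    root = ET.Element("dublin_core", {"schema": schema})
    for element, qualifier, value in fields:
        attrs = {"element": element, "qualifier": qualifier or "none"}
        ET.SubElement(root, "dcvalue", attrs).text = value
    name = "dublin_core.xml" if schema == "dc" else f"metadata_{schema}.xml"
    path = Path(item_dir) / name
    ET.ElementTree(root).write(str(path), encoding="utf-8",
                               xml_declaration=True)
    return path


def write_contents(item_dir, bitstreams):
    """Write the bitstream files and the 'contents' manifest listing them."""
    lines = []
    for filename, data, bundle in bitstreams:
        (Path(item_dir) / filename).write_bytes(data)
        # One line per bitstream; the bundle is given after a tab.
        lines.append(f"{filename}\tbundle:{bundle}")
    path = Path(item_dir) / "contents"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    batch = Path(tempfile.mkdtemp(prefix="saf-"))
    item = batch / "item_000"  # one directory per item in the batch
    item.mkdir()
    write_metadata(item, [("title", None, "An Example Thesis"),
                          ("contributor", "author", "Doe, Jane")])
    write_metadata(item, [("degree", "level", "Masters")], schema="thesis")
    write_contents(item, [("thesis.pdf", b"%PDF-1.4 placeholder", "ORIGINAL")])
    print(sorted(p.name for p in item.iterdir()))
```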

Example imports…

• Provided are some rough code examples that will parse a CSV metadata file (and associated content files) or a MARC XML file (and associated content files).

• The code examples are in Java and best comprehended in a nicely configured development environment, but we can work with them using jEdit and the command line.

• We will conduct these imports into the repository and consider the advantages and disadvantages of the approach.

An example import: CSV

• Create the import processor application in your C:\Development\SAFCreator directory

• mvn clean package

• Run it with java -jar target\SAFCreator-0.0.1-SNAPSHOT.one-jar.jar

• You will be presented with a Java Swing interface where you can specify a CSV metadata file, a directory for source files, a directory for SAF output, and other details for the batch.

An example import: CSV

• Import the SAF as follows:

• c:\dspace\bin\dspace import -a -e [email protected] -s c:\Development\SAF\test-output -c 123456789/2 -m c:\Development\SAF\test-output\map.map
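
The same script handles the replacing and deleting mentioned in the batch import outline. The Python sketch below assembles the argument list for each mode. The -a, -e, -s, -c, and -m flags appear in the command above, while -r (replace) and -d (delete) are taken from my recollection of the DSpace import options and should be confirmed with the script's --help output. The function name and the email in the usage lines are hypothetical.

```python
def import_command(mode, mapfile, eperson, source=None, collection=None,
                   dspace_bin=r"c:\dspace\bin\dspace"):
    """Return the argument list for an SAF add, replace, or delete."""
    # -a comes from the slide; -r and -d are assumptions to verify with --help.
    flags = {"add": "-a", "replace": "-r", "delete": "-d"}
    if mode not in flags:
        raise ValueError(f"unknown mode: {mode}")
    cmd = [dspace_bin, "import", flags[mode], "-e", eperson]
    if mode in ("add", "replace"):
        if source is None or collection is None:
            raise ValueError(f"{mode} needs a source directory and a collection")
        cmd += ["-s", source, "-c", collection]
    # The map file records which item directory became which handle, so it
    # is what later lets a replace or delete find the items of this batch.
    cmd += ["-m", mapfile]
    return cmd


if __name__ == "__main__":
    mapfile = r"c:\Development\SAF\test-output\map.map"
    print(" ".join(import_command(
        "add", mapfile, "admin@example.org",  # hypothetical eperson email
        source=r"c:\Development\SAF\test-output",
        collection="123456789/2")))
    print(" ".join(import_command("delete", mapfile, "admin@example.org")))
```

Keep the map file from every batch, since replace and delete work from it.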

An example import: MARC XML

• This example may be found in the import/marc directory

• Create the program with

• javac -sourcepath . *.java

• jar cfm xslimporter.jar manifest.mf *.class

• Run with

• java -jar xslimporter.jar

• To see a common import difficulty, attempt an import as we did for the CSV example.

• This will result in some schema-related errors, a very common problem when doing imports.

An example import: MARC XML

• Add the following to a new thesis metadata schema and re-attempt the import.

• degree.name

• degree.level

• degree.discipline

• degree.department

Consider the Import Results

• Idiosyncrasies of certain field values are more apparent in different syntactic contexts.

• Different metadata origins entail different complexities in the processing.

• Importation into a digital repository is a crucial step in the life of a digital resource, as it is a chance to refine metadata, after which it can be easily transmitted via crosswalks.

• However, it is a time when metadata are at risk of loss for lack of care.

Final Thoughts on Content Transmission

• Along with preservation, one of the greatest services provided by digital repositories

• Yet, like preservation, good transmission requires constant work

• Crosswalks must be maintained to standards as well as local practices

• Our means of importing content are constantly improving but face a moving target

• New collection types inevitably require new development work if their ingestion is to be automated