the adore federation architecture

33
The aDORe Federation Architecture Herbert Van de Sompel, Ryan Chute, Luydimilla Balakireva Open Repositories 2008, University of Southampton, UK, April 1-4 2008 The aDORe Federation Architecture Herbert Van de Sompel (1) , Ryan Chute (2) , Luydimilla Balakireva (3) Digital Library Research & Prototyping Team Research Library Los Alamos National Laboratory (1) [email protected] http://public.lanl.gov/herbertv/ (2) [email protected] (3) [email protected] Acknowledgments: Jeroen Bekaert, Patrick Hochstenbach, Henry Jerez, Xiaoming Liu

Upload: herbert-van-de-sompel

Post on 08-May-2015

2.011 views

Category:

Education


2 download

TRANSCRIPT

Page 1: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

The aDORe Federation Architecture

Herbert Van de Sompel (1), Ryan Chute (2), Luydimilla Balakireva (3)

Digital Library Research & Prototyping TeamResearch Library

Los Alamos National Laboratory

(1)[email protected]://public.lanl.gov/herbertv/

(2) [email protected](3) [email protected]

Acknowledgments:Jeroen Bekaert, Patrick Hochstenbach, Henry Jerez, Xiaoming Liu

Page 2: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

The Fedora Adoration Architecture?

(the mandatory joke at the start of a presentation)

Page 3: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

Presentation Based on the Paper

Herbert Van de Sompel, Ryan Chute, Patrick Hochstenbach.The aDORe Federation: Digital Repositories at Scale. 40pages. International Journal on Digital Libraries.Special Issue on Very Large Digital Libraries. InPublication, 2008.

Preprint available at http://arxiv.org/abs/0803.4511

Page 4: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

The aDORe Project: Background

• Initial motivation:o Severe deficiencies in the new information discovery

environment developed for the LANL Research Library:- Metadata-centric: descriptive metadata records first class citizens;

actual digital assets auxiliary data.- Tens of millions of digital assets stored as files in file system.- Tight integration between content collection and discovery application,

preventing other applications from leveraging the rich content base.o Obvious solution:

- Replace metadata-centric approach by compound object approach.- Bundle digital assets into storage containers that dramatically reduce

the amount of files in a file system.- Cleanly separate storage repository from applications that leverage the

stored assets by providing necessary machine interfaces.• Implementation of the obvious solution led to the aDORe R&D

project 2003-2007

Page 5: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

The aDORe Project: Major Drivers

• Concrete need to design and implement a solution to ingest,store, access the vast and growing collection of the LANLResearch Library.

o Scale, scale, scale!o Existing open source solutions (at that time) did not meet our

scale requirements- e.g. static binding of disseminators to objects in Fedora.

• Interest in repository interoperability, cf. involvement in OAI-PMH, NISO OpenURL, OAI-ORE

• Interest in digital preservation, cf. NDIIP funding

Page 6: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

The aDORe Project: Major Design Constraints

• Leverage existing standards and technologies to makedevelopment and migration more straightforward.

o Read: Laziness as a strategy• Use a distributed, component based approach to meet

challenges of scale.

Page 7: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

The aDORe Project: Some of the Results

• Conceptual:o aDORe Federation architecture: A high-level, 3-Tier architecture

for the federation of distributed repositories.

• Concrete:o The aDORe Archive storage solution (XMLtapes/ARCfiles) -

Tier-1 of the aDORe Federation architecture.o The aDORe Federation software - an implementation of the 3-

Tier architecture, with the aDORe Archive in Tier-1.

• And a lot more, but that’s not for today.

Page 8: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

The aDORe Federation software

• Available today at:

http://african.lanl.gov/aDORe/aDOReFederation

• Thanks to Ryan Chute & Luydimilla Balakireva• This is a major update to the aDORe Archive:

o Updates the Tier-1 aDORe Archiveo Implements the 3 Tiers of the architecture instead of only Tier-1

• In production at LANL Research Library for over 1 year

• Attractive for large collections of relatively stable objects• Could be used as a plug-in storage component for IR solutions

Page 9: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

The aDORe Archive @ LANL, March 31 2008

• 87,000,000 Compound Digital Objects• 208,000,000 Stored bitstreams• ~ 9,200 autonomous repositories:

o ~ 4,000 XMLtapes: XML-based serializations of DigitalObjects

o ~ 5,200 ARCfiles: bitstreams• > 500,000,000 identifiers

• More about the aDORe Federation software later

Page 10: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

The aDORe Federation Architecture: Goal

• Facilitate a uniform manner for client applications to discover andaccess content objects available in a group of distributedrepositories.

• Single repository behavior for a group of distributed repositories.

• Note that these distributed repositories can very well be “hidden”and that only the federated result is made “public”.

o Cf. Peter Murray-Rust departmental repositories

• Not about uniform approaches to add, update, delete objects inrepositories.

o Considered the responsibility of individual repositories.o However, changes are made apparent to the federation.

Page 11: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

aDORe Federation Architecture

Page 12: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

Basic Design Choices

• All entities in the architecture are identified by means of URIs:content objects, repositories, machine interfaces.

o Turns entities into uniquely identified Web resourceso Avoids unwanted identifier collisions, for example, for different

content objects from various repositories.• Depending on use case, certain Content Objects are identified

by either:o Protocol-based URIso Non-protocol-based URIs

• All machine-interfaces are (HTTP) protocol-based

Page 13: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

Content Objects

• 3 types of Content Objects:o Digital Objects,o Datastreams,o Surrogates

• These types do not need to be natively embraced byrepositories; they are supported at federation-facing machineinterfaces. They are abstractions.

• Other properties of Content Objects may be expressed, but thearchitecture only serves to convey them.

• Core enabling properties of the architecture:o identification,o location,o time-stampof these Content Objects

Page 14: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

Content Objects: Digital Object

• Cf. Kahn-Wilensky, cf. most repository systems• A Digital Object is an identified aggregation of one or more

Datastreams and properties pertaining to the Datastreams and tothe aggregation.

• A Digital Object is the perspective of a repository’s nativecompound object that is shared with the federation.

Page 15: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

Content Objects: Digital Object

• Identification: DO-URIo Inherited from other environment and/or minted by repository.

- Cf Internet Archive HTTP URIs of stored objects; identifiers assigned toscanned images

o One or more per Digital Objecto Digital Objects with same DO-URI may exist in multiple repositories

- Cf paper with same DOI in multiple IRs; HTTP URI in Internet Archiveo Protocol-based, non-protocol-based

- http://some.repo.org/do/1234- info:some-repo/do/1234

o Always treated as non-protocol-based, i.e. never resolved (in thefederation) using its native resolution protocol, but rather conveyed asparameter in request against the federation’s machine interfaces.

- Cf Internet Archive.• Time-stamping:

o Digital Object change over timeo Changes communicated to federation via Surrogates

Page 16: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

Content Objects: Surrogates

• A Surrogate is the serialization of a Digital Object into a machine-readable representation that is made accessible by a repository.e.g. DIDL, METS, ORE Atom or RDF/XML.

• Surrogates are the vehicles repositories use to keep the federationinformed about the availability of their Digital Objects and aboutchanges those Digital Objects undergo.

• One or more Surrogates can correspond with a Digital Object in afederation:

o Digital Object with same URI may exist in mutliple repositorieso Single repository may have multiple Surrogates for a Digital Object

• Minimally expresses:o DO-URI of the DO it serializes,o Datastream-URI/URLs of constituent Datastreams,o Identifier of the Surrogate itself.

Page 17: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

Content Objects: Surrogates

• Identification: Identifier minted by the repository that makes theSurrogate available.

o Protocol-based: Surrogate-URL (native resolution)o Non-protocol-based: Surrogate-URI (resolution via interfaces)

• Time-stamping: Surrogate-datetime changes when a change to aDigital Object needs to be communicated to the federation.Minimally when constituency changes.

• Update Policies:o New Surrogate Policy:

- Change to Digital Object => New Surrogate, new Surrogate-URI/URL, new Surrogate-datetime

- Previous Surrogate remains availableo Update Surrogate Policy

- Change to Digital Object => Update Surrogate, same Surrogate-URI/URL, new Surrogate-datetime

- Previous Surrogate no longer available

Page 18: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

Content Objects: Datastreams

• A Datastream is a retrievable bitstream of whichever media typemade available by a repository to the federation.

• It is a perspective of a repository’s native bitstream that is sharedwith an aDORe federation.

• Can be, e.g.:o Dissemination of locally stored bitstream,o Dissemination of externally stored bitstream,o Result of applying a service to a (local or external) bitstream.

• A Datastream can be part of multiple Digital Objects, but there isone repository in the federation that owns/serves it.

Page 19: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

Content Objects: Datastreams

• Identification: Identifier minted by the repository that makes theDatastream available.

o Protocol-based: Datastream-URL (native resolution)o Non-protocol-based: Datastream-URI (resolution via interfaces)

• Time-stamping: Datastream-datetime changes when a change toa Datastream needs to be communicated to the federation.

• Update Policies:o New Datastream Policy:

- Update of retrievable bitstream => new Datastream, newDatastream-URI/URL, new Datastream-datetime

- Cf. digital preservation: migration of a JPEG image (URI-1)leads to JPEG-2000 (URI-2), original JPEG maintained.

o Update Datastream Policy- Update of retrievable bitstream => same Datastream, same

Datastream-URI/URL, new Datastream-datetime- Original bitstream no longer available

Page 20: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

aDORe Federation Architecture: Tier-1

Page 21: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

Tier-1: Surrogate and (sometimes) Datastream Repositories

• Surrogate Repositories, Datastream Repositories as well as theirInterfaces identified by URI

• Interfaces leverage identification, time-stamping of Content Objects• Datastream Repository only when using (non-protocol-based)

Datastream-URIs

Page 22: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

aDORe Federation Architecture: Tier-1

Page 23: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

aDORe Federation Architecture: Tier-1

Page 24: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

Tier-1: Locate Surrogates

• Use: Repositories that have multiple Surrogates for a givenDigital Object, or that have Digital Object that shareDatastreams.

• http://some.repo.org/openurl?url_ver=z39.88-2004&rft_id=http://some.repo.org/ds/5678&svc_id=info:ourfederation/svc/LocateSurrogates.

Page 25: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

Tier-1: Obtain Datastream

• Use: For repositories that use Datastream-URIs, notDatastream-URLs

• http://some.repo.org/openurl?url_ver=z39.88-2004&rft_id=info:some-repo/ds/5678&svc_id=info:ourfederation/svc/ObtainDatastream.

Page 26: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

aDORe Federation Architecture: Tier-2

Page 27: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

Tier-2: Service Registry

• Keeps track of all components in the federation. In essence 2look-up tables.

• Look-up table 1:o URI of component (e.g. Repository-URI)o Matching Interface-URIs (and Interface type)

• Table 2o Interface-URIo Interface-URL

• Exposes minimally 1 interface, Obtain Registry Record• Cf Information Environment Service Registry

Page 28: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

Tier-2: Identifier Locator

• Look-up table:o Identifiers of Content Objects in the federationo Identifiers of Datastream or Surrogate Repositories that make

these Content Objects accessibleo Necessarily will store this information for non-protocol-based

identifiers- Minimally DO-URI (remember: treated as non-protocol-based)

• Populated by recurrently interacting with Harvest Surrogates andHarvest Datastream Identifier interfaces of all Tier-1 repositories.

• Identifier Locator knows about these interfaces via the ServiceRegistry.

• Exposes minimally 1 interface, Locate Repositories:o In: DO-URI, Surrogate-URI, Datastream-URIo Out: List of Repository-URIs

Page 29: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

aDORe Federation Architecture: Tier-3

Page 30: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

aDORe Federation software

Page 31: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

The aDORe Archive, new release

• An updated aDORe Archive Installer containing:o XMLtape Toolkit (updated)o ARCfile Toolkit (updated)o XMLtape Registry (updated)o ARCfile Registry (updated)o XMLtape OpenURL Resolver (new)

- Provides Core Surrogate Services (e.g. Obtain, Locate, HarvestIdentifiers) through a simple and efficient OpenURL Service Interface.

o XMLtape OpenURL XQuery Resolver (new)- Provides a configuration-based solution for complex ad-hoc queries.

Built upon Nux <http://dsd.lbl.gov/nux/>, an open-source Java toolkitthat provides a scalable solution for non-indexed based search of largeXML repositories

Page 32: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

The aDORe Archive, new release

• A new aDORe Federation Installer providing:o An aDORe Archive installationo IESR-based Service Registry (new)

- Uses the Ockham Service Registry IESR-based database schema andprovides OAI-PMH and OpenURL Services.

o Identifier Locator (new)- A fast, in-memory MySQL-based solution used for efficient resolution of

Digital Object, Datastream, and Surrogate URIs to Repository URIs.o OAI-PMH Federator (new)

- Provides access to multiple aDORe Archive installations through acommon OAI-PMH interface.

o OpenURL Disseminator (new)- OpenURL Service interface providing federated access to all repository

content, as well as performs dynamic dissemination services using arule-engine based plug-in framework.

Page 33: The aDORe Federation Architecture

The aDORe Federation ArchitectureHerbert Van de Sompel, Ryan Chute, Luydimilla Balakireva

Open Repositories 2008, University of Southampton, UK, April 1-4 2008

The aDORe Archive, new release

• The documentation will provide:o Detailed presentations of available service interfaces and

available pluggable interfaces.o A sample ingestion of DIDL content to illustrate the various key

service interfaces available in the aDORe Federationo A tutorial showing how to create processing implementations

and the necessary configurationso A public demo version of the aDORe Federation using public

domain content (coming soon)