Funded by:

© AHDS

Sherpa DP – a Technical Architecture for a Disaggregated Preservation Service

Mark Hedges

Arts and Humanities Data Service

King’s College London

Digital repositories: Dealing with the digital deluge, Manchester, 5 June 2007

SHERPA DP Project

Development Partners: AHDS at King’s College London (Lead), Nottingham, Glasgow, Edinburgh, White Rose Consortium, London Leap Consortium

Objective: To create a shared, distributed preservation environment for the SHERPA project framed around the OAIS Reference Model.

Notes:

• Participating repositories all based on DSpace or EPrints.
• Relatively simple data objects (eprints).

Distributed OAIS Model

[Figure: OAIS functional entities shown for each of the two parties. Institutional Repository (Content Provider): Producer, Consumer, Ingest, Data Management, Archival Storage, Access, Administration; packages shown: SIP, DIP. Preservation Service (Service Provider): Ingest, Data Management, Archival Storage, Access, Administration, Preservation Planning; packages shown: SIP, AIP, DIP.]

Distributed Workflow

[Figure: flowchart divided between Content Provider (Institutional Repository) and Service Provider (Preservation Service). Step labels: Submit data & metadata; Request resubmission; Enhance metadata; Copy SIP to repository store; Migrate to dissemination format; Transfer DIP to storage; Make available in catalogue; Researcher (Consumer) accesses data; Metadata transfer; Data transfer; Create technical metadata; Generate AIP; Transfer AIP to preservation store; Risk assessment; Resolve issues; Implement preservation strategy; Schedule obsolescence monitoring; Generate replacement DIP; Record details of migration action. Decision labels (Yes/No): Validation successful; Is metadata complete?; E-print in appropriate deposit format; Issues identified / No problems identified; Format considered at-risk / No obsolescence problems identified; Format at-risk.]
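The decision points in the flowchart can be sketched in a few lines of Python. This is only an illustration: the ordering of steps is inferred from the flowchart labels, and `Submission` and `run_ingest` are hypothetical names that do not come from the project.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    """Hypothetical stand-in for a submitted e-print and its checks."""
    data_valid: bool
    metadata_complete: bool
    format_at_risk: bool
    log: list = field(default_factory=list)


def run_ingest(sub: Submission) -> str:
    """Walk one submission through the main decisions (assumed ordering)."""
    if not sub.data_valid:
        # Validation failed: hand the package back to the content provider.
        sub.log.append("request resubmission")
        return "rejected"
    if not sub.metadata_complete:
        sub.log.append("enhance metadata")
    sub.log.append("create technical metadata")
    sub.log.append("generate AIP")
    sub.log.append("transfer AIP to preservation store")
    if sub.format_at_risk:
        # Risk assessment flagged the format: act and document the action.
        sub.log.append("implement preservation strategy")
        sub.log.append("generate replacement DIP")
        sub.log.append("record details of migration action")
    sub.log.append("schedule obsolescence monitoring")
    return "preserved"
```

In the architecture described here, each logged step would be a call to a web service or a human task rather than a log entry.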

System Architecture

[Figure: component diagram. Labels: Web Interface; Sherpa DP Services (Ingest Services, Post-Ingest Services, Enquiry Services); Fedora Services (Ingest Functions, Post-Ingest Functions, Enquiries); Fedora Generic Search; Fedora Core Repository Service; Relational DB; local file system; Externally Referenced Content; External services, e.g. DROID. Connections are labelled HTTP, SOAP/HTTP and REST.]

Key preservation actions at ingest

• Integrity/fixity checks.
• File format identification.
• Preservation metadata creation.
• Implement preservation strategy.
• File format normalisation.
• Others …
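As an illustration of the first action in the list, a fixity check compares a freshly computed checksum with the one recorded at deposit. The sketch below uses Python's standard `hashlib`; the function names and the choice of SHA-256 are assumptions, not the project's implementation.

```python
import hashlib


def compute_checksum(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of data for the named hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def verify_fixity(data: bytes, expected_hex: str, algorithm: str = "sha256") -> bool:
    """True if the data still matches the checksum recorded at deposit."""
    return compute_checksum(data, algorithm) == expected_hex.lower()
```

A mismatch would normally be recorded as a failed event in the audit trail rather than silently ignored.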

Requirements

• Scalability: need to handle increasingly large quantities of data
• Generation and management of extensive set of preservation metadata
• Audit trail/provenance metadata: knowledge held in explicit machine-processable form

More Requirements

• Distributed architecture
• Integration of specialised tools
• Follow standards to allow flexible integration of future tools
• Automate workflow where possible, but also allow human interaction

Approach

• Web services encapsulating preservation actions
• Web interface for points in the process where human input required
• Linked by workflow management tool

Workflow management

• Large number of tools available
– Taverna
– BPEL (Active BPEL)
– jBPM
– others …
• Settled on jBPM

jBPM

• Web services and UI functions chained together to form a workflow or “Business Process”
• Open source, flexible, extensible workflow management system
• Bridges the gap between users and developers by giving them a common language
• Packaged as a J2EE application: can run on Java application servers and servlet containers such as JBoss or Tomcat

Preservation Metadata

• Approach based on PREMIS data dictionary
• PREMIS data model based on five categories: intellectual entities, objects, agents, events, rights
• Implementing a subset of this model
• … with some format-specific extensions (e.g. MIX for images)
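To make the "events" category concrete, the sketch below builds a simplified event record with Python's `xml.etree.ElementTree`. The element names follow PREMIS semantic units, but the record carries no namespace and is not schema-validated; `build_event` and the `local` identifier type are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def build_event(event_id, event_type, outcome, object_id, when=None):
    """Build a simplified, namespace-free PREMIS-style event element."""
    when = when or datetime.now(timezone.utc)
    event = ET.Element("event")
    ident = ET.SubElement(event, "eventIdentifier")
    ET.SubElement(ident, "eventIdentifierType").text = "local"  # assumed type
    ET.SubElement(ident, "eventIdentifierValue").text = event_id
    ET.SubElement(event, "eventType").text = event_type
    ET.SubElement(event, "eventDateTime").text = when.isoformat()
    info = ET.SubElement(event, "eventOutcomeInformation")
    ET.SubElement(info, "eventOutcome").text = outcome
    # Link the event to the object it acted on, giving the audit trail.
    link = ET.SubElement(event, "linkingObjectIdentifier")
    ET.SubElement(link, "linkingObjectIdentifierType").text = "local"
    ET.SubElement(link, "linkingObjectIdentifierValue").text = object_id
    return event
```

Calling `ET.tostring(build_event("ev-1", "fixity check", "pass", "obj-1"))` yields the serialised record, which could then be embedded in a package alongside the object metadata.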

Available Tools

• Stand-alone specialised tools that perform preservation-related tasks
• File format identification, e.g. DROID-PRONOM
– Developed by The National Archives
– Identification of file formats based on their file signatures
• Technical metadata generation, e.g. JHOVE
– Extensible framework for format validation
– Perform format-specific identification, validation, and characterization of a digital object
• File format migration tools (e.g. XENA, Open Office)
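Signature-based identification, as mentioned for DROID above, can be illustrated with a toy lookup of leading "magic" bytes. This is not DROID's algorithm or the PRONOM registry: the table holds only a handful of well-known signatures, and `identify_format` is a hypothetical name.

```python
# (leading bytes, label) pairs for a few well-known formats.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify_format(data: bytes) -> str:
    """Return a format label based on the first bytes, or 'unknown'."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"
```

Real signature files also describe offsets, trailing sequences and version differences, which is why a maintained registry is preferable to a hand-written table.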

Available tools and workflow

• Tools written in different languages
• Define generic interfaces for preservation actions
• Wrap the tools used as web services to promote:
– Interoperability
– Loose coupling, flexibility
– Reusability
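The idea of a generic interface can be sketched as follows. In the architecture described here the wrapped tools are web services; in this sketch plain in-process classes stand in for them, and `PreservationAction`, `ByteCount`, `Utf8Check` and `run_actions` are all hypothetical names.

```python
from abc import ABC, abstractmethod


class PreservationAction(ABC):
    """Generic interface: every wrapped tool takes bytes, returns a report."""
    name = "unnamed"

    @abstractmethod
    def run(self, data: bytes) -> dict:
        ...


class ByteCount(PreservationAction):
    name = "byte-count"

    def run(self, data: bytes) -> dict:
        return {"bytes": len(data)}


class Utf8Check(PreservationAction):
    name = "utf8-check"

    def run(self, data: bytes) -> dict:
        try:
            data.decode("utf-8")
            return {"valid": True}
        except UnicodeDecodeError:
            return {"valid": False}


def run_actions(actions, data: bytes) -> dict:
    """Apply each action in turn; the caller never sees tool internals."""
    return {action.name: action.run(data) for action in actions}
```

Because the caller depends only on `run`, a tool can be replaced or reimplemented in another language without changing the workflow that invokes it.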

Workflow in jBPM

jBPM (jPDL)

Node and ActionHandler

Workflow Inputs & Outputs

[Figure: the workflow shown with a Submission Information Package (SIP), an Archival Information Package (AIP) and Dissemination Information Packages (DIPs).]

Workflow Outputs

• Multiple METS packages (atomic model), each containing (some of):
– Data
– Descriptive metadata
– PREMIS object metadata (technical)
– PREMIS event metadata
– PREMIS relationship metadata
– Format-specific technical metadata (e.g. MIX)
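A package of this kind might be laid out as in the skeleton below, built with Python's `xml.etree.ElementTree`. Placing PREMIS object metadata in `techMD` and event metadata in `digiprovMD` is a common convention assumed here, not something the slides specify; the IDs and `build_package` are hypothetical, the metadata wrappers are left empty, and the output is not schema-validated.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def md_section(parent, tag, sec_id, mdtype):
    """Add a metadata section with an empty xmlData wrapper and return it."""
    sec = ET.SubElement(parent, q(tag), {"ID": sec_id})
    wrap = ET.SubElement(sec, q("mdWrap"), {"MDTYPE": mdtype})
    return ET.SubElement(wrap, q("xmlData"))


def build_package(object_id: str, file_name: str):
    mets = ET.Element(q("mets"), {"OBJID": object_id})
    md_section(mets, "dmdSec", "dmd1", "DC")          # descriptive metadata
    amd = ET.SubElement(mets, q("amdSec"), {"ID": "amd1"})
    md_section(amd, "techMD", "tech1", "PREMIS")      # object metadata
    md_section(amd, "digiprovMD", "prov1", "PREMIS")  # event metadata
    grp = ET.SubElement(ET.SubElement(mets, q("fileSec")), q("fileGrp"))
    f = ET.SubElement(grp, q("file"), {"ID": "file1", "ADMID": "tech1 prov1"})
    ET.SubElement(f, q("FLocat"),
                  {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": file_name})
    div = ET.SubElement(ET.SubElement(mets, q("structMap")), q("div"))
    ET.SubElement(div, q("fptr"), {"FILEID": "file1"})
    return mets
```

Format-specific metadata such as MIX would go in a further `techMD` section next to the PREMIS object metadata.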

Fedora object model

[Figure: object graph. Nodes: e-print; original manifestation; normalised manifestation; original file 1; original file 2; migrated file 1; updated version of e-print. Relationship labels: hasManifestation, hasPart, hasVersion, isDerivedFrom.]
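The object model can be read as a small set of subject, predicate, object triples. In the sketch below the node and relationship names come from the diagram, but the exact edges are an assumption reconstructed from the labels, and `objects` and `files_of` are hypothetical helpers.

```python
# Assumed edges; only the node and relationship names are from the diagram.
TRIPLES = [
    ("e-print", "hasManifestation", "original manifestation"),
    ("e-print", "hasManifestation", "normalised manifestation"),
    ("original manifestation", "hasPart", "original file 1"),
    ("original manifestation", "hasPart", "original file 2"),
    ("normalised manifestation", "hasPart", "migrated file 1"),
    ("migrated file 1", "isDerivedFrom", "original file 1"),
    ("e-print", "hasVersion", "updated version of e-print"),
]


def objects(subject, predicate, triples=TRIPLES):
    """All objects linked from subject by predicate, in insertion order."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def files_of(eprint, triples=TRIPLES):
    """Every file reachable through any manifestation of the e-print."""
    found = []
    for manifestation in objects(eprint, "hasManifestation", triples):
        found.extend(objects(manifestation, "hasPart", triples))
    return found
```

Keeping the relationships explicit lets a migrated file be traced back to its source, which supports the provenance requirement stated earlier.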

Issues with automation

• Preserving content – what do we actually want to preserve?
• Significant properties – soft concept, hard to quantify (INSPECT)
• Lack of suitable tools – expensive, outputs unreliable

Next Steps

• SHERPA DP 2 (2007-2008), looking at:
– Additional repository types
– More complex object types
– Different methods of data transfer
• Generalise system
• Add post-ingest preservation actions
• Add semantics for dynamic service discovery
• Resource discovery metadata generation

Questions

Contact: [email protected]