Implementing Archivematica for research data preservation at York and Hull

Jenny Mitcham (Digital Archivist) - University of York

Jisc RDN event - 06 September 2016

What I’m going to cover

This is a presentation in 4 parts:

1. Background to our project
2. Implementing Archivematica
3. The challenges of preserving research data
4. Future plans

Part one: The Filling the Digital Preservation Gap project

Filling the digital preservation gap: Project aim

“…to investigate Archivematica and explore how it might be used to provide digital preservation functionality within a wider infrastructure for Research Data Management.”

Project structure

• Phase 1 – explore: testing, research, thinking; produce a report (3 months)
• Phase 2 – develop: make Archivematica better for RDM, plan implementation; produce a report (4 months)
• Phase 3 – implement: set up proofs of concept at York and Hull and investigate the file format problem further (6 months)

The team

University of Hull:
• Chris Awre – Head of Information Services, Library and Learning Innovation
• Richard Green – Independent Consultant
• Simon Wilson – University Archivist

University of York:
• Julie Allinson – Manager, Digital York
• Jen Mitcham – Digital Archivist

Artefactual Systems

Funded by Jisc (Research Data Spring)

Part two: Implementing Archivematica

What are we trying to achieve?

Demonstrate that it is possible to:
• pull metadata from PURE / pull content from Box
• capture further data to help us manage the dataset
• automatically initiate ingest by Archivematica
• set up Archivematica to package the data up for longer term preservation (automatically)
• provide a dissemination copy of the data for our Hydra repository

...basically what we said in our implementation plans

In addition…

• Keep an eye on the broader picture
– How can preservation processes for research data be used for other materials, e.g. archives?
• Consider different use cases for research data organisation on deposit
– Single file, multiple files, hierarchical files, etc.
– With or without associated metadata
• Share experiences across two institutions with different environments

How did we approach it?

We wanted to work in a way that:
• was useful to others
• was open and accessible
• had the bigger picture in mind

So we are:
• sharing code on GitHub
• working in Google Docs
• engaging the Hydra and Archivematica communities
• blogging and talking at events like this

What does it look like? York

What does it look like? Hull

What were the challenges?

• mostly time!
– recruiting a suitably skilled developer at short notice
– relying on Artefactual Systems, who have their own list of priorities and timescales
– working with the local IT department and its different priorities
• outstanding tasks from phase 2 which needed further development
• integration/APIs (e.g. with PURE and Box)

What worked well?

• Re-using existing code (rather than re-inventing the wheel)
– The puree gem from Lancaster University: this is a way of pulling metadata out of PURE and it saved us a huge amount of work
– Automation tools from Artefactual Systems: a lightweight method of automating transfers within Archivematica. We funded a webinar about this in phase 2 of our project.
• Flexibility and capacity in house to do the work

Part three: The challenges of preserving files that we can’t identify

A quick look at file formats

Research data file formats are:
• Numerous
• Sometimes a bit obscure
• Sometimes very big
• Ever-changing
• Often very new

This means they can be hard to preserve... The first hurdle is that we can’t identify them. If we can’t identify them, how can we carry out preservation activities?

Research data applications in use at York

The NDSA Levels of Digital Preservation:

Level 2 requires you to know what you’ve got ...and levels 3 and 4 build on this

Can we identify our research data?

We ran Droid* over the research data deposited with us over the past year. Out of 3752 individual files:
• only 37% (1382) of the files were identified (with varying degrees of accuracy)
• there were 34 different identified file formats in the sample

* Droid is a free tool from The National Archives that can be used to automatically identify file formats
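As a rough check of the figures above, the identification rate can be recomputed from the two counts given on the slide. This is a minimal sketch: only the totals 3752 and 1382 come from the slide, and the variable names are illustrative.

```python
# Counts taken from the slide; everything else is derived from them.
total_files = 3752
identified = 1382

unidentified = total_files - identified
identified_pct = 100 * identified / total_files

print(f"identified:   {identified} ({identified_pct:.1f}%)")
print(f"unidentified: {unidentified} ({100 - identified_pct:.1f}%)")
```

In other words, roughly 63% of the sample (2370 files) could not be identified at all.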

Identified research data files

Files identified by Droid (listed by file type)

...note that native files of the software in the previous graph of research data applications are not represented

Unidentified research data files

• Files not identified by Droid (listed by file ext)
• 107 different file extensions not identified
– huge number with no extension (help!)
– how do we solve the .dat file problem?
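A breakdown like the one above could be produced along the following lines. This is a minimal sketch, not the project's actual method: it assumes each file is represented as a dictionary with `NAME` and `PUID` keys (column names assumed to match a Droid CSV export), and that an empty `PUID` means the file was not identified. The function name and sample rows are hypothetical.

```python
from collections import Counter
from pathlib import PurePath


def tally_unidentified(rows):
    """Count files with no PUID, grouped by lower-cased file extension.

    `rows` is an iterable of dicts with 'NAME' and 'PUID' keys
    (assumed column names; adjust to match the actual export).
    """
    counts = Counter()
    for row in rows:
        if (row.get("PUID") or "").strip():
            continue  # file was identified, so skip it
        ext = PurePath(row["NAME"]).suffix.lower() or "(no extension)"
        counts[ext] += 1
    return counts


# Made-up sample rows for illustration only.
sample = [
    {"NAME": "results.dat", "PUID": ""},
    {"NAME": "run2.DAT", "PUID": ""},
    {"NAME": "README", "PUID": ""},
    {"NAME": "paper.pdf", "PUID": "fmt/18"},
]

for ext, n in tally_unidentified(sample).most_common():
    print(ext, n)
```

With a real export, the rows could be read with `csv.DictReader` and passed straight to the function. Grouping extensions case-insensitively and giving files with no extension their own bucket makes the ".dat" and "no extension" problems easy to see.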

What is the project doing to solve the file identification problem?

• We have sponsored the development of 8 new file format signature records in PRONOM for different types of research data
• We have created our own research data file signatures for inclusion in PRONOM (and blogged about it to encourage others to do the same)
• We have been talking to TNA about how to engage the community more

Part four: Future plans

Future plans

• We have a week left to finish our active project work (eeeek!)
• ...and look out for our phase 3 report in mid October (and other dissemination outputs)
• We need to work out how to move from ‘proof of concept’ to production
– York will be establishing how to move seamlessly from this project into the Jisc Shared Service
– Hull will be using the work to inform a City of Culture digital archive

Where to find out more

Do talk to me if you are interested in finding out more about this project

Useful links:
• Project website: http://www.york.ac.uk/borthwick/archivematica
• Digital archiving blog: http://digital-archiving.blogspot.co.uk/
• Archivematica: https://www.archivematica.org/en/
• Phase 1 report: http://dx.doi.org/10.6084/m9.figshare.1481170
• Phase 2 report: https://dx.doi.org/10.6084/m9.figshare.2073220
