Implementing Archivematica for research data preservation at York and Hull
Jenny Mitcham (Digital Archivist) - University of York
Jisc RDN event - 06 September 2016
What I’m going to cover
This is a presentation in 4 parts:
1. Background to our project
2. Implementing Archivematica
3. The challenges of preserving research data
4. Future plans
Part one: The Filling the Digital Preservation Gap project
Filling the digital preservation gap: Project aim
“…to investigate Archivematica and explore how it might be used to provide digital preservation functionality within a wider infrastructure for Research Data Management.”
Project structure
• Phase 1 – explore: testing, research, thinking; produce a report (3 months)
• Phase 2 – develop: make Archivematica better for RDM, plan implementation; produce a report (4 months)
• Phase 3 – implement: set up proofs of concept at York and Hull and investigate the file format problem further (6 months)
The team
University of Hull:
• Chris Awre – Head of Information Services, Library and Learning Innovation
• Richard Green – Independent Consultant
• Simon Wilson – University Archivist
University of York:
• Julie Allinson – Manager, Digital York
• Jen Mitcham – Digital Archivist
Artefactual Systems
Funded by Jisc (Research Data Spring)
Part two: Implementing Archivematica
What are we trying to achieve?
Demonstrate that it is possible to:
• pull metadata from PURE / pull content from Box
• capture further data to help us manage the dataset
• automatically initiate ingest by Archivematica
• set up Archivematica to package the data up for longer term preservation (automatically)
• provide a dissemination copy of the data for our Hydra repository
...basically what we said in our implementation plans
In addition…
• Keep an eye on the broader picture
– How can preservation processes for research data be used for other materials e.g., archives
• Consider different use cases for research data organisation on deposit
– Single file, multiple files, hierarchical files, etc.
– With or without associated metadata
• Share experiences across two institutions with different environments
How did we approach it?
We wanted to work in a way that:
• was useful to others
• was open and accessible
• had the bigger picture in mind
So we are:
• sharing code on GitHub
• working in Google Docs
• engaging Hydra and Archivematica communities
• blogging and talking at events like this
What does it look like? York
What does it look like? Hull
What were the challenges?
• mostly time!
– recruiting a suitably skilled developer at short notice
– relying on Artefactual Systems who have their own list of priorities and timescales
– working with local IT department and different priorities
• outstanding tasks from phase 2 which needed further development
• integration/APIs (e.g. with PURE and Box)
What worked well?
• Re-using existing code (rather than re-inventing the wheel)
– The puree gem from Lancaster University: this is a way of pulling metadata out of PURE and it saved us a huge amount of work
– Automation tools from Artefactual Systems: a lightweight method of automating transfers within Archivematica. We funded a webinar about this in phase 2 of our project.
• Flexibility and capacity in house to do the work
Part three: The challenges of preserving files that we can’t identify
A quick look at file formats
Research data file formats are:
• Numerous
• Sometimes a bit obscure
• Sometimes very big
• Ever-changing
• Often very new
This means they can be hard to preserve... The first hurdle is that we can’t identify them. If we can’t identify them how can we carry out preservation activities?
Research data applications in use at York
The NDSA Levels of Digital Preservation:
Level 2 requires you to know what you’ve got ...and levels 3 and 4 build on this
Can we identify our research data?
We ran Droid* over the research data deposited with us over the past year. Out of 3752 individual files:
• only 37% (1382) of the files were identified (with varying degrees of accuracy)
• there were 34 different identified file formats in the sample
* Droid is a free tool from The National Archives that can be used to automatically identify file formats
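Figures like the ones above can be derived from an identification tool's export with a few lines of scripting. The following is a minimal sketch, not the project's own method. It assumes a Droid-style CSV export in which each row has TYPE and PUID columns and an empty PUID means the file was not identified. The sample rows are made up for illustration.

```python
import csv
import io

# Made-up sample in the shape of a Droid CSV export (columns trimmed for brevity).
SAMPLE_EXPORT = """NAME,TYPE,EXT,PUID,FORMAT_NAME
results.csv,File,csv,x-fmt/18,Comma Separated Values
paper.pdf,File,pdf,fmt/18,Acrobat PDF 1.4 - Portable Document Format
run01.dat,File,dat,,
notes,File,,,
raw,Folder,,,
"""


def identification_summary(export_text):
    """Return (total files, identified files, distinct PUIDs) for a Droid-style export."""
    # Folders are listed in the export too, so leave them out of the file count.
    rows = [r for r in csv.DictReader(io.StringIO(export_text)) if r["TYPE"] != "Folder"]
    # Assumption: an empty PUID column means no format was identified.
    identified = [r for r in rows if r["PUID"].strip()]
    puids = {r["PUID"].strip() for r in identified}
    return len(rows), len(identified), len(puids)


total, identified, formats = identification_summary(SAMPLE_EXPORT)
print(f"{identified} of {total} files identified ({identified / total:.0%}), {formats} formats")

# The figures quoted on the slide: 1382 identified out of 3752 files.
print(f"{1382 / 3752:.1%}")  # 36.8%, which the slide rounds to 37%
```

For a real export, the text of the CSV file would be passed in instead of SAMPLE_EXPORT.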
Identified research data files
Files identified by Droid (listed by file type)
...note that native files of the software in the previous graph of research data applications are not represented
Unidentified research data files
• Files not identified by Droid (listed by file ext)
• 107 different file extensions not identified
– huge number with no extension (help!)
– how do we solve the .dat file problem?
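A breakdown of unidentified files by extension can be sketched as below. The file names are hypothetical and the "(none)" bucket label is an assumption for this sketch; the point is only that files with no extension need their own bucket and that extensions should be compared case-insensitively.

```python
from collections import Counter
from pathlib import PurePath

# Hypothetical names of files for which an identification tool returned no format.
unidentified = ["run01.dat", "run02.DAT", "spectrum.dat", "README", "model.xyz", "notes"]


def tally_extensions(names):
    """Count files per lower-cased extension; files without one go under '(none)'."""
    counts = Counter()
    for name in names:
        ext = PurePath(name).suffix.lower()
        counts[ext or "(none)"] += 1
    return counts


counts = tally_extensions(unidentified)
for ext, n in counts.most_common():
    print(f"{ext:8} {n}")
```

A tally like this shows which extensions would be the most rewarding targets for new signatures, although an extension such as .dat can cover many unrelated formats.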
What is the project doing to solve the file identification problem?
• We have sponsored the development of 8 new file format signature records in PRONOM for different types of research data
• We have created our own research data file signatures for inclusion in PRONOM (and blogged about it to encourage others to do the same)
• We have been talking to TNA about how to engage the community more
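To illustrate what a file format signature does, here is a deliberately simplified sketch: a format is recognised when a known byte sequence appears at a known offset in the file. The table layout and the identify function are assumptions made for this example. Real PRONOM signatures are richer than this, and the sketch does not reproduce how Droid matches them.

```python
# Illustrative signature table: (format label, byte offset, byte sequence).
# The PDF and PNG magic bytes are well known; the table layout is an assumption.
SIGNATURES = [
    ("PDF", 0, b"%PDF-"),
    ("PNG", 0, b"\x89PNG\r\n\x1a\n"),
]


def identify(data):
    """Return the label of the first signature found at its offset, else None."""
    for label, offset, magic in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return None


print(identify(b"%PDF-1.4\n..."))     # starts with the PDF magic bytes
print(identify(b"12.5 13.1 14.0\n"))  # plain numbers: nothing distinctive to match
```

The second call hints at why some research data is hard to identify: a file of bare numeric output has no distinctive bytes for a signature to match.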
Part four: Future plans
Future plans
• We have a week left to finish our active project work (eeeek!)
• ...and look out for our phase 3 report in mid October (and other dissemination outputs)
• We need to work out how to move from ‘proof of concept’ to production
– York will be establishing how to move seamlessly from this project into the Jisc Shared Service
– Hull will be using the work to inform a City of Culture digital archive
Where to find out more
Do talk to me if you are interested in finding out more about this project
Useful links:
Project website: http://www.york.ac.uk/borthwick/archivematica
Digital archiving blog: http://digital-archiving.blogspot.co.uk/
Archivematica: https://www.archivematica.org/en/
Phase 1 report: http://dx.doi.org/10.6084/m9.figshare.1481170
Phase 2 report: https://dx.doi.org/10.6084/m9.figshare.2073220