a centre of expertise in data curation and preservation
IMechE Workshop, London, 26th September 2006
Looking to the longer term: some perspectives on data curation and preservation
This work is licensed under a Creative Commons Licence: Attribution-ShareAlike 2.0
Dr Liz Lyon, DCC Associate Director (Outreach); Director, UKOLN, University of Bath, UK
About UKOLN
• “a centre of expertise in digital information management”
• Funding: Joint Information Systems Committee (JISC) + Museums, Libraries & Archives Council (MLA)
• Portfolio of R&D projects: Delos, DRIVER, Grand Challenge
• 29+ staff based at the University of Bath
• Inform the library, information, education and cultural heritage communities
• Policy, advocacy at national level, build innovative Web-based systems & services, R&D, e-journal Ariadne, workshops and conferences.
• http://www.ukoln.ac.uk/
Acknowledgement: Alex Ball, Grand Challenge Project
UK Digital Curation Centre
• Digital Curation Centre
• Funded by JISC & EPSRC
• Development activities
• Research agenda
• Delivering services
• Outreach Programme
• http://www.dcc.ac.uk/
Overview
• Data curation and digital preservation issues
• Draw on research and scholarship perspectives
• Data / information flows and the “business process”
• UK Digital Curation Centre activities
“maintaining and adding value to a trusted body of digital information for current and future use”
Data-centric 2020 vision
Reference datasets as infrastructure?
(Very simple) Product Research Cycle & Data Curation
Formulate ideas / hypothesis, test, experiment, observe, design: data creation, collection & capture
Adding value: Data linking, annotation, visualisation, simulation
(New) knowledge extraction: data mining, modelling, analysis, synthesis
e-Infrastructure
Open ?? access
Collaboration
Scholarly communications & Business transactions: data disclosure, publication, citation, discovery, re-use
Data management storage & validation: description, deposit, self-archiving, preservation, certification
Data processing
DAME signal processing workflows using Grid Services
Workflow diagram, labels grouped by role:
• Maintenance Engineer: Aircraft Lands; Visual Inspection; Provide Information; Quote Diagnosis; Brief Diagnosis / Prognosis; Check Diagnoses; Maintenance Procedure; Diagnosis Result; Release Engine; complete; Maintenance Result
• Maintenance Analyst (Fleet Manager): Detailed Diagnosis / Prognosis; Provide Further Details; Request Information; Sign-off Diagnosis; Analyst Decision [ information required ] [ diagnosis ]
• Domain Expert: Detailed Analysis [ unknown ]; Request Further Details; Expert Decision [ known ] [ Clear ] [ unknown ] [ information required ] [ diagnosis ]
• Outcomes: [ fault unresolved ] [ fault resolved ]
• Sites: Rolls-Royce, DS&S, Airport
• RepoMMan: Repository Metadata and Management (Hull) using WS-BPEL
• Are your engineering workflows identified and described?
Workflow e-Scientist desktop?
Slide: Carole Goble
Research outputs in institutional repositories: engineering
“JISC Vision”: a global landscape of federated repositories
fusion layer: ‘repository federator’
repositories (five shown)
portals (five shown)
heterogeneous: metadata formats, content formats, identifiers, packaging standards
homogeneous: metadata formats, content formats, identifiers, packaging standards
From Andy Powell: http://www.ukoln.ac.uk/distributed-systems/jisc-ie/arch/presentations/jiie-jcs-2005/
• Multi-disciplinary, cross-sectoral
• National, institutional
• Different platforms
• Many format types: data, eprints, images, geospatial
• e-Framework and Information Environment context
• Define common + domain-specific + repository “services”
• Interoperability based on open standards, software tools
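As a rough illustration of the ‘repository federator’ idea, the sketch below maps records from two repositories with different metadata field names onto one common set of fields before a portal sees them. It assumes the repositories are the heterogeneous side and the portals the homogeneous side. The repository names, field names and the `normalise` and `federate` helpers are all hypothetical; a real federator would harvest records over an agreed protocol and map them to an agreed application profile.

```python
# Hypothetical field mappings: each repository's native field names -> common names.
FIELD_MAPS = {
    "repo_a": {"dc:title": "title", "dc:creator": "creator", "dc:identifier": "identifier"},
    "repo_b": {"name": "title", "author": "creator", "handle": "identifier"},
}


def normalise(source, record):
    """Rename one repository's native fields to the common field names.

    Fields with no entry in the mapping are dropped.
    """
    mapping = FIELD_MAPS[source]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


def federate(harvested):
    """Merge records from many repositories into one homogeneous list."""
    merged = []
    for source, records in harvested.items():
        for record in records:
            common = normalise(source, record)
            common["source"] = source  # keep provenance for the portal
            merged.append(common)
    return merged


# Illustrative records only; not taken from any real repository.
harvested = {
    "repo_a": [{"dc:title": "Engine test data", "dc:creator": "A. Author", "dc:identifier": "a-1"}],
    "repo_b": [{"name": "CAD model set", "author": "B. Author", "handle": "b-7", "internal_flag": True}],
}
records = federate(harvested)
```

After federation every record carries the same four keys (title, creator, identifier, source), so a portal can treat all repositories alike.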
Pilot Engineering Repository Xsearch PerX http://www.engineering.ac.uk/
STEP (ISO 10303)
Interoperability???
Repositories and OAIS Reference Model
“an archive consisting of an organisation of people and systems that has accepted the responsibility to preserve information and make it available for a Designated Community ... an identified group of potential consumers who should be able to understand a particular set of information”
[Diagram: OAIS functional model]
• Functional entities: Ingest, Data Management, Archival Storage, Access, Administration, Preservation Planning
• External roles: PRODUCER, MANAGEMENT, CONSUMER
• Packages and flows: SIP, AIP, DIP, Descriptive Info, queries, result sets, orders
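The package flow in the OAIS model can be sketched in a few lines: a Producer submits a SIP, Ingest turns it into an AIP held in Archival Storage with descriptive information recorded in Data Management, and a Consumer queries, then orders a DIP. This is a toy under stated assumptions, not an OAIS implementation: the class and method names, the `aip-N` identifier scheme and the use of SHA-256 as a fixity value are all illustrative choices.

```python
from dataclasses import dataclass, field
import hashlib


@dataclass
class SIP:
    """Submission Information Package: what the Producer hands over."""
    content: bytes
    description: str


@dataclass
class AIP:
    """Archival Information Package: what the archive stores."""
    aip_id: str
    content: bytes
    description: str
    fixity: str  # SHA-256 of the content (an illustrative choice)


@dataclass
class DIP:
    """Dissemination Information Package: what the Consumer receives."""
    aip_id: str
    content: bytes


@dataclass
class Archive:
    storage: dict = field(default_factory=dict)    # stands in for Archival Storage
    catalogue: dict = field(default_factory=dict)  # stands in for Data Management

    def ingest(self, sip: SIP) -> str:
        """Ingest: build an AIP from a SIP and record its descriptive info."""
        aip_id = f"aip-{len(self.storage) + 1}"  # hypothetical identifier scheme
        fixity = hashlib.sha256(sip.content).hexdigest()
        self.storage[aip_id] = AIP(aip_id, sip.content, sip.description, fixity)
        self.catalogue[aip_id] = sip.description
        return aip_id

    def query(self, term: str) -> list:
        """Access: return a result set of AIP ids whose description matches."""
        return [i for i, d in self.catalogue.items() if term.lower() in d.lower()]

    def order(self, aip_id: str) -> DIP:
        """Access: check fixity, then package the content as a DIP."""
        aip = self.storage[aip_id]
        if hashlib.sha256(aip.content).hexdigest() != aip.fixity:
            raise ValueError("fixity check failed")
        return DIP(aip.aip_id, aip.content)


archive = Archive()
aip_id = archive.ingest(SIP(b"strain gauge readings", "Engine strain test"))
result_set = archive.query("strain")
dip = archive.order(result_set[0])
```

Administration and Preservation Planning are left out; they govern how the archive is run rather than how a single package moves through it.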
Assuring permanence: digital preservation
• Trusted DR Audit Checklist for Certification, Draft, Research Libraries Group-NARA Taskforce 2005
Defined criteria:
– Organisation
– Functions, processes & procedures
– Designated community & usability
– Technologies & technical infrastructure
• Revised Checklist based on feedback and pilot audits (KB, BADC)
• Self-certification: DINI-Zertifikat: requirements & recommendations:
– Server policy / Guidelines
– Author support
– Legal issues
– Authenticity and integrity
– Cataloguing
– Access statistics
– Long-term sustainability
• Has your repository / PLM been audited?
Interdisciplinary discovery
• Validation, publication & discovery of data models & schema
• Harmonisation and normalisation of metadata and semantics
• Packaging standards: METS, MPEG-21 DIDL
• Formal high-level and domain ontologies
• ePrints DC Application Profile http://www.ukoln.ac.uk/repositories/digirep/index/Eprints_Application_Profile
• eBank Application Profile crystallography data http://www.ukoln.ac.uk/projects/ebank-uk/schemas/
• What data models and metadata schema are in place?
Persistent identifiers for data citation
• How will they be used? We need use cases: depositor, author, service provider, researcher, publisher?
• Schemes: DOI, Handle, ARK, PURL
• Global identification: express as http URIs
• Data citation (human and machine-actionable)
• Publication & citation of scientific primary data project, National Library for Science & Technology (TIB), University of Hanover, Germany. STD-DOI Project: DOI registry for datasets http://www.std-doi.de
• Is there a data citation policy?
• What persistent identifiers have been assigned to your data?
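“Express as http URIs” can be made concrete with a small helper that prefixes an identifier with a resolver address, so that the same string serves both a human-readable citation and a machine-actionable link. This is a minimal sketch: the resolver prefixes below are the public ones in general use today and are an assumption here (check each scheme's own guidance), and the `to_http_uri` and `cite` helpers are hypothetical.

```python
from urllib.parse import quote

# Public resolver prefixes (an assumption; confirm against each scheme's guidance).
RESOLVERS = {
    "doi": "https://doi.org/",
    "handle": "https://hdl.handle.net/",
    "ark": "https://n2t.net/",
}


def to_http_uri(scheme: str, identifier: str) -> str:
    """Express a persistent identifier as an http(s) URI."""
    scheme = scheme.lower()
    if scheme == "purl":
        return identifier  # a PURL is already an http URI
    if scheme not in RESOLVERS:
        raise ValueError(f"unknown scheme: {scheme}")
    # Keep '/' and ':' literal; percent-encode anything unsafe in a URI path.
    return RESOLVERS[scheme] + quote(identifier, safe="/:")


def cite(authors: str, year: int, title: str, scheme: str, identifier: str) -> str:
    """Build a human-readable citation ending in a machine-actionable URI."""
    return f"{authors} ({year}). {title}. {to_http_uri(scheme, identifier)}"


# The DOI is the one quoted for the eBank paper in this talk.
print(to_http_uri("doi", "10.1039/b502828k"))  # https://doi.org/10.1039/b502828k
```

The same pattern extends to a dataset citation once a data citation policy fixes which elements (creator, year, title, identifier) are mandatory.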
Discovering data: eBank Project
Coles, S.J., Day, N.E., Murray-Rust, P., Rzepa, H.S., Zhang, Y., Org. Biomol. Chem., 2005, (10),1832-1834. DOI: 10.1039/b502828k
• Domain identifier: International Chemical Identifier (INChI) code
• Google molecule using INChI
Slide from Simon Coles
Domain identifiers for engineering?
Format migration challenges? CAD Program Compatibility Chart http://www.okino.com/conv/filefrmt_cad.htm
Registry development
Development: Representation Information Registry Repository
• “DCC Approach to Digital Curation” based on OAIS
• Representation Information Registry Repository
• Prototype demonstrator: based on 2 key concepts to facilitate sharing of the curation effort
– Curation Persistent Identifier (CPID)
– Descriptive “label” (structural, semantic, other metadata)
• Development of (M2M) tools and interfaces for creating, using and re-using representation information
• http://dev.dcc.ac.uk Wiki and email list
• EU CASPAR Integrated Project http://www.casparpreserves.info/pages/1/index.htm
• Task Force on the Permanent Access to the Records of Science http://tfpa.kb.nl/
Registry API
• Allows applications to talk to many different registry implementations e.g. GDFR, PRONOM, UDDI
• GUI Access and via Web browser http://registry.dcc.ac.uk
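One common way to let an application talk to many registry implementations is a shared interface with one adapter per registry. The sketch below shows that pattern only. It is not the DCC Registry API: `FormatRegistry`, `InMemoryRegistry`, `RegistryClient`, the registry names and the format keys are all hypothetical, and the in-memory adapters stand in for real calls to a remote registry.

```python
from abc import ABC, abstractmethod
from typing import Optional


class FormatRegistry(ABC):
    """Hypothetical common interface that every registry adapter implements."""

    name: str

    @abstractmethod
    def lookup(self, format_key: str) -> Optional[dict]:
        """Return a description of the format, or None if it is unknown."""


class InMemoryRegistry(FormatRegistry):
    """Stand-in adapter; a real adapter would call the remote registry."""

    def __init__(self, name: str, entries: dict):
        self.name = name
        self._entries = entries

    def lookup(self, format_key: str) -> Optional[dict]:
        return self._entries.get(format_key)


class RegistryClient:
    """Ask each configured registry in turn and return the first answer."""

    def __init__(self, registries: list):
        self._registries = registries

    def lookup(self, format_key: str) -> Optional[dict]:
        for registry in self._registries:
            entry = registry.lookup(format_key)
            if entry is not None:
                # Record which registry answered alongside its description.
                return {"registry": registry.name, **entry}
        return None


# Illustrative registries and entries only.
client = RegistryClient([
    InMemoryRegistry("registry-a", {"step": {"label": "STEP exchange file"}}),
    InMemoryRegistry("registry-b", {"cif": {"label": "Crystallographic Information File"}}),
])
found = client.lookup("cif")
missing = client.lookup("unknown-format")
```

The calling application depends only on `lookup`, so adding another registry means writing one more adapter rather than changing the application.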
Adding value through annotation
Research at the University of Edinburgh
• Scientific databases: Annotation scoping report
• New annotation model + prototype MONDRIAN
• Intuitive visual interface iMONDRIAN
• Annotate sets of values
• Support for querying annotations
Nature 23 March 2006 OTMI: Open Text Mining Interface
NaCTeM http://www.nactem.ac.uk/
Emerging tools: TerMine, GENIA, Cafetiere
Knowledge extraction:
• Mining (data, text, structures)
• Modelling (economic, climate, mathematical, biological…)
• Analysis (statistical, lexical, gene….)
Supporting the community: Services
• [email protected]
• legal - technical guidance
• Curation Manual 45 chapters planned
– Metadata (umbrella)
– Open Source
– Archival metadata
– Preservation metadata
– Selection & appraisal
– Curating emails
• Briefing Papers
– Curating emails
– Digital repositories
– Geospatial data
– Data protection
– eScience data
• Case studies
DCC Case Study published: Wide Field Astronomy Unit
Supporting the community: Outreach & Services
• Workshops:
– Geospatial data, NeSC, 27 October
– OAIS 5 year Review, October
– Audit & Certification Forum, October
– Records Management, L’pool 30 Nov
– Curation & Preservation Training, Dec
– 2007 Preservation of journals tbc
– 2007 Legal environment tbc
– 2007 Preparing for audit tbc
• Information Days: British Library, L’pool, UCL
• 2nd International DCC Conference 21-22 November, Glasgow
• Keynotes: Hans F. Hoffmann, CERN, Clifford Lynch, CNI
DCC Phase 2: 2007-2010
• Working more closely with data centres, e-Science Programmes and Research Councils
• SCARP Project: disciplinary approach
• JISC Digital Repository Programme collaboration
• RepInfo Registry service migration
• Define self-assessment procedures and tools
• Collaborate with CASPAR, DPE and PLANETS (EU-funded Digital Preservation Projects)
• Workshop Programme, International Conference 2007
University of Bath, 13 September 2006
Thank you. Questions?
Join the DCC Associates Network at www.dcc.ac.uk