Auditing Distributed Digital Preservation Networks
Prepared for CNI Fall Meeting 2012, Washington, D.C., December 2012
Micah Altman, Director of Research, MIT Libraries; Non-Resident Senior Fellow, The Brookings Institution
Jonathan Crabtree, Assistant Director of Computing and Archival Research, HW Odum Institute for Research in Social Science, UNC

DESCRIPTION

This presentation, delivered at CNI 2012, summarizes lessons learned from trial audits of several production distributed digital preservation networks. The audits were conducted using the open source SafeArchive system, which enables automated auditing of a selection of TRAC criteria related to replication and storage. Analysis of the trial audits both demonstrates the complexities of auditing modern replicated storage networks and reveals common gaps between archival policy and practice. Recommendations for closing these gaps are discussed, as are extensions that have been added to the SafeArchive system to mitigate risks in distributed digital preservation (DDP).

TRANSCRIPT

  • 1. Auditing Distributed Digital Preservation Networks. Prepared for CNI Fall Meeting 2012, Washington, D.C., December 2012. Micah Altman, Director of Research, MIT Libraries; Non-Resident Senior Fellow, The Brookings Institution. Jonathan Crabtree, Assistant Director of Computing and Archival Research, HW Odum Institute for Research in Social Science, UNC.
  • 2. Collaborators*: Nancy McGovern; Tom Lipkis & the LOCKSS Team; Data-PASS Partners (ICPSR, Roper Center, NARA, Henry A. Murray Archive); Dataverse Network Team @ IQSS. Research support: thanks to the Library of Congress, the National Science Foundation, IMLS, the Sloan Foundation, the Harvard University Library, the Institute for Quantitative Social Science, and the Massachusetts Institute of Technology. (* And co-conspirators.)
  • 3. Related Work. Reprints available from: micahaltman.com. M. Altman & J. Crabtree, "Using the SafeArchive System: TRAC-Based Auditing of LOCKSS," Proceedings of Archiving 2011, Society for Imaging Science and Technology. Altman, M., Beecher, B., & Crabtree, J. (2009). "A Prototype Platform for Policy-Based Archival Replication." Against the Grain, 21(2), 44-47.
  • 4. Preview. Why distributed digital preservation? Why audit? SafeArchive: automating auditing. Theory vs. practice: Round 0: Calibration; Round 1: Self-Audit; Round 2: Self-Compliance (almost); Round 3: Auditing Other Networks. Lessons learned: practice & theory.
  • 5. Why distributed digital preservation?
  • 6. Slightly Long Answer: Things Go Wrong. Physical & hardware; software; media; insider & external attacks; organizational failure; curatorial error.
  • 7. Potential Nexuses for Preservation Failure. Technical: media failure (storage conditions, media characteristics); format obsolescence; preservation infrastructure software failure; storage infrastructure software failure; storage infrastructure hardware failure. External threats to institutions: third-party attacks; institutional funding; change in legal regimes. Quis custodiet ipsos custodes?: unintentional curatorial modification; loss of institutional knowledge & skills; intentional curatorial de-accessioning; change in institutional mission. Source: Reich & Rosenthal 2005.
  • 8. The Problem. "Preservation was once an obscure backroom operation of interest chiefly to conservators and archivists: it is now widely recognized as one of the most important elements of a functional and enduring cyberinfrastructure." [Unsworth et al., 2006] Libraries, archives and museums hold digital assets they wish to preserve, many of them unique. Many of these assets are not replicated at all. Even when institutions keep multiple backups offsite, many single points of failure remain.
  • 9. Why audit?
  • 10. Short Answer: Why the heck not? "Don't believe in anything you hear, and only half of what you see" - Lou Reed. "Trust, but verify." - Ronald Reagan.
  • 11. Full Answer: It's our responsibility.
  • 12. OAIS Model Responsibilities. Accept appropriate information from Information Producers. Obtain sufficient control of the information to ensure long-term preservation. Determine which groups should become the Designated Community (DC) able to understand the information. Ensure that the preserved information is independently understandable to the DC. Ensure that the information can be preserved against all reasonable contingencies. Ensure that the information can be disseminated as authenticated copies of the original, or as traceable back to the original. Make the preserved data available to the DC.
  • 13. OAIS Basic Implied Trust Model. The organization is axiomatically trusted to identify designated communities. The organization is engineered with the goal of collecting appropriate, authentic documents and reliably delivering authentic documents, in understandable form, at a future time. Success depends upon: reliability of storage systems & services (e.g., LOCKSS networks, Amazon Glacier); reliability of organizations (MetaArchive, Data-PASS, Digital Preservation Network); document contents and properties (formats, metadata, semantics, provenance, authenticity).
  • 14. Enhancing Reliability through Trust Engineering. Incentives: rewards, penalties; recognized practices and shared norms; incentive-compatible mechanisms; social evidence. Modeling and analysis: statistical quality control & reliability estimation, threat-modeling and vulnerability assessment. Portfolio theory: diversification (financial, legal, technical, institutional); hedging. Over-engineering approaches: safety margin, redundancy. Informational approaches: transparency (release of information permitting direct evaluation of compliance); common knowledge. Crypto: signatures, fingerprints, non-repudiation. Social engineering: reduce provocations; remove excuses. Regulatory approaches: disclosure; review; certification; audits; regulations & penalties. Security engineering: increase effort for the attacker (harden target, reduce vulnerability, increase technical/procedural controls, remove/conceal targets); increase risk to the attacker (surveillance, detection, likelihood of response); reduce reward (deny benefits, disrupt markets, identify property).
  • 15. Audit [aw-dit]: an independent evaluation of records and activities to assess a system of controls. Fixity mitigates risk only if used for auditing.
  • 16. Functions of Storage Auditing: detect corruption/deletion of content; verify compliance with storage/replication policies; prompt repair actions.
  • 17. Bit-Level Audit Design Choices. Audit regularity and coverage: on demand (manually); on object access; on event; randomized sample; scheduled/comprehensive. Fixity check & comparison algorithms. Auditing scope: integrity of object; integrity of collection; integrity of network; policy compliance; public/transparent auditing. Trust model. Threat model.
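To make these design choices concrete, the following sketch (Python; the manifest format and all names are invented for illustration and are not taken from SafeArchive or LOCKSS) combines one option from each: a scheduled, randomized-sample audit that recomputes SHA-256 fixity values and compares them against a stored manifest to detect corruption or deletion.

```python
import hashlib
import random
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Recompute a fixity value by streaming the file through SHA-256."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def audit_sample(manifest: dict[str, str], root: Path, sample_size: int = 100):
    """Randomized-sample audit: compare stored fixity values with recomputed ones.

    `manifest` maps relative object paths to previously recorded SHA-256 digests
    (a hypothetical format for this sketch). Returns lists of missing and
    corrupted objects for follow-up repair.
    """
    missing, corrupted = [], []
    sample = random.sample(sorted(manifest), min(sample_size, len(manifest)))
    for rel_path in sample:
        obj = root / rel_path
        if not obj.exists():
            missing.append(rel_path)          # deletion detected
        elif sha256_of(obj) != manifest[rel_path]:
            corrupted.append(rel_path)        # silent corruption detected
    return missing, corrupted
```

A comprehensive audit is the same loop run over every manifest entry; the choice between sampling and full coverage trades audit cost against detection latency.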
  • 18. Repair. Auditing mitigates risk only if used for repair. Key design elements: repair granularity; repair trust model; repair latency (detection to start of repair); repair duration; repair algorithm.
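A companion sketch, under similarly hypothetical assumptions (invented paths and a single designated trusted replica; this is not the LOCKSS repair protocol), shows how these design elements surface in code: object-level granularity, a trust model that only copies from the designated source, and timestamps from which repair duration can be reported.

```python
import shutil
import time
from pathlib import Path

def repair_objects(damaged: list[str], local_root: Path, trusted_root: Path) -> dict:
    """Object-granularity repair from a single trusted replica.

    `damaged` is the list of relative paths flagged by the audit;
    `trusted_root` is the replica the trust model allows us to copy from.
    Returns timing information so repair duration can be reported.
    """
    started = time.time()                      # repair begins; latency window ends here
    repaired, failed = [], []
    for rel_path in damaged:
        source = trusted_root / rel_path
        target = local_root / rel_path
        if source.exists():
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, target)       # replace the damaged copy
            repaired.append(rel_path)
        else:
            failed.append(rel_path)            # escalate: no trusted copy available
    return {"repaired": repaired, "failed": failed,
            "repair_duration_s": time.time() - started}
```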
  • 19. Summary of Current Automated Preservation Auditing Strategies. LOCKSS: automated; decentralized (peer-to-peer); tamper-resistant auditing & repair; for collection integrity. iRODS: automated centralized/federated auditing for collection integrity; micro-policies. DuraCloud: automated; centralized auditing; for file integrity (manual repair by DuraSpace staff available as a commercial service if using multiple cloud providers). Digital Preservation Network: in development. SafeArchive: automated; independent; multi-centered auditing, repair and provisioning of existing LOCKSS storage networks; for collection integrity and for high-level policy (e.g. TRAC) compliance.
  • 20. LOCKSS Auditing & Repair: decentralized, peer-to-peer, tamper-resistant replication & repair. Regularity: scheduled. Algorithms: bespoke, peer-reviewed, tamper-resistant. Scope: collection integrity; collection repair. Trust model: publisher is the canonical source of content; changed content is treated as new; replication peers are untrusted. Main threat models: media failure; physical failure; curatorial error; external attack; insider threats; organizational failure. Key auditing limitations: correlated software failure; lack of policy auditing and of public/transparent auditing.
  • 21. SafeArchive Auditing & Repair: TRAC-aligned policy auditing as an overlay network. Regularity: scheduled; manual. Fixity algorithms: relies on the underlying replication system. Scope: collection integrity; network integrity; network repair; high-level (e.g. TRAC) policy auditing. Trust model: external auditor, with permissions to collect metadata/log information from the replication network; the replication network is untrusted. Main threat models: software failure; policy implementation failure (curatorial error, insider threat); organizational failure; media/physical failure through the underlying replication system. Key auditing limitations: relies on the underlying replication system, (now) LOCKSS, for fixity check and repair.
  • 22. SafeArchive: TRAC-Based Auditing & Management of Distributed Digital Preservation. Facilitating collaborative replication and preservation with technology: collaborators declare explicit non-uniform resource commitments; the policy records commitments and storage network properties; the storage layer provides replication, integrity, freshness, versioning; SafeArchive software provides monitoring, auditing, and provisioning; content is harvested through HTTP (LOCKSS) or OAI-PMH. Integration of LOCKSS, the Dataverse Network, and TRAC.
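For the OAI-PMH route, harvesting reduces to issuing standard ListRecords requests and following resumption tokens. The sketch below uses only the OAI-PMH protocol itself; the endpoint URL in the usage comment is a placeholder, and this is not the LOCKSS OAI plugin.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def harvest(base_url: str, metadata_prefix: str = "oai_dc"):
    """Yield <record> elements from an OAI-PMH provider, following resumptionTokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            tree = ET.parse(response)
        for record in tree.iter(OAI_NS + "record"):
            yield record
        token = tree.find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            break                                   # no more pages
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

# Example (placeholder endpoint):
# for rec in harvest("https://dataverse.example.edu/oai"):
#     print(rec.find(f".//{OAI_NS}identifier").text)
```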
  • 23. SafeArchive: Schematizing Policy and Behavior. Policy: "The repository system must be able to identify the number of copies of all stored digital objects, and the location of each object and their copies." Schematization. Behavior (operationalization).
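One plausible schematization of the quoted criterion is a small machine-readable policy that an auditor can evaluate against observed behavior. The structure below is purely illustrative: the field names, collection identifiers, and thresholds are invented for this sketch and are not SafeArchive's actual schema.

```python
# Hypothetical schematization of the TRAC criterion quoted above: for each
# collection, the network must hold a minimum number of verified copies,
# spread over a minimum number of distinct regions.
replication_policy = {
    "collections": {
        "data-pass:collection-a": {"min_copies": 3, "min_regions": 2},
        "data-pass:collection-b": {"min_copies": 3, "min_regions": 2},
    },
    "audit_interval_days": 7,   # how often compliance is re-verified
}

def check_collection(rule: dict, observed: dict) -> list[str]:
    """Compare one collection's observed replicas against its policy rule.

    `observed` maps replica host -> {"region": str, "verified": bool}.
    Returns a (possibly empty) list of human-readable violations.
    """
    verified = {h: r for h, r in observed.items() if r["verified"]}
    violations = []
    if len(verified) < rule["min_copies"]:
        violations.append(f"only {len(verified)} verified copies")
    if len({r["region"] for r in verified.values()}) < rule["min_regions"]:
        violations.append("verified copies not spread over enough regions")
    return violations

# Example: check_collection(replication_policy["collections"]["data-pass:collection-a"],
#                           {"host1": {"region": "east", "verified": True}})
```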
  • 24. Adding High-Level Policy to LOCKSS. LOCKSS (Lots of Copies Keep Stuff Safe): widely used in the library community; self-contained OSS replication system, low maintenance, inexpensive; harvests resources via web-crawling, OAI-PMH, database queries; maintains copies through a secure p2p protocol; zero trust & self-repairing. What does SafeArchive add? Auditing: easily monitor the number of copies of content in the network. Provisioning: ensure sufficient copies and distribution. Collaboration: coordinate across partners, monitor resource commitments. Provide restoration guarantees. Integrate with the Dataverse Network digital repository.
  • 25. Design Requirements. SafeArchive is a targeted vertical slice of functionality through the policy stack. Policy driven: institutional policy creates formal replication commitments; documents and supports TRAC/ISO policies. Allows asymmetric commitments: storage commitments; size of holdings being replicated; distribution of holdings over time; to the owning archive; to replication hosts. Limited trust: no superuser; partners trusted to hold the unencrypted content of others (reinforced with legal agreements); at least one system trusted to read the status of participating systems; at least one system to initiate new harvesting on a participating system; no deletion/modification of objects stored on another system. Schema-based auditing used to: verify collection replication; record storage commitments; document all TRAC criteria; demonstrate policy compliance. Provide restoration guarantees.
  • 26. SafeArchive Components
  • 27. SafeArchive in Action: safearchive.org
  • 28. Theory vs. Practice. Round 0: Setting up the Data-PASS PLN. "Looks ok to me" - PHB Motto.
  • 29. THEORY: Start -> Expose content (through OAI+DDI+HTTP) -> Install LOCKSS (on 7 servers) -> Harvest content (through the OAI plugin) -> Set up PLN configurations -> LOCKSS magic -> Done.
  • 30. Application: Data-PASS Partnership. Data-PASS partners collaborate to identify and promote good archival practices, seek out at-risk research data, build preservation infrastructure, and mutually safeguard collections. Data-PASS collections: 5 collections, updated ~daily; research data as content; 25,000+ studies; 600,000+ files; >=3 verified replicas per collection, >=2 regions.
  • 31. Practice (Round 0). OAI plugin extensions required for: non-DC metadata; large metadata; alternate authentication method; support for OAI sets; non-fatal error handling. OAI provider (Dataverse) tuning: performance handling for delivery; performance handling for errors. PLN configuration required: stabilization around LOCKSS versions; coordination around the plugin repository; coordination around collection definition. Dataverse Network extensions: generate LOCKSS manifest pages; license harmonization; LOCKSS export control by archive curator. (Shown against the theory flowchart: Expose content -> Install LOCKSS -> Harvest content -> Set up PLN configurations -> LOCKSS magic.)
  • 32. Results (Round 0). Remaining issues: none known. Outcomes: LOCKSS OAI plugin extensions (later integrated into LOCKSS core); Dataverse Network performance tuning; Dataverse Network extensions.
  • 33. Lesson 0: When innovating, plan for a substantial gap between prototype and production, and for multiple iterations.
  • 34. Theory vs. Practice. Round 1: Self-Audit. "A mere matter of implementation" - PHB Motto.
  • 35. THEORY (Round 1): Start -> Gather information from each replica (via the LOCKSS cache manager) -> Integrate information -> Map of network state -> Compare the current network state to the policy. If the current state matches the policy: Success. If not: add a replica (logging errors for later investigation) and repeat.
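The loop in this diagram can be written down almost literally. The sketch below is a hypothetical rendering of that theory-version loop; the data-gathering and provisioning functions are injected as parameters precisely because, as the next slides show, the real versions are where the difficulty lies.

```python
def audit_network(replicas, policy, fetch_state, add_replica):
    """One pass of the theory-version audit loop.

    `policy` has the shape {"collections": {name: {"min_copies": int}}} (hypothetical);
    `fetch_state(replica)` returns {collection: {"verified": bool}} for that replica;
    `add_replica(collection)` provisions another copy. Both are injected so the loop
    itself stays independent of LOCKSS details.
    """
    # 1. Gather information from each replica, logging failures for later investigation.
    network_state, errors = {}, []
    for replica in replicas:
        try:
            network_state[replica] = fetch_state(replica)
        except Exception as exc:
            errors.append((replica, exc))

    # 2. Integrate: map collection -> replicas currently holding a verified copy.
    copies = {}
    for replica, collections in network_state.items():
        for name, info in collections.items():
            if info.get("verified"):
                copies.setdefault(name, set()).add(replica)

    # 3. Compare the current state to policy; provision where it falls short.
    compliant = True
    for name, rule in policy["collections"].items():
        if len(copies.get(name, set())) < rule["min_copies"]:
            compliant = False
            add_replica(name)                   # the "Add Replica" branch of the diagram
    return compliant, errors
```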
  • 36. Implementation www.safearchive.org
  • 37. Practice (Round 1). Gathering information required: replacing the LOCKSS cache manager; permissions; reverse-engineering UIs (with help); network magic. Integrating information required: heuristics for lagged information; heuristics for incomplete information; heuristics for aggregated information. Comparing the map to policy required: a mere matter of implementation (theory).
  • 38. Results (Round 1). Outcomes: implementation of the SafeArchive reporting engine; stand-alone OSS replacement for the LOCKSS cache manager; initial audit of Data-PASS replicated collections. Problems: collections achieving policy compliance were actually incomplete ("Dude, where's our metadata?"); uh-oh, most collections failed policy compliance; adding replicas didn't solve it.
  • 39. Lesson 1: Replication agreement does not prove collection integrity. What you see: replicas X, Y, Z agree on collection A. What you are tempted to conclude: replicas X, Y, Z agree on collection A, therefore collection A is good.
  • 40. What can you infer from replication agreement? "Replicas X, Y, Z agree on collection A" implies "collection A is good" only under assumptions: harvesting did not report errors and the harvesting system is error-free, or errors are independent per object and there is a large number of objects in the collection. Supporting external evidence: multiple independent harvester implementations; systematic automated harvester testing; collection comparison with external collection statistics; automated restore and systematic testing per collection; harvester log monitoring and comparison.
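One of these evidence sources, comparison with external collection statistics, is easy to illustrate: agreement among replicas is only treated as meaningful if it also matches an object count reported independently by the archive of record. The sketch below is hypothetical; the tolerance and data shapes are invented for illustration.

```python
def agreement_is_meaningful(replica_counts: dict[str, int],
                            source_reported_count: int,
                            tolerance: float = 0.01) -> bool:
    """Guard against 'replicas agree on an incomplete collection'.

    `replica_counts` maps replica host -> number of objects it holds for the
    collection; `source_reported_count` comes from the archive of record
    (e.g. its catalogue or OAI provider), an evidence source independent of
    the harvester that filled the replicas.
    """
    counts = set(replica_counts.values())
    if len(counts) != 1:
        return False                  # replicas do not even agree among themselves
    agreed = counts.pop()
    # Replicas agreeing on far fewer objects than the source reports suggests
    # a systematic harvesting gap, not a healthy collection.
    return abs(agreed - source_reported_count) <= tolerance * source_reported_count
```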
  • 41. Lesson 2: Replication disagreement does not prove corruption. What you see: replicas X, Y disagree with Z on collection A. What you are tempted to conclude: collection A on host Z is bad, so repair/replace collection A on host Z.
  • 42. What can you infer from replication failure? "Replicas X, Y disagree with Z on collection A" implies "collection A on host Z is bad" only under assumptions: the disagreement implies that the content of collection A differs across hosts; the contents of collection A should be identical on all hosts; if some content of collection A is bad, the entire collection is bad. Possible alternate scenarios: collections grow rapidly; objects in collections are frequently updated; audit information cannot be collected from some host; ??? (other scenarios not yet identified).
  • 43. Theory vs. Practice. Round 2: Compliance (almost). "How do you spell backup? RE-COVER-Y."
  • 44. Lesson 3: Distributed digital preservation works, with evidence-based tuning and adjustment. Diagnostics: when the network is out of adjustment, additional information is needed to inform adjustment; worked with the LOCKSS team to gather information. Adjustments: timings (e.g. crawls, polls); understand, tune, and parameterize heuristics and reporting; track trends over time. Collections: change partitioning into AUs at the source; extend mapping to AUs in the plugin; extend the reporting/policy framework to group AUs. Outcomes: at the time, verified replications of all collections; currently, minor policy violations in one collection; worked with the LOCKSS team to design further instrumentation of LOCKSS.
  • 45. Theory vs. Practice. Round 3: Auditing Other PLNs. "In theory, theory and practice are the same; in practice, they differ."
  • 46. Application: COPPUL (Council of Prairie and Pacific University Libraries). Collections: 9 institutions; dozens of collections; journal runs; digitized member content (text, photos, images, ETDs). Goal: multiple verified replicas.
  • 47. Application: Digital Federal Depository Library Program. The Digital Federal Depository Library Program, or the USDocs private LOCKSS network, replicates key aspects of the United States Federal Depository System. Collections: dozens of institutions (24 replicating); electronic publications; 580+ collections; 10 TB, including audio and video content. Testing only; full auditing not yet performed.
  • 49. THEORY (Round 3): Start -> Gather information from each replica -> Integrate information -> Map of network state -> Compare the current network state to the policy. If the current state matches the policy: Success. If not: have collection sizes and polling intervals already been adjusted? If no, adjust them; if yes, add a replica; then repeat.
  • 50. Here's where things get even more complicated.
  • 51. Practice (Year 3). Lesson 6: Trust, but continuously verify. 20-80% initial failure to confirm policy compliance; tuning was infeasible, or yielded only moderate improvement. Outcomes: in-depth diagnostics and analysis with the LOCKSS team; adjustment of auditing algorithms to detect islands of agreement (see the sketch below); adjusted expectations (focus on inferences rather than replication agreement; focus on 100% policy compliance per collection rather than 100% error-free); design of file-level diagnostic instrumentation in LOCKSS; re-analysis in progress.
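The islands-of-agreement adjustment replaces the question "do all replicas agree?" with "which groups of replicas agree with one another, and is the largest group big enough to satisfy policy?". A minimal, hypothetical sketch of that grouping is shown below; the real adjustment operates on LOCKSS polling data rather than on whole-collection digests.

```python
from collections import defaultdict

def islands_of_agreement(collection_digests: dict[str, str]) -> list[set[str]]:
    """Group replicas into 'islands' that hold byte-identical collection content.

    `collection_digests` maps replica host -> digest of its copy of the collection.
    Returns islands sorted largest-first: a single all-replicas island means full
    agreement, while several islands localize where the disagreement actually lies.
    """
    groups = defaultdict(set)
    for replica, digest in collection_digests.items():
        groups[digest].add(replica)
    return sorted(groups.values(), key=len, reverse=True)

def satisfies_policy(collection_digests: dict[str, str], min_copies: int) -> bool:
    """Policy-compliance test based on the largest island rather than unanimity."""
    islands = islands_of_agreement(collection_digests)
    return bool(islands) and len(islands[0]) >= min_copies
```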
  • 52. What can you infer from replication failure? "Replicas X, Y disagree with Z on collection A" implies "collection A on host Z is bad" only under assumptions: the disagreement implies that the content of collection A differs across hosts; the contents of collection A should be identical on all hosts; if some content of collection A is bad, the entire collection is bad. Possible alternate scenarios: collections grow rapidly; objects in collections are frequently updated; audit information cannot be collected from some host; ??? (other scenarios not yet identified).
  • 53. What else could be wrong? Round 1 hypothesis: the disagreement is real, but doesn't matter in the long run. 1.1 Temporary differences: collections temporarily out of sync (either missing objects or different object versions) will resolve over time (e.g., if harvest frequency