VRC: Preservation Risk Management for Web Resources
Nancy Y. McGovern, ECURE 2004
VRC Funding
• Part of a 4(5)-year NSF-funded project
– supported by the Digital Libraries Initiative, Phase 2 (Grant No. IIS-9905955, the Prism Project)
• Also partially funded by a grant from The Andrew W. Mellon Foundation
– Political Communications Web Archiving
– http://www.crl.edu/content/PolitWeb.htm
• For updates:
– http://irisresearch.library.cornell.edu/VRC/
Current Team
Anne R. Kenney, Advisor
Nancy Y. McGovern, Project Manager
Richard Entlich, Sr. Researcher
William R. Kehoe, Technology Coordinator
Ellie Buckley, Digital Research Specialist
Erica Olsen (recent)
Carl Lagoze, CIS PI
Research Scope
See "Preservation Risk Management for Web Resources: Virtual Remote Control in Cornell's Project Prism"
by Anne R. Kenney, Nancy Y. McGovern, Peter Botticelli, Richard Entlich, Carl Lagoze, and Sandra Payette,
in D-Lib Magazine, January 2002
http://www.dlib.org/dlib/january02/kenney/01kenney.html
Virtual…
• because VRC develops models to represent essential features of selected Web sites
• that enable ongoing monitoring over time
• to identify, respond to, and mitigate potential risks to the site integrity and longevity
Remote…
• because VRC is intended for use by cultural heritage institutions
• interested in the longevity of Web resources
• residing on remote servers, not owned or managed by the monitoring institution
Control…
• because at the most proactive end of the VRC approach
• a monitoring organization may act to protect another organization's resources
• by agreement or implicit consent
• through notification and/or action
Purpose
• Develop a model for research libraries (adaptable to other contexts)
• Support spectrum from passive monitoring to active capture
• Lifecycle support: selection to capture
• Understand nature of Web resources
• Promulgate good practice
Types of Web Resources
Two types of initiatives for monitoring and/or capture of:
• Web-based publications [Web site as a means]
• All of (or a subset of) a Web site consisting of pages within a boundary defined by a URL (or a portion of one) [Web site as an end] (VRC)
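A minimal sketch of the "boundary defined by a URL (or a portion of one)" idea, assuming the boundary is expressed as a URL prefix. The function name `in_boundary` and the example.org addresses are illustrative only and do not come from the VRC project.

```python
from urllib.parse import urlsplit


def in_boundary(url: str, boundary: str) -> bool:
    """Return True if `url` lies inside the site boundary given by a URL prefix.

    Assumption: a page belongs to the site when it is on the same host and its
    path equals the boundary path or sits beneath it.
    """
    u, b = urlsplit(url), urlsplit(boundary)
    if u.hostname != b.hostname:  # hostnames are lowercased by urlsplit
        return False
    # Compare on whole path segments so "/collections-old" is not
    # mistaken for a child of "/collections".
    base = b.path if b.path.endswith("/") else b.path + "/"
    path = u.path or "/"
    return path == b.path or path.startswith(base)
```

Matching on whole path segments rather than raw string prefixes keeps sibling directories with similar names outside the boundary.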
Nature of Risks
Two perspectives on Web-based risk:
• potential liability of an institution based upon the content of its Web site, or a Web site for which it is responsible
• potential threats to the integrity and longevity of a Web resource (VRC)
Types of Risks
Include:
• technological obsolescence
• security weaknesses and breaches
• human error in developing/maintaining sites
• organizational issues; benign neglect
• power and technology failures
• inadequate backup and secondary systems
Risk Factors
• Organizational Context
• Combination of indicators
• Monitoring (change/loss over time)
• Triggers (events, organizational, upgrades)
• Degradation of site management indicators
VRC Stages
1. Identification
2. Analysis
3. Appraisal
4. Strategy
5. Detection
6. Response
Human – Tool Scenario
1. Identification
– Human: identify Web resources of interest
– Tool: verify list, expand list
2. Analysis
– Tool: crawl sites, generate characterizations
– Human: accept/revise characterizations
3. Appraisal
– Human: define/review attributes of value
– Tool: support appraisal, capture results
4. Strategy
– Human: develop/review strategies
– Tool: plot appraisals, compile strategies
5. Detection
– Human: define risk parameters
– Tool: identify/assess risks; propose responses
6. Response
– Tool: propose risk response based on rules; automatic response for some risk categories
– Human: monitor automated responses; select response based on recommended actions
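The division of labor in the Response stage (rules propose, some categories respond automatically, a human reviews the rest) might be sketched as follows. The risk categories, responses, and the `Rule` structure are hypothetical; the slides do not specify the rule format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    risk_category: str
    response: str
    automatic: bool  # True: the tool may act without waiting for a human


# Hypothetical rule set, for illustration only.
RULES = [
    Rule("broken_links", "notify site maintainer", automatic=True),
    Rule("server_unpatched", "notify site maintainer", automatic=True),
    Rule("content_loss", "capture remaining pages", automatic=False),
]


def propose_responses(detected, rules=RULES):
    """Split detected risks into automatic responses and items for human review."""
    auto, review = [], []
    for risk in detected:
        matches = [r for r in rules if r.risk_category == risk]
        if not matches:
            # No rule covers this risk, so it always goes to a person.
            review.append((risk, "no rule; needs human decision"))
        for rule in matches:
            (auto if rule.automatic else review).append((risk, rule.response))
    return auto, review


auto, review = propose_responses(["broken_links", "content_loss", "unknown"])
```

Here `auto` holds the responses the tool could carry out itself, while `review` holds the recommended actions a person would select from.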
Contextual Layers
Server-level Monitoring
• Potential multi-site impact
• Server vulnerabilities put site content at risk
– deletion or modification
• Patches and new versions of Microsoft IIS and Apache server released frequently
• Apache HTTP server 1.3 security updates
– to version 1.3.26 on June 18, 2002
– to version 1.3.27 on October 3, 2002
[Figure: Apache HTTP server upgrades. X-axis: week of server check, weekly from 6/17/2002 to 1/13/2003. Y-axis: % servers upgraded, 0 to 70. Series: upgrades to Apache 1.3.26 (Asia sites); upgrades to Apache 1.3.26 (ARL sites); upgrades to Apache 1.3.27 (Asia sites); upgrades to Apache 1.3.27 (ARL sites).]
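One way a weekly "% servers upgraded" figure could be computed is from the HTTP `Server` response header collected during each check. The slides do not say how versions were detected, so the header-based approach, the function names, and the sample headers below are all assumptions.

```python
import re


def apache_version(server_header: str):
    """Extract the Apache version from a Server header as a tuple, or None."""
    match = re.search(r"Apache/(\d+(?:\.\d+)*)", server_header)
    if match is None:
        return None  # not Apache, or the version is hidden
    return tuple(int(part) for part in match.group(1).split("."))


def percent_upgraded(server_headers, minimum=(1, 3, 26)):
    """Percentage of version-reporting Apache servers at or above `minimum`."""
    versions = [v for v in map(apache_version, server_headers) if v is not None]
    if not versions:
        return 0.0
    upgraded = sum(1 for v in versions if v >= minimum)
    return 100.0 * upgraded / len(versions)


# Invented sample of headers from one weekly check.
week = [
    "Apache/1.3.26 (Unix)",
    "Apache/1.3.19 (Unix) PHP/4.0.4",
    "Microsoft-IIS/5.0",
    "Apache/1.3.27",
]
share_26 = percent_upgraded(week, (1, 3, 26))
share_27 = percent_upgraded(week, (1, 3, 27))
```

Comparing versions as integer tuples avoids the string-comparison trap in which "1.3.9" would sort above "1.3.26". Servers that suppress their version are left out of the denominator in this sketch.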
VRC Toolkit
• Identify tools for each stage (adopt, adapt, define, devise)
• Leverage existing; apply to longevity
• Analyze steps - automated and manual
• Formalize protocol
• Provide a framework to map existing, plug gaps with developments
VRC Toolkit
Development steps:
– extensive literature review
– development of tool categories
– definition of categories and test protocols
– survey existing tools for evaluation
– select representative for testing
– highlight findings in category summaries
Web Crawling
• traversing Web sites via links
• a capability common to most tools, but with different purposes and results
• the VRC toolkit needs more than just Web crawlers
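"Traversing Web sites via links" can be illustrated with a small breadth-first crawl. This is a sketch, not a VRC tool: the `fetch` callable is a stand-in for an HTTP client (here a dictionary lookup, so the example runs offline), and the pages are invented.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start, fetch, prefix):
    """Breadth-first traversal of pages under `prefix`.

    `fetch(url)` returns HTML text, or None when the page cannot be retrieved.
    Returns (pages reached, URLs that could not be fetched).
    """
    seen, queue, broken = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            broken.append(url)
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before comparing.
            target = urldefrag(urljoin(url, href)).url
            if target.startswith(prefix) and target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(seen - set(broken)), broken


# Invented two-page site with one off-site link and one dead link.
PAGES = {
    "http://example.org/": '<a href="a.html">A</a> <a href="http://elsewhere.example.com/">X</a>',
    "http://example.org/a.html": '<a href="/">home</a> <a href="missing.html#top">gone</a>',
}
reached, broken = crawl("http://example.org/", PAGES.get, "http://example.org/")
```

The same traversal serves different purposes depending on what is kept: a link checker reports `broken`, while a site mapper or capture tool works from `reached`.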
Tool Categories
Link checkers
Web site monitors
Web crawlers
Site management
Change Management
Site Mapping (includes visualization)
OAIS Issues
• Pre-Ingest: Selection options
• Ingest: Capture
– vs. monitoring
– Targets, level and frequency
• Archival Storage: Formats
• Access: Site(s) vs. Page(s)
• AIP: Metadata issues
Management Issues
• frequency of capture, determined by:
– nature of sites/pages
– events: technological, organizational
– resources
• well-informed crawling
• valuable vs. archival
Mandate
• to fully document the site by capturing all changes to the pages/sites
• to capture significant changes to pages/sites
• to record periodic versions of the site
• to capture one-time copy of pages/sites
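The four mandates above imply different capture decisions for the same observed change. A rough sketch, assuming page content is compared as text and that "significant" means similarity falling below a chosen threshold; the mandate labels, the threshold, and the period are illustrative values, not VRC parameters.

```python
import difflib


def should_capture(mandate, old, new, days_since_last,
                   threshold=0.9, period_days=30):
    """Decide whether to capture a page under a given mandate.

    `old` is the last captured text (None if nothing has been captured yet)
    and `new` is the text observed now.
    """
    if old is None:
        return True  # every mandate needs at least one copy
    if mandate == "one_time":
        return False
    if mandate == "periodic":
        return days_since_last >= period_days
    changed = old != new
    if mandate == "all_changes":
        return changed
    if mandate == "significant_changes":
        similarity = difflib.SequenceMatcher(None, old, new, autojunk=False).ratio()
        return changed and similarity < threshold
    raise ValueError(f"unknown mandate: {mandate}")
```

Under this sketch a one-character edit triggers a capture for "all_changes" but not for "significant_changes", which is the practical difference between the first two mandates.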
Current Activities
• VRC Preservation Risk Management Program:
– Map stages to tool requirements
– Apply to potential organizational scenarios
– Enable risk/response scenario development
• Toolkit:
– Revise and populate tool inventory
– VRC Control Site
Future Projects
• Develop approach for building human sexuality collection: capturing Web blogs and other Internet communications
• State Government Web site case study
• Demonstrators for toolkit scenarios
For Discussion
What would the VRC approach have to address to be of interest, value, and/or potential impact for archivists and records managers?