MELLON E-JOURNAL ARCHIVING PROJECT
January 20, 2002
DIGITAL PRESERVATION
THE BIG ISSUE IN DIGITAL LIBRARIES
• Digital is inherently fragile
– constant technological change yields short life for all digital materials
• Nothing will be saved passively
– requires constant and conscious action to preserve
• A core role for research libraries in the digital era????
JOURNAL ARCHIVING IN THE PAPER ERA
• Large-scale redundancy
• Access copy and archival copy usually the same
• Not just storage, but preservation
– includes environmental control, library binding, repair, reformatting...
• Deliberate, long-term archiving largely the role of national and research libraries
E-JOURNAL MODEL IS DIFFERENT
• “Copies” are remote, held in publisher systems
– not replicated across different institutions
• Perpetual license provides limited comfort in the absence of independent copies
• Long-term preservation involves very different issues than day-to-day access
LACK OF ARCHIVING A GROWING PROBLEM
• Libraries bearing double costs
– the e-journals users prefer
– the paper for preservation
• Publishers cannot convert totally to digital
– authors and editors distrust e-only journals because of concerns about persistence
– libraries demand paper for preservation
• Libraries preserving paper version, but electronic more complete, increasingly the copy of record
MELLON E-JOURNAL ARCHIVING PROGRAM
• 13 institutions invited to submit proposals for a one-year planning project
• Six planning proposals were selected and funded in December 2000
– additional project focused on technology (LOCKSS) also funded
• Second round of Mellon grants to be announced in June will fund actual implementation
SIX PLANNING PROJECTS
• Publisher-based
– Harvard (Wiley, Blackwell, University of Chicago Press)
– Penn (Oxford and Cambridge University Presses)
– Yale (Elsevier)
• Discipline-based
– Cornell (agriculture)
– NYPL (performing arts)
• Dynamic e-journals
– MIT
SOME BASIC ASSUMPTIONS
• Archive should be independent of publishers
– responsibility of institutions for whom archiving is a core mission
• Archiving requires active publisher partnership
• Address long timeframes (100 years?)
• Archive design based on Open Archival Information System (OAIS) model
OBJECTIVES FOR PLANNING PROJECTS
• Develop draft archiving agreements with publisher partners
• Design technical architecture for an archive
• Formulate an acquisitions and growth plan
• Articulate access policies
• Address validation/certification
• Design an organizational model, staffing, long-term funding model
Key planning issues/decisions…
BASE ON DL INFRASTRUCTURE
• Use existing infrastructure for storage, management, preservation, access
• Enhanced to comply with OAIS model
• New ingest and rendering functions
ARCHIVING AGREEMENT
• Explicit archiving license with publisher
• License addresses what content is archived, responsibilities of parties, conditions of use, economics
• Not always an easy negotiation
– archiving involves handing publisher’s intellectual property to independent party
PUSH MODEL
• Publishers will “push” content to be archived to Harvard
– on-going regular deposit following on-line publication of issue
• (what happens when issues disappear?)
WHAT CONTENT IS DEPOSITED?
• “Journal issues” are complex
– publishers do not treat all journal content the same (e.g. “front matter” treated as web pages, not objects in content management systems)
– “associated materials” (datasets, images, tables, etc.) not in the print versions
– advertising usually dynamic, and can involve country-specific complexities
SOME COMMON STUFF
• Journal description
• Editorial board
• Instructions to authors
• Rights and usage terms
• Copyright statement
• Ordering information
• Reprint information
• Indexes
• Career information
• News
• Events lists
• Discussion fora
• Editorials
• Errata
• Reviewers
• Conference announcements
ARCHIVE MOST CONTENT
• Exclude little except advertisements – different from most “local loading”
• Articles include supplementary materials
• Include an “issue object” in addition to the article components
– masthead, news, jobs, meetings, etc.
• Reference links problematic
– dynamic, frequently separate from article
STANDARD ARCHIVAL ARTICLE DTD
• Publishers’ SGML formats vary widely
• Consultant report on practicality of common archival XML DTD
• Dramatically reduces archive complexity
• Issues include
– how low a common denominator
– extended character sets, formulae, etc.
– sacrifice functionality and original appearance
– transformations involve risks
DEPOSIT MORE THAN ONE FORMAT?
• Archive must accept PDF in any case
– so include both SGML and PDF when available?
• belt and suspenders
– inclined to do this
• Accept publisher’s original SGML also?
– conversion to archival DTD will result in loss
– inclined to not do this
“DARK-TO-LIGHT”
• Archived material not accessible at deposit
– do not compete with publishers
• Content becomes accessible after “trigger event”
– default then is universal access
• But how do you know “dark” archival content is still good?
– it would be better if there was some on-going access...
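One common way to check that dark content is "still good" at the bit level is a periodic fixity audit: record a digest at deposit and recompute it later. The slides do not say how this archive would do it, so the sketch below is only an illustration under that assumption. The names `sha256_of`, `audit`, `stored`, and `manifest` are hypothetical, and the in-memory dictionaries stand in for whatever storage the archive actually uses.

```python
import hashlib
from typing import Dict, List


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of an object's bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def audit(stored: Dict[str, bytes], manifest: Dict[str, str]) -> List[str]:
    """Return IDs of objects that are missing or whose digest has changed.

    `manifest` maps object ID -> digest recorded at deposit (assumed layout).
    `stored` maps object ID -> bytes currently held by the archive.
    """
    failures = []
    for object_id, expected_digest in manifest.items():
        data = stored.get(object_id)
        if data is None or sha256_of(data) != expected_digest:
            failures.append(object_id)
    return failures
```

A check like this only shows that the bits are unchanged. It does not show that the content can still be rendered, which is part of why some on-going access would be a stronger test.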
ACCESS MODEL
• Archived content always accessible to anyone with appropriate license from publisher
– might be satisfied by batch export
• After trigger, simple on-line functionality
– assume same functionality for auditors
TRIGGER EVENTS
• “N” years after deposit
– “N” set by publisher title-by-title
• When title/year no longer commercially accessible on the Internet
– still problematic with some publishers
• When content enters public domain
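The three trigger events above can be read as a simple "any of these" rule. The following is a minimal sketch of that reading, not the project's actual design. The class `ArchivedTitleYear`, its fields, and the function `is_light` are hypothetical names, and how "commercially accessible" or "public domain" would actually be determined is left outside the sketch.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ArchivedTitleYear:
    deposit_date: date
    embargo_years: Optional[int]   # "N", set by publisher per title; None if not agreed
    commercially_available: bool   # still offered commercially on the Internet?
    public_domain: bool            # has the content entered the public domain?


def add_years(d: date, years: int) -> date:
    """Add whole years to a date, mapping Feb 29 to Mar 1 in non-leap years."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return date(d.year + years, 3, 1)


def is_light(item: ArchivedTitleYear, today: date) -> bool:
    """Return True if any trigger event has occurred for this title/year."""
    if item.public_domain:
        return True
    if not item.commercially_available:
        return True
    if item.embargo_years is not None:
        return today >= add_years(item.deposit_date, item.embargo_years)
    return False
```

For example, an item deposited on 2002-01-20 with N = 5 that is still sold commercially would stay dark until 2007-01-20 under this rule.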
PRESERVATION
• Format-by-format issue
• Archive specifies preferred formats, which will be kept renderable
• Just maintain bits for others
– e.g., “associated materials” (datasets, models, etc.) generally accepted in ANY format
• maintaining the viability of such wildly heterogeneous materials unrealistic
– keep unaltered for future “digital archeology”
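The format-by-format policy above amounts to a two-level lookup: preferred formats are kept renderable, everything else is kept as unaltered bits. A minimal sketch follows. The contents of `PREFERRED_FORMATS` are an assumption for illustration (the slides name an archival XML DTD and PDF as accepted formats but do not list the preferred set), and the names `PreservationLevel` and `preservation_level` are hypothetical.

```python
from enum import Enum


class PreservationLevel(Enum):
    RENDERABLE = "keep renderable"
    BITS_ONLY = "maintain bits unaltered"


# Assumed preferred formats, for illustration only.
PREFERRED_FORMATS = {"archival-xml", "pdf"}


def preservation_level(fmt: str) -> PreservationLevel:
    """Map a deposited format to the level of preservation it receives."""
    if fmt.lower() in PREFERRED_FORMATS:
        return PreservationLevel.RENDERABLE
    # Anything else (datasets, models, etc.) is accepted but only bit-preserved.
    return PreservationLevel.BITS_ONLY
```

Keeping the preferred set small is what makes the renderability commitment affordable; every format added to it is a format the archive must be able to migrate or emulate later.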
ECONOMIC MODEL
• First question is not who pays, but what will it cost…
– reducing costs to the minimum is critical
• In general publishers expected to bear preparation costs for archived objects
• Process automation critical to keeping costs low
– ingest process
– auditing
PAYMENT WITH DEPOSIT
• Two-part fee
– ingest fee to cover up-front costs
• varies with publisher effort to create easily archived objects???
– “dowry” to create maintenance endowment
• Sources include subscribers, authors, societies
NEXT…..
• Proposal to Mellon by April 1 for funding to implement an archive
– particular parameters of the call-for-proposals still uncertain
• Original plan suggested 3 or 4 year projects
• Intent is to implement archive, contract for deposit, begin operations
– learn by getting dirty hands
– help understand issues, costs