
Post on 27-Mar-2015


HATHI TRUST A Shared Digital Repository

Digital Preservation, HathiTrust, and the Reimagination of the Library Landscape

Jeremy York

Iceland

August 5, 2010

Outline

• Digital Preservation in U.S.
• HathiTrust
– About HathiTrust
– Content
– What we do (services)
– Governance
– Partnership & Resources
• Google Settlement
• Publishing
• Changing Library Landscape

Books and Journals Archives Data

Portico
• Centralized
• Journals
• Source files, mainly focused on XML, highly controlled transformation

Internet Archive
• Centralized
• Web files

ICPSR
• Centralized
• Social science data

LOCKSS
• Distributed
• Journals
• Web files, not source images or XML

MetaArchive (NDIIPP)
• Distributed
• Private LOCKSS Network
• Web files

DATA-PASS (NDIIPP)
• Distributed
• Social science data

HathiTrust
• Centralized
• Books and Journals
• Master image and OCR files

International Internet Preservation Consortium
• Distributed
• Harvesting tools, Access, Preservation strategies

GeoMAPP (NDIIPP)
• Distributed
• Geospatial data
• State governments

OCLC – Digital Archive
• Centralized
• Master files, web archiving
• CONTENTdm, custom repository

LOCKSS, DuraCloud, DSpace, Fedora

NDIIPP

Mission: Develop a national strategy to collect, preserve and make available significant digital content, especially information that is created in digital form only, for current and future generations.

• Since 2000
• Broad collaborations with institutions and organizations (e.g., OCLC, Portico)
• Funding (Establishing a network, Preserving Creative America, Preserving State Government Information)
• Standards/Best Practices
• Tools
o JHOVE2 (validation)
o Chronopolis (data grid framework)
o Dataverse (management, dissemination, exchange, and citation of virtual collections (dataverses) of quantitative data)
o BagIt (transfer utilities - creation, manipulation and validation of bags)
o Hub and Spoke (repository interoperability)
o FITS (bundle of identification, validation and metadata extraction tools)

About

HathiTrust Digital Library

• Digital Repository
– Initial focus on digitized book and journal content
– “Light” archive
• Collections and Collaboration
– Comprehensive collection
– Shared strategies
– Local services
– Public Good

Current Partners

– Columbia University
– New York Public Library
– University of California system
– CIC (Committee on Institutional Cooperation)
– University of Virginia
– Yale University

CIC member institutions:
– University of Chicago
– University of Illinois
– Indiana University
– University of Iowa
– University of Michigan
– Michigan State University
– University of Minnesota
– Northwestern University
– Ohio State University
– Pennsylvania State University
– Purdue University
– University of Wisconsin-Madison

Content Distribution

6,383,209 – Total
1,234,088 – Public Domain

* As of August 5, 2010

Language Distribution (1)

* As of July 25, 2010

Language Distribution (2)

The next 40 languages make up ~13% of total

* As of July 25, 2010

Dates

* As of July 25, 2010

Originating Institution

* As of July 25, 2010

Content over time

* As of July 25, 2010

Content Growth

What we do

Services (1)

• Ingest
– Google, Internet Archive
– Working toward sustainable model for ingest of content from diverse sources
• Long-term preservation
– Bit-level, migration
– Standard and open formats (ITU G4 TIFF, JPEG2000, JPG, Unicode)
– OAIS, TRAC
– Validation, integrity, redundancy
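The "validation, integrity" item above typically comes down to fixity checking: a checksum is recorded for every file at ingest and recomputed later to detect loss or corruption. The following is a minimal Python sketch of that idea, not HathiTrust's actual implementation. The function names (`sha256_of`, `build_manifest`, `verify_manifest`) and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root (by relative POSIX path) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return the sorted relative paths that are missing or have changed."""
    root = Path(root)
    problems = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return sorted(problems)
```

In this sketch a manifest would be built once at ingest and stored apart from the content; running `verify_manifest` on each redundant copy at intervals would then show which copy needs repair.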

Services (2)

• Preservation…with Access
• Brings concerns of research libraries to bear on the way the scholarly record is cared for and made available
– Scholarly Resource
– Bibliographic Search
– Full-text search
– Collections
– Full-PDF download of public domain

Services (4)

• Rights Management
– Rights Database
– Copyright review
o US 1923-1963
o 188k candidates, 85k reviewed
o 60% in public domain
• Data Distribution
– Metadata files, Bib API, Data API
• Print on Demand

Services (5)

• Community Development Environment
• Non-Google Ingest
• Non-Book/Non-Journal Ingest
• Computational Research

Outlook

• Leverage partner resources and input to create and maintain the library of the future

• This is our library
• The more we use it, the better it will become

Governance

Governance

HathiTrust

Executive Committee
• Budget/Finances
• Decision-making

Strategic Advisory Board
• Guidance on Policy, Planning

Partnership & Resources

Funding

• Funded for an initial 5 years with base funding from partners
• 3-year review of governance and sustainability
• Budget – separately held within UMich budget system
• Cost Models
– Per GB cost of storage per year with a one-time fee on new content to build a capital fund
– Volume overlap

Cost Model 1

Reasonable costs of sustaining the archive, includes cost of replacement, capital fund

Cost Model 1

• Economies of scale keep costs low
– $0.145/volume/year for Google-digitized
– about $0.45/volume/year for IA-digitized

• Advantages not fully known until you jump in
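As a rough illustration of the per-volume rates above, the sketch below estimates an annual cost for a hypothetical partner. Only the two rates come from the slide (and the IA rate is given there as "about" $0.45); the function name `annual_cost` and the volume counts in the usage lines are assumptions made for illustration.

```python
from decimal import Decimal

# Per-volume annual rates quoted on the slide, in US dollars.
RATE_GOOGLE = Decimal("0.145")
RATE_IA = Decimal("0.45")  # the slide says "about $0.45"


def annual_cost(google_volumes: int, ia_volumes: int) -> Decimal:
    """Estimated yearly cost in USD for a partner's deposited volumes."""
    if google_volumes < 0 or ia_volumes < 0:
        raise ValueError("volume counts must be non-negative")
    return google_volumes * RATE_GOOGLE + ia_volumes * RATE_IA


if __name__ == "__main__":
    # Hypothetical partner: 100,000 Google-digitized and
    # 10,000 IA-digitized volumes (made-up numbers).
    print(annual_cost(100_000, 10_000))
```

`Decimal` is used rather than floating point so that currency amounts are not distorted by binary rounding.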

Cost Model 2

• Shared space to deal with shared problems
– Use HathiTrust as part of broader library strategies
• Beginning to see benefits of aggregating this body of materials together
– Overlap, collection development
– Coordinated print management
– Begin to ask “What is missing?”

For public domain volumes: (PD*X*C)/N

For a given in-copyright volume: IC = (C*X)/H

• Share in costs of curation
• Share in uses of relevant materials
• Voice in future directions
• Free riders?
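The two formulas above can be read as a cost-sharing rule. The slide does not define its symbols, so the sketch below rests on assumed meanings: PD is the number of public domain volumes in the repository, C is the average annual cost per volume, X is a multiplier (for example, for overhead), N is the number of partners, and H is the number of partners holding a given in-copyright volume. These readings, the function names, and the idea of summing the two parts into one fee are all assumptions, not something the slide states.

```python
def public_domain_share(pd_volumes, cost_per_volume, multiplier, partners):
    """Per-partner share of public domain costs: (PD * X * C) / N."""
    if partners <= 0:
        raise ValueError("partners must be positive")
    return pd_volumes * multiplier * cost_per_volume / partners


def in_copyright_share(cost_per_volume, multiplier, holders):
    """Per-holder share for one in-copyright volume: IC = (C * X) / H."""
    if holders <= 0:
        raise ValueError("holders must be positive")
    return cost_per_volume * multiplier / holders


def partner_fee(pd_volumes, cost_per_volume, multiplier, partners,
                holders_per_volume):
    """Assumed total: the public domain share plus the IC share of each
    in-copyright volume the partner holds.

    holders_per_volume lists, for each such volume, how many partners hold it.
    """
    fee = public_domain_share(pd_volumes, cost_per_volume, multiplier, partners)
    for holders in holders_per_volume:
        fee += in_copyright_share(cost_per_volume, multiplier, holders)
    return fee
```

Under this reading, a volume held by many partners costs each of them little, while a uniquely held volume is paid for in full by its one holder, which is where the overlap data mentioned above would enter.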

Cost Model 2

Staff

• Staff/Expertise – highly integrated
– Project managers, IT and communications staff, copyright experts, administrators (UM, Indiana and UC taking the lead)
• Working groups
• Shared development space

Financial contributions of partners

HathiTrust Functional Framework

Working Groups

Current
• Quality
• Discovery Interface (with OCLC)
• Collections
• Communication
• Usability

Past
• Storage
• Research Center

Google Settlement (1)

• 2005: Authors Guild and AAP sued
• Google claimed fair use
• Settlement – 2008
• Amended – Nov 2009
• Works covered
– registered with the U.S. Copyright Office, or published in Canada, the UK, or Australia
• Works not covered
– public domain, published after 5 Jan 2009

Google Settlement (2)

• Google continues scanning
• In copyright, non-commercially available out-of-print work
– Sell individual access, any book retailer - 63% of revenue to rights holders, distributed by BRR
– Display up to 20%
– Copy & paste and printing
– Rights holders can open access, distribute under CC, set printing limits
– Institutional subscription (available to libraries, fee based on FTE users)
• Includes unclaimed works
– BRR required to search for rights holders and hold revenue on their behalf
• Public access terminals
• Cash payments to Rightsholders whose works were scanned before May 5, 2009

Book Rights Registry

– Represent the interests of the Rightsholders: equal representation of Author and Publisher sub-classes on the board; one author and one publisher representative from each of the US, UK, Canada and Australia; a court-appointed representative for rights holders of unclaimed works
– Establish and maintain a database of contact information for authors and publishers
– Use commercially reasonable efforts to locate Rightsholders
– Distribute payments received from Google for the Rightsholders’ share of revenues
– Assist in the resolution of disputes between Rightsholders
– Funded by Google (initial $34.5 million, ongoing percentage of revenues)

http://www.googlebooksettlement.com/help/bin/answer.py?hl=en&answer=118704

Settlement for HathiTrust

• Complementary
– Settlement provides access to covered works, HathiTrust is preservation, trust for the future
– Research Center (75% of Google Book Search scanned from HathiTrust partner libraries)
• Specifically sanctions
– Section 108 uses, access for users with print disabilities, computational research
• Does not allow
– Fair use, sale of access, interlibrary loan, e-reserves, use in course management systems

Publishing

• Libraries would like to buy more eBooks
• Cost is high
• Not good models for consortia (multiple users)
• Move to on-demand purchase, leasing of volumes
• Do we need to own it?

Changing Library Landscape

• Leverage collective resources, expertise
– Drive costs down
– Increase discoverability, use
– Improve strength of archiving
– Reduce redundancy of collections (digital and print), effort
– Address collective challenges
• Focus on local resources and services
• Redefine who we are, what we provide
– Collections, research

Thank you!
