Harvard’s Digital Repository Service (DRS) Architecture

Harvard University Library (HUL)

Andrea Goethals, Randy Stern

December 10, 2009


Page 1: Harvard’s Digital Repository Service (DRS) Architecture

Harvard’s Digital Repository Service (DRS) Architecture

Harvard University Library (HUL)

Andrea Goethals, Randy Stern

December 10, 2009

Page 2: Harvard’s Digital Repository Service (DRS) Architecture

Today’s Agenda

1. What is the DRS?
2. DRS 1 Architecture
3. DRS 2 Highlights
4. Questions

Page 3: Harvard’s Digital Repository Service (DRS) Architecture

1. What is the DRS?

Page 4: Harvard’s Digital Repository Service (DRS) Architecture

DRS Context

• A core portion of HUL’s mission is to provide current and future access to research materials and resources, with recognition that preserving access to digital content requires different strategies, tools and skills
• Digital Preservation projects and activities (2000-)
• Digital Preservation Program (June 2008-)
  • Centerpiece: the Digital Repository Service (DRS)

Page 5: Harvard’s Digital Repository Service (DRS) Architecture

What is the DRS?

• Set of professionally managed services for preservation and access

[Diagram: services arranged between "creation/acquisition" and "use"]
• Creation & format guidelines, training, ingest service
• Metadata and content storage & monitoring service
• Preservation planning & activities, administration, management tools
• Delivery services, access restrictions, persistent names

Pages 6-13: Harvard’s Digital Repository Service (DRS) Architecture

What’s in the DRS? (eight slides with this title; no other text captured in the transcript)

Page 14: Harvard’s Digital Repository Service (DRS) Architecture

DRS by the numbers

• 103 TB of content
  • 335 TB total (counting all copies)
• 13 M files
  • 10 M image files
  • 21,000 audio files
  • 2.8 M text files
  • 851,000 compressed Google books containing 672 M files
  • 6,300 compressed web harvests containing 14 M web files
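A quick back-of-the-envelope reading of these figures, as plain arithmetic on the numbers from the slide. The variable names are illustrative, and the ratios are only as precise as the rounded figures they come from.

```python
# Figures copied from the "DRS by the numbers" slide (rounded there already).
content_tb = 103                  # unique content, in TB
total_tb = 335                    # all copies, in TB
google_books = 851_000            # compressed Google book packages
google_book_files = 672_000_000   # files inside those packages
web_harvests = 6_300              # compressed web harvests
web_harvest_files = 14_000_000    # web files inside those harvests

# Average number of stored copies per unit of content.
avg_copies = total_tb / content_tb

# Average number of files packed into each compressed container.
files_per_book = google_book_files / google_books
files_per_harvest = web_harvest_files / web_harvests

print(f"average copies of content: {avg_copies:.2f}")
print(f"files per Google book:     {files_per_book:.0f}")
print(f"files per web harvest:     {files_per_harvest:.0f}")
```

This works out to roughly 3.25 stored copies of the content on average, about 790 files per compressed Google book, and about 2,222 files per web harvest.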

Page 15: Harvard’s Digital Repository Service (DRS) Architecture

DRS growth

• Fueled by large projects
• Recent explosion – mass digitization (Google book project)

[Chart: DRS size in TB (axis 0 to 120), October 2000 to October 2009]

Page 16: Harvard’s Digital Repository Service (DRS) Architecture

Broadening content and metadata requirements

• New formats and genres, born-digital content
  • Email archiving, more audio, drawing, video
• Descriptive metadata, linkages to catalogs
• Rights management, more access restrictions
• Auxiliary content
  • Contextual material, licenses, donor agreements, collection objects, documentation, repository agents

Page 17: Harvard’s Digital Repository Service (DRS) Architecture

2. DRS 1 Architecture

Page 18: Harvard’s Digital Repository Service (DRS) Architecture

DRS System Architecture

[Diagram: components connected over TCP/IP and NFS]
• Metadata Storage Database
• DRS Web Admin Tools
• Delivery Services
• Ingest Services
• Consistency Validation Service
• Content Storage Service

Page 19: Harvard’s Digital Repository Service (DRS) Architecture

Metadata Storage Database

• DRS-1 objects are modeled as related files
• File metadata:
  • Administrative (owners, projects, deposit dates, owner IDs, etc.)
  • Technical (format MIME type & format-specific data)
  • Role, purpose, quality
  • No descriptive metadata
  • Access restrictions (public, Harvard-only, dark)
  • MD5 file digest and byte count
• Relationship triples
  • “is_part_of”, “is_preservation_replacement_for”, etc.
  • 21 relationship types
  • ~13M files, 12.3M relationships
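The "related files" model can be pictured with a small sketch. Everything here is an assumption for illustration: the field names, the integer file IDs and the helper function are hypothetical and are not the DRS-1 schema. Only the metadata categories and the two relationship names come from the slide.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FileRecord:
    """One file's metadata row (hypothetical field names)."""
    file_id: int
    owner: str        # administrative
    mime_type: str    # technical
    role: str         # role / purpose / quality
    access: str       # "public", "harvard-only" or "dark"
    md5: str          # file digest
    byte_count: int


# A relationship triple: (subject file id, relationship type, object file id).
Triple = Tuple[int, str, int]


def subjects_of(triples: List[Triple], relationship: str, object_id: int) -> List[int]:
    """Return the ids of all files that stand in `relationship` to `object_id`."""
    return [subj for subj, rel, obj in triples
            if rel == relationship and obj == object_id]


# Hypothetical example: two page files belong to file 100, and file 201
# replaces file 101 for preservation purposes.
triples: List[Triple] = [
    (101, "is_part_of", 100),
    (102, "is_part_of", 100),
    (201, "is_preservation_replacement_for", 101),
]

print(subjects_of(triples, "is_part_of", 100))
```

In this reading an "object" is not stored as a unit. It exists only as the set of file records reachable through triples such as "is_part_of".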

Page 20: Harvard’s Digital Repository Service (DRS) Architecture

Content Storage Service

Bit preservation

• Redundancy, heterogeneity, extensibility, scalability, simple file access protocol
• Access demands high availability and high performance delivery
• Functional requirements:
  • At least three copies in three physical locations
  • Two media types
  • Two on-line copies for high availability
  • One near-line copy, one off-line copy

Page 21: Harvard’s Digital Repository Service (DRS) Architecture

Content Storage Service

Storage provider

• Sun SAM/QFS Storage Archive Manager
• 2 file classes: highuse and lowuse
• Archiving rules
  • High use files
    • Copy 1 on disk at local server center
    • Copy 2 on disk at remote server center
    • Copy 3 on tape in library
    • Copy 4 on tape off line at Harvard Depository
  • Low use files
    • Copy 1 on disk at remote server center
    • Copy 2 on tape in library
    • Copy 3 on tape off line at Harvard Depository
• High speed cache for access
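The archiving rules can be written down as data and compared with the functional requirements stated for the Content Storage Service. This is a sketch only: the `Copy` type and `summarize` helper are hypothetical, and treating every disk copy as "on-line" is an assumption, not something the slides state.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Copy:
    number: int
    medium: str      # "disk" or "tape"
    placement: str   # wording taken from the slide
    online: bool     # ASSUMPTION: disk copies count as on-line


# The two file classes and their copies, as listed on the slide.
ARCHIVING_RULES: Dict[str, List[Copy]] = {
    "highuse": [
        Copy(1, "disk", "local server center", True),
        Copy(2, "disk", "remote server center", True),
        Copy(3, "tape", "library", False),
        Copy(4, "tape", "off line at Harvard Depository", False),
    ],
    "lowuse": [
        Copy(1, "disk", "remote server center", True),
        Copy(2, "tape", "library", False),
        Copy(3, "tape", "off line at Harvard Depository", False),
    ],
}


def summarize(copies: List[Copy]) -> Dict[str, int]:
    """Count copies, distinct media types and on-line copies for one class."""
    return {
        "copies": len(copies),
        "media_types": len({c.medium for c in copies}),
        "online_copies": sum(c.online for c in copies),
    }


for file_class, copies in ARCHIVING_RULES.items():
    print(file_class, summarize(copies))
```

Under these assumptions both classes have at least three copies on two media types, while only the high use class has two on-line copies. That is consistent with reserving the high availability requirement for frequently delivered files, though the slides do not say so explicitly.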

Page 22: Harvard’s Digital Repository Service (DRS) Architecture

Consistency Validation Service

• Continuous monitoring for file system and database consistency
  • Crawls the file system and confirms that every disk file has a DRS metadata record
  • Crawls the DRS metadata records table and confirms that every file referenced exists in the file system
  • Confirms that the MD5 checksum for each file is the same as recorded in the database
• Reports errors to administrators
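The three checks above can be sketched in a few lines. This is a minimal illustration, not the DRS implementation: the metadata table is stood in for by a plain dictionary mapping relative paths to MD5 digests, and the function names and error strings are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a file's MD5 hex digest, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_consistency(root: Path, records: Dict[str, str]) -> List[str]:
    """Compare files under `root` with `records` (relative path -> MD5 digest)."""
    on_disk = {p.relative_to(root).as_posix()
               for p in root.rglob("*") if p.is_file()}
    recorded = set(records)
    errors: List[str] = []

    # 1. Every file on disk must have a metadata record.
    for name in sorted(on_disk - recorded):
        errors.append(f"no metadata record: {name}")

    # 2. Every recorded file must exist in the file system.
    for name in sorted(recorded - on_disk):
        errors.append(f"missing from file system: {name}")

    # 3. The stored checksum must match the file's current checksum.
    for name in sorted(on_disk & recorded):
        if md5_of(root / name) != records[name]:
            errors.append(f"checksum mismatch: {name}")

    return errors
```

A real service would read the records from the database and send the returned list to administrators. At millions of files the crawl would also need to run incrementally rather than in one pass.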

Page 23: Harvard’s Digital Repository Service (DRS) Architecture

Delivery and Access Services

• Real time web delivery
  • Image delivery service
    • JPEG, JPEG 2000, TIF, GIF
  • Page turned object delivery service
    • METS + page images + page text
  • Streaming delivery service
    • Real Audio
  • File delivery service
    • PDFs
  • Web Archiving Service
• Asynchronous delivery service
  • Archival masters

Page 24: Harvard’s Digital Repository Service (DRS) Architecture

Administrative Services

• DRS Web Administrator
  • Searching, reporting, file operations, archival master download
• Page Turned Object Maintenance
  • METS structure editor
• Name Resolution Service Maintenance
  • URN create/update/report

Page 25: Harvard’s Digital Repository Service (DRS) Architecture

DRS System Architecture

[Diagram: components connected over TCP/IP and NFS]
• Metadata Storage Database
• DRS Web Admin Tools
• Delivery Services
• Ingest Services
• Consistency Validation Service
• Content Storage Service

Page 26: Harvard’s Digital Repository Service (DRS) Architecture

DRS System Architecture

Ingest Services

[Diagram: components connected over TCP/IP and NFS, with the ingest path added]
• Depositors
• BatchBuilder
• SFTP Drop Boxes
• DRS Loader
• Web Archiving Service
• Metadata Storage Database
• DRS Web Admin Tools
• Delivery Services
• Consistency Validation Service
• Content Storage Service

Page 27: Harvard’s Digital Repository Service (DRS) Architecture

DRS System Architecture

Delivery Services

[Diagram: components connected over TCP/IP and NFS, with the delivery path added]
• Catalogs – Web Sites - Google
• Load Balanced Delivery Services (shown twice)
• Metadata Storage Database
• DRS Web Admin Tools
• Consistency Validation Service
• Content Storage Service
• DRS Loader
• SFTP Drop Boxes
• BatchBuilder
• Depositors
• Web Archiving Service

Page 28: Harvard’s Digital Repository Service (DRS) Architecture

DRS System Architecture

Persistent Naming and Access Services

[Diagram: components connected over TCP/IP and NFS, with naming and access services added]
• Catalogs – Web Sites - Google
• Access Management Service
• Name Resolution Service
• Load Balanced Delivery Services (shown twice)
• Metadata Storage Database
• DRS Web Admin Tools
• Consistency Validation Service
• Content Storage Service
• DRS Loader
• SFTP Drop Boxes
• BatchBuilder
• Depositors
• Web Archiving Service

Page 29: Harvard’s Digital Repository Service (DRS) Architecture

DRS System Architecture

Storage Services

[Diagram: the full architecture over TCP/IP and NFS, with SAM/QFS storage spread across three sites]
• Site 1 Cambridge
  • Disk archive (High use, copy 1)
• Site 2 Boston
  • Disk archive (High use, copy 2)
  • Disk archive (Low use, copy 1)
• Tape library
  • Tape archive (High use, copy 3)
  • Tape archive (Low use, copy 2)
• Site 3 Westborough (media only)
  • Tape archive (High use, copy 4)
  • Tape archive (Low use, copy 3)
• Other components shown: SAM/QFS, Metadata Storage Database, DRS Web Admin Tools, Load Balanced Delivery Services (twice), Catalogs – Web Sites - Google, Access Management Service, Name Resolution Service, Consistency Validation Service, DRS Loader, SFTP Drop Boxes, BatchBuilder, Depositors, Web Archiving Service

Page 30: Harvard’s Digital Repository Service (DRS) Architecture

Storage Services

Implementation

• Sun SAM-QFS 4.6
  • Rule-based automatic archiving – no “backups”
  • Unified file name space
• Dual Sun T2000 Solaris SAM servers
  • Redundant servers at site 1, DR failover at site 2
  • Nightly samfsdump from site 1 - samfsrestore at site 2
• EMC CLARiiON disk storage arrays
  • RAID 1+0 FC cache / RAID 5 SATA disk archives
  • 35 TB CX3-40 at site 1, 109 TB CX3-80 at site 2
• StorageTek SL500 tape library
  • LTO-4
• In production since Feb 2008

Page 31: Harvard’s Digital Repository Service (DRS) Architecture

Storage Services

Redundancy

[Diagram: redundant storage servers, disk arrays and tape connected over private and public TCP/IP]
• Public TCP/IP: app server and web server (NFS, HTTP)
• Private TCP/IP: three Sun T2000 servers (Solaris 10, SAM-QFS), each labeled SAM and NFS
• On-site, UIS: two FC switches; EMC CX3-40 (FC / SATA, RAID 1+0 / RAID 5) with two SPs, 4 GB cache each
  • Staging cache
  • Disk archive (High use, copy 1)
• Off-site, HBSP: EMC CX3-80 (FC / SATA, RAID 1+0 / RAID 5) with two SPs, 8 GB cache each
  • Disk archive (High use, copy 2)
  • Disk archive (Low use, copy 1)
• StorageTek SL 500, LTO-4: robot and four drives
  • Tape archive (High use, copy 3)
  • Tape archive (Low use, copy 2)
• Off-site, HD: media only, LTO-4
  • Tape archive (High use, copy 4)
  • Tape archive (Low use, copy 3)

Page 32: Harvard’s Digital Repository Service (DRS) Architecture

Metadata Storage Service

Implementation

• DRS metadata storage
  • Oracle 10g
  • Live production server – copy 1
  • Data Guard failover copy – copy 2
  • Legato tape backups – copy 3

Page 33: Harvard’s Digital Repository Service (DRS) Architecture

Ingest Services

Implementation

• Batch deposit of SIPs to SFTP drop boxes
• DRS Batch Loader operates 8AM-8PM
• 51 object owners – libraries, museums
• ~12 depositors
• 234 project codes
• Daily weekday deposits average ~60 GB/day

Page 34: Harvard’s Digital Repository Service (DRS) Architecture

Delivery Services

Implementation

• High availability design
  • Redundant public access servers
    • Delivery, access management, name resolution
  • Cisco Content Switch
    • Load balancing, sticky sessions
  • MRTG monitoring
• Change control – no downtime on updates
• Red Hat Enterprise Linux, Java 1.5, Tomcat
• Tomcat and log4j logging and statistics

Page 35: Harvard’s Digital Repository Service (DRS) Architecture

3. DRS 2 Highlights

Page 36: Harvard’s Digital Repository Service (DRS) Architecture

Scope of work

• Builds on the early 2008 storage upgrade
• 2008-~2013
• Affects every part of the DRS!
  • Expanded data model
  • New and different metadata
  • Object descriptors
  • Content models
  • Preservation plans
  • Enhanced deposit tools
  • New management applications
  • New backend services
• First major release: Summer 2011

Page 37: Harvard’s Digital Repository Service (DRS) Architecture

Object descriptors

• A METS metadata file per object on the file system alongside content files
• Descriptive, administrative, preservation, technical and structural metadata
• Describes the object, all its files and bitstreams and related significant events
• Gives the metadata the same secure storage as the content files
• Self-contained, portable objects
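To make the idea of a per-object descriptor concrete, here is a minimal sketch that writes a bare METS document listing an object's files. It is not the DRS 2 descriptor profile: the `build_descriptor` function, the input dictionary keys and the single flat `structMap` are assumptions for illustration, and the descriptive, administrative and preservation sections named on the slide are left out. Only standard METS element and attribute names are used.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def q(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_descriptor(object_id: str, files: List[Dict[str, str]]) -> str:
    """Build a minimal METS document.

    Each entry in `files` needs the (hypothetical) keys
    id, href, mime_type, md5 and size, all as strings.
    """
    root = ET.Element(q("mets"), {"OBJID": object_id})
    file_sec = ET.SubElement(root, q("fileSec"))
    file_grp = ET.SubElement(file_sec, q("fileGrp"))
    struct_map = ET.SubElement(root, q("structMap"))
    top_div = ET.SubElement(struct_map, q("div"), {"TYPE": "object"})

    for entry in files:
        # Technical and fixity metadata travel with the file entry.
        file_el = ET.SubElement(file_grp, q("file"), {
            "ID": entry["id"],
            "MIMETYPE": entry["mime_type"],
            "CHECKSUM": entry["md5"],
            "CHECKSUMTYPE": "MD5",
            "SIZE": entry["size"],
        })
        ET.SubElement(file_el, q("FLocat"), {
            "LOCTYPE": "URL",
            f"{{{XLINK_NS}}}href": entry["href"],
        })
        # The structural map points back at each file by its ID.
        ET.SubElement(top_div, q("fptr"), {"FILEID": entry["id"]})

    return ET.tostring(root, encoding="unicode")
```

Writing the returned string next to the content files gives each object its own description on disk, which is what makes the object portable without the database.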

Page 38: Harvard’s Digital Repository Service (DRS) Architecture

Some technical challenges

• Amount of metadata to store
  • Bitstream description
  • Many elements (esp. MODS, MIX)
• Efficient, scalable search implementation
  • Database, index, combination?
• Keeping metadata in sync
  • Database, object descriptors on file system
• Effect on system of continued growth
  • Consistency checks, migrations, format analysis, etc.
• HRCI requirements
• Email archiving