Post on 06-Feb-2016

Harvard’s Digital Repository Service (DRS) Architecture

Harvard University Library (HUL)

Andrea Goethals, Randy Stern

December 10, 2009

Today’s Agenda

1. What is the DRS?
2. DRS 1 Architecture
3. DRS 2 Highlights
4. Questions

1. What is the DRS?

DRS Context

• A core portion of HUL’s mission is to provide current and future access to research materials and resources, with recognition that preserving access to digital content requires different strategies, tools and skills
• Digital Preservation projects and activities (2000-)
• Digital Preservation Program (June 2008-)
    • Centerpiece: the Digital Repository Service (DRS)

What is the DRS?

• Set of professionally managed services for preservation and access

[Diagram. End labels: creation/acquisition, use. Service groups shown:]
• creation & format guidelines, training, ingest service
• metadata and content storage & monitoring service
• preservation planning & activities, administration, management tools
• delivery services, access restrictions, persistent names

What’s in the DRS?

DRS by the numbers

• 103 TB of content
    • 335 TB total (counting all copies)
• 13 M files
    • 10 M image files
    • 21,000 audio files
    • 2.8 M text files
    • 851,000 compressed Google books containing 672 M files
    • 6,300 compressed web harvests containing 14 M web files
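
The figures on this slide support a few quick ratios. The sketch below only divides numbers copied from the slide; the variable names are illustrative, and the results are rough averages rather than anything the presenters reported.

```python
# Figures copied from the "DRS by the numbers" slide.
CONTENT_TB = 103
TOTAL_TB_ALL_COPIES = 335
GOOGLE_BOOK_PACKAGES = 851_000
FILES_IN_GOOGLE_BOOKS = 672_000_000
WEB_HARVEST_PACKAGES = 6_300
FILES_IN_WEB_HARVESTS = 14_000_000

# Average number of stored copies per unit of content.
avg_copies = TOTAL_TB_ALL_COPIES / CONTENT_TB

# Average number of files packed inside each compressed container.
files_per_book = FILES_IN_GOOGLE_BOOKS / GOOGLE_BOOK_PACKAGES
files_per_harvest = FILES_IN_WEB_HARVESTS / WEB_HARVEST_PACKAGES

print(f"average copies of each TB: {avg_copies:.2f}")         # 3.25
print(f"files per compressed book: {files_per_book:.0f}")     # 790
print(f"files per web harvest:     {files_per_harvest:.0f}")  # 2222
```

An average of about 3.25 copies falls between three and four, which is at least consistent with the three-copy and four-copy archiving rules described in the DRS 1 architecture section.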

DRS growth

• Fueled by large projects
• Recent explosion – mass digitization (Google book project)

[Chart: DRS size in TB (axis 0 to 120) by year, Oct-00 through Oct-09]

Broadening content and metadata requirements

• New formats and genres, born-digital content
    • Email archiving, more audio, drawing, video
• Descriptive metadata, linkages to catalogs
• Rights management, more access restrictions
• Auxiliary content
    • Contextual material, licenses, donor agreements, collection objects, documentation, repository agents

2. DRS 1 Architecture

DRS System Architecture

[Diagram. Components: DRS Web Admin Tools, Ingest Services, Delivery Services, Consistency Validation Service, Content Storage Service, Metadata Storage Database. Connections: TCP/IP, NFS.]

DRS-1 Objects are modeled as related files

• File Metadata:
    • Administrative (owners, projects, deposit dates, owner IDs, etc.)
    • Technical (format mime-type & format specific data)
    • Role, purpose, quality
    • No descriptive metadata
    • Access restrictions (public, Harvard-only, dark)
    • MD5 file digest and byte count
• Relationship triples
    • “is_part_of”, “is_preservation_replacement_for”, etc.
    • 21 relationship types
    • ~13M files, 12.3M relationships
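
One way to picture this model is a flat file record plus a list of subject/predicate/object triples. The following is a minimal sketch under that reading, not the DRS schema: the class names, field names and file identifiers are hypothetical, and only the two predicate names and the three access values come from the slide.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import NamedTuple

ACCESS_FLAGS = {"public", "harvard-only", "dark"}  # values named on the slide


@dataclass(frozen=True)
class FileRecord:
    """Per-file metadata; field names are illustrative assumptions."""
    file_id: str
    owner: str
    mime_type: str
    role: str
    access: str
    md5: str
    byte_count: int

    def __post_init__(self):
        if self.access not in ACCESS_FLAGS:
            raise ValueError(f"unknown access flag: {self.access}")


class Relationship(NamedTuple):
    subject: str
    predicate: str
    obj: str


class RelationshipIndex:
    """Stores triples and answers lookups from either end."""

    def __init__(self):
        self._by_subject = defaultdict(list)
        self._by_object = defaultdict(list)

    def add(self, subject, predicate, obj):
        rel = Relationship(subject, predicate, obj)
        self._by_subject[subject].append(rel)
        self._by_object[obj].append(rel)
        return rel

    def subjects_of(self, predicate, obj):
        """All files that point at `obj` with `predicate`."""
        return [r.subject for r in self._by_object.get(obj, [])
                if r.predicate == predicate]

    def objects_of(self, subject, predicate):
        """All files that `subject` points at with `predicate`."""
        return [r.obj for r in self._by_subject.get(subject, [])
                if r.predicate == predicate]


# Hypothetical file identifiers, for illustration only.
index = RelationshipIndex()
index.add("page-0001.jp2", "is_part_of", "book-42.xml")
index.add("page-0002.jp2", "is_part_of", "book-42.xml")
index.add("page-0001-v2.jp2", "is_preservation_replacement_for", "page-0001.jp2")

print(index.subjects_of("is_part_of", "book-42.xml"))
```

Because an object is only the set of files connected by triples, assembling it means following relationships such as “is_part_of” back from a parent file.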

Content Storage Service: Bit preservation

• Redundancy, heterogeneity, extensibility, scalability, simple file access protocol
• Access demands high availability and high performance delivery
• Functional requirements:
    • At least three copies in three physical locations
    • Two media types
    • Two on-line copies for high availability
    • One near-line copy, one off-line copy

Content Storage Service: Storage provider

• SUN SAM/QFS Storage Archive Manager
• 2 file classes: highuse and lowuse
• Archiving rules
    • High use files
        • Copy 1 on disk at local server center
        • Copy 2 on disk at remote server center
        • Copy 3 on tape in library
        • Copy 4 on tape off line at Harvard Depository
    • Low use files
        • Copy 1 on disk at remote server center
        • Copy 2 on tape in library
        • Copy 3 on tape off line at Harvard Depository
• High speed cache for access
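
The archiving rules can be written down as data and compared with the functional requirements from the bit preservation slide. This is a sketch with one stated assumption: it treats disk copies as on-line, tape in the library as near-line, and tape at the Harvard Depository as off-line. The slides do not spell out that mapping, and the names below are illustrative rather than SAM/QFS configuration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    number: int
    medium: str  # "disk" or "tape"
    tier: str    # "online", "nearline" or "offline" (assumed mapping)
    place: str


ARCHIVING_RULES = {
    "highuse": [
        Copy(1, "disk", "online", "local server center"),
        Copy(2, "disk", "online", "remote server center"),
        Copy(3, "tape", "nearline", "tape library"),
        Copy(4, "tape", "offline", "Harvard Depository"),
    ],
    "lowuse": [
        Copy(1, "disk", "online", "remote server center"),
        Copy(2, "tape", "nearline", "tape library"),
        Copy(3, "tape", "offline", "Harvard Depository"),
    ],
}


def summarize(copies):
    """Count copies, distinct media types, and copies per tier."""
    tiers = Counter(c.tier for c in copies)
    return {
        "copies": len(copies),
        "media_types": len({c.medium for c in copies}),
        "online": tiers["online"],
        "nearline": tiers["nearline"],
        "offline": tiers["offline"],
    }


def meets_preservation_minimum(copies):
    """Three or more copies, two media types, a near-line and an off-line copy."""
    s = summarize(copies)
    return (s["copies"] >= 3 and s["media_types"] >= 2
            and s["nearline"] >= 1 and s["offline"] >= 1)


for file_class, copies in ARCHIVING_RULES.items():
    print(file_class, summarize(copies), meets_preservation_minimum(copies))
```

Under this mapping both classes meet the preservation minimum, while only the highuse class has the two on-line copies that the high availability requirement calls for.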

Consistency Validation Service

• Continuous monitoring for file system and database consistency
    • Crawls the file system and confirms that every disk file has a DRS metadata record
    • Crawls the DRS metadata records table and confirms that every file referenced exists in the file system
    • Confirms that the MD5 checksum for each file is the same as recorded in the database
    • Reports errors to administrators
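
The three checks above can be sketched in a few lines. This is an illustration only: a plain dictionary of relative path to MD5 digest stands in for the DRS metadata records table, and the function names and error labels are hypothetical.

```python
import hashlib
import os
import tempfile


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_consistency(root, records):
    """Compare files under `root` with `records` ({relative path: md5 hex})."""
    errors = []
    on_disk = set()
    # Check 1: every file on disk has a metadata record.
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            on_disk.add(rel)
            if rel not in records:
                errors.append(("no_metadata_record", rel))
    # Checks 2 and 3: every record has a file, and its checksum matches.
    for rel, expected in records.items():
        if rel not in on_disk:
            errors.append(("missing_file", rel))
        elif md5_of(os.path.join(root, rel)) != expected:
            errors.append(("checksum_mismatch", rel))
    return sorted(errors)


def run_demo():
    """Build a tiny file tree with one of each kind of problem."""
    with tempfile.TemporaryDirectory() as root:
        for name, data in [("a.tif", b"image"), ("b.txt", b"text"),
                           ("orphan.pdf", b"pdf")]:
            with open(os.path.join(root, name), "wb") as fh:
                fh.write(data)
        records = {
            "a.tif": hashlib.md5(b"image").hexdigest(),     # consistent
            "b.txt": hashlib.md5(b"other").hexdigest(),     # digest differs
            "gone.wav": hashlib.md5(b"audio").hexdigest(),  # no file on disk
        }
        return check_consistency(root, records)


for kind, path in run_demo():
    print(kind, path)
```

A production service would read records from the database and send the error list to administrators; here the list is simply returned.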

Delivery and Access Services

• Real time web delivery
    • Image delivery service: JPEG, JPEG 2000, TIF, GIF
    • Page turned object delivery service: METS + page images + page text
    • Streaming delivery service: Real Audio
    • File delivery service: PDFs
    • Web Archiving Service
• Asynchronous delivery service
    • Archival masters

Administrative Services

• DRS Web Administrator
    • Searching, reporting, file operations, archival master download
• Page Turned Object Maintenance
    • METS structure editor
• Name Resolution Service Maintenance
    • URN create/update/report
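
The create/update/report operations on URNs suggest a registry that keeps a name stable while its target can change. The sketch below is a generic in-memory illustration of that idea; the class, the URN strings and the URLs are hypothetical and are not the Name Resolution Service interface.

```python
class NameResolver:
    """In-memory stand-in for a persistent-name registry."""

    def __init__(self):
        self._locations = {}

    def create(self, urn, url):
        """Register a new name; refuse to overwrite an existing one."""
        if urn in self._locations:
            raise ValueError(f"{urn} already exists")
        self._locations[urn] = url

    def update(self, urn, url):
        """Point an existing name at a new location."""
        if urn not in self._locations:
            raise KeyError(urn)
        self._locations[urn] = url

    def resolve(self, urn):
        """Return the current location for a name."""
        return self._locations[urn]

    def report(self):
        """List all names and their current locations."""
        return sorted(self._locations.items())


resolver = NameResolver()
resolver.create("urn:example:object-1", "https://delivery-a.example.org/1")
# The content moves; the published name stays the same.
resolver.update("urn:example:object-1", "https://delivery-b.example.org/1")
print(resolver.resolve("urn:example:object-1"))
```

Catalogs and web sites that cite the name keep working after the update, because only the registry entry changes.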

DRS System Architecture: Ingest Services

[Diagram. Labels added for ingest: Depositors, BatchBuilder, SFTP Drop Boxes, DRS Loader, Web Archiving Service. Other components: DRS Web Admin Tools, Delivery Services, Consistency Validation Service, Content Storage Service, Metadata Storage Database. Connections: TCP/IP, NFS.]

DRS System Architecture: Delivery Services

[Diagram. Labels added for delivery: Load Balanced Delivery Services; Catalogs – Web Sites – Google. Other components: Depositors, BatchBuilder, SFTP Drop Boxes, DRS Loader, Web Archiving Service, DRS Web Admin Tools, Consistency Validation Service, Content Storage Service, Metadata Storage Database. Connections: TCP/IP, NFS.]

DRS System Architecture: Persistent Naming and Access Services

[Diagram. Labels added: Access Management Service, Name Resolution Service. Other components: Load Balanced Delivery Services; Catalogs – Web Sites – Google; Depositors, BatchBuilder, SFTP Drop Boxes, DRS Loader, Web Archiving Service, DRS Web Admin Tools, Consistency Validation Service, Content Storage Service, Metadata Storage Database. Connections: TCP/IP, NFS.]

DRS System Architecture: Storage Services

[Diagram. Sites: Site 1 Cambridge, Site 2 Boston, Site 3 Westborough. Storage labels: Disk archive (High use, copy 1); Disk archive (High use, copy 2); Disk archive (Low use, copy 1); Tape archive (High use, copy 3); Tape archive (Low use, copy 2); Media only: Tape archive (High use, copy 4), Tape archive (Low use, copy 3); SAM/QFS. Other components: Load Balanced Delivery Services; Catalogs – Web Sites – Google; Access Management Service, Name Resolution Service, Depositors, BatchBuilder, SFTP Drop Boxes, DRS Loader, Web Archiving Service, DRS Web Admin Tools, Consistency Validation Service, Metadata Storage Database. Connections: TCP/IP, NFS.]

Storage Services: Implementation

• Sun SAM-QFS 4.6
    • Rule-based automatic archiving – no “backups”
    • Unified file name space
• Dual Sun T2000 Solaris SAM servers
    • Redundant servers at site 1, DR failover at site 2
    • Nightly samfsdump from site 1, samfsrestore at site 2
• EMC CLARiiON disk storage arrays
    • RAID 1+0 FC cache / RAID 5 SATA disk archives
    • 35 TB CX3-40 at site 1, 109 TB CX3-80 at site 2
• StorageTek SL500 tape library
    • LTO-4
• In production since Feb 2008

Storage Services: Redundancy

[Diagram. Labels, grouped as far as the extracted order allows:]
• Private TCP/IP
• On-site, UIS: two Sun T2000 servers (Solaris 10, SAM-QFS); two FC switches; EMC CX3-40 (two SPs with 4 GB cache each; FC / SATA, RAID 1+0 / RAID 5) holding the staging cache and Disk archive (High use, copy 1)
• Off-site, HBSP: one Sun T2000 server (Solaris 10, SAM-QFS); EMC CX3-80 (two SPs with 8 GB cache each; FC / SATA, RAID 1+0 / RAID 5) holding Disk archive (High use, copy 2) and Disk archive (Low use, copy 1)
• StorageTek SL500, LTO-4 (robot, four drives): Tape archive (High use, copy 3), Tape archive (Low use, copy 2)
• Off-site, HD: media only, LTO-4: Tape archive (High use, copy 4), Tape archive (Low use, copy 3)
• Public TCP/IP: SAM servers export NFS to the app server and web server (NFS, HTTP)

Metadata Storage Service: Implementation

• DRS metadata storage
    • Oracle 10G
    • Live production server – copy 1
    • Dataguard failover copy – copy 2
    • Legato tape backups – copy 3

Ingest Services: Implementation

• Batch deposit of SIPs to SFTP drop boxes
• DRS Batch Loader operates 8AM-8PM
• 51 object owners – libraries, museums
• ~12 depositors
• 234 project codes
• Daily weekday deposits average ~60 GB/day

Delivery Services: Implementation

• High availability design
    • Redundant public access servers: delivery, access management, name resolution
    • Cisco Content Switch: load balancing, sticky sessions
    • MRTG monitoring
• Change control – no downtime on updates
• Red Hat Enterprise Linux, Java 1.5, Tomcat
• Tomcat and log4j logging and statistics

3. DRS 2 Highlights

Scope of work

• Builds on the early 2008 storage upgrade
• 2008-~2013
• Affects every part of the DRS!
    • Expanded data model
    • New and different metadata
    • Object descriptors
    • Content models
    • Preservation plans
    • Enhanced deposit tools
    • New management applications
    • New backend services
• First major release: Summer 2011

Object descriptors

• A METS metadata file per object on the file system alongside content files
• Descriptive, administrative, preservation, technical and structural metadata
• Describes the object, all its files and bitstreams and related significant events
• Gives the metadata the same secure storage as the content files
• Self-contained, portable objects
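
To make the idea concrete, here is a minimal sketch that builds a bare METS skeleton (file section and structural map only) for one object. It is not the DRS 2 descriptor profile: the object ID, file IDs and paths are hypothetical, MD5 is assumed as the checksum type because DRS 1 records MD5 digests, and the descriptive, administrative and preservation sections listed above are left out.

```python
import hashlib
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def q(namespace, tag):
    """Return a namespace-qualified name in ElementTree's {uri}tag form."""
    return f"{{{namespace}}}{tag}"


def build_descriptor(object_id, files):
    """Build a METS tree; files is a list of (file_id, path, mime, content bytes)."""
    root = ET.Element(q(METS_NS, "mets"), {"OBJID": object_id})
    file_sec = ET.SubElement(root, q(METS_NS, "fileSec"))
    file_grp = ET.SubElement(file_sec, q(METS_NS, "fileGrp"))
    struct_map = ET.SubElement(root, q(METS_NS, "structMap"))
    top_div = ET.SubElement(struct_map, q(METS_NS, "div"), {"TYPE": "object"})
    for file_id, path, mime, data in files:
        # One <file> entry per content file, with size and checksum.
        file_el = ET.SubElement(file_grp, q(METS_NS, "file"), {
            "ID": file_id,
            "MIMETYPE": mime,
            "SIZE": str(len(data)),
            "CHECKSUM": hashlib.md5(data).hexdigest(),
            "CHECKSUMTYPE": "MD5",
        })
        # The path is relative, so the descriptor travels with the files.
        ET.SubElement(file_el, q(METS_NS, "FLocat"),
                      {"LOCTYPE": "URL", q(XLINK_NS, "href"): path})
        # The structural map points back at the file by ID.
        ET.SubElement(top_div, q(METS_NS, "fptr"), {"FILEID": file_id})
    return root


descriptor = build_descriptor("example-object-1", [
    ("F1", "page-0001.jp2", "image/jp2", b"first page"),
    ("F2", "page-0001.txt", "text/plain", b"page text"),
])
descriptor_xml = ET.tostring(descriptor, encoding="unicode")
print(descriptor_xml)
```

Writing `descriptor_xml` next to the content files is what makes the object self-contained: the metadata then lives on the same storage as the content it describes.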

Some technical challenges

• Amount of metadata to store
    • Bitstream description
    • Many elements (esp. MODS, MIX)
• Efficient, scalable search implementation
    • Database, index, combination?
• Keeping metadata in sync
    • Database, object descriptors on file system
• Effect on system of continued growth
    • Consistency checks, migrations, format analysis, etc.
• HRCI requirements
• Email archiving

4. Questions?

andrea_goethals@harvard.edu

randy_stern@harvard.edu
