Digital Libraries: An Aid to Education through Interoperable Open Archives of Resources

U. Kentucky, February 24, 2000

Edward A. Fox

[email protected] http://fox.cs.vt.edu

CC CS DLRL Internet TIC

Virginia Tech, Blacksburg, VA, USA

Acknowledgements (Selected)

Sponsors: ACM, Adobe, IBM, Microsoft, NSF, OCLC, US Dept. of Education, …

Co-PIs: Marc Abrams, Robert Akscyn, John Eaton, Gail McMillan

Students: Fernando Das Neves, Robert France, Neill Kipp, Paul Mather, Constantinos Phanouriou, James Powell, Ohm Sornil, David Watkins, Chang Zhang, Jianxin Zhao

Remember!

VT (education and technology)
PetaPlex, Envision, MARIAN, NRG
DL, 5S (to understand and build DLs)
CSTC, CRIM (add to, use) -> NSDL
OAI (convention, meetings, proposals)

Virginia Tech Background
Largest university in Virginia, land-grant, town population 35K plus 25K students
Blacksburg Electronic Village, since 1992, with 80% of community on Internet
Net.Work.Virginia, largest ATM network, with over 750 sites, for education, research, government
LMDS, Local Multipoint Distribution Service, gigabit wireless networking - 1/3 of Virginia
Math Emporium, 500 workstations
Faculty Development Initiative, round 2

Supporting Authors (Teachers and Learners)
[Diagram labels: Faculty Development Initiative; ETD Support; Virginia Tech Digital Library; University Libraries; Classifying / Cataloging / Preserving; Collaboration; Visualization; MM; IR; EPub; HCI; Model Classroom of the 21st Century (Technology Showcase, ATM Video Conf., Develop MM); New Media Center; Dig. Library & Archives; McBryde 110]

Model Classroom of 21st Century
ATM-based VTEL system
Apple G3, Media 100, 120G, BetaCam SP, FireWire, one of almost any device
Large Smart Board
IBM Multimedia PC, …
Supports spring multimedia class (CS4624)
Tom Wilkinson’s staff and systems supporting innovation in learning grants

ACITC
Advanced Communications and Information Technology Center, opening summer 2000
Connects to the library, with a focus on IT
1/3 high-tech (multimedia) classrooms
1/3 digital/electronic library (reading room)
1/3 research labs: 10, including:
– Digital Library Research Laboratory (DLRL)
– Center for Applied Technologies in the Humanities
– Center for Human-Computer Interaction (HCI) – extending 5 year $2M NSF Research Infrastructure project that has usability laboratories (individuals, 2-person teams, groups)
– HPC; Multimedia; Visualization (CAVE), ...

End-to-End Innovation
[Diagram labels: NET.WORK.VIRGINIA, World’s Most Advanced Public Network (Statewide Access); Internet 2 / NGI, Multimedia Network Access Point (Regional / National Access); Blacksburg Electronic Village, LMDS Wireless Technology, Multimedia Service Access Point (Local Community Access); links labeled OC3]

PetaPlex
Digital Library Machine (“super” object store)
Parallel computer / storage utility for scale of 1000 to 100,000,000 gigabytes (1 Tbyte - 100 Pbyte)
Knowledge Systems Incorporated is supplying VT-PetaPlex-1 for $250,000 with
– high speed backbone connection(s)
– 2.5 terabytes through 100 “nanoservers”:
– Each = Network connection + IBM 25GB disk + 233 MHz Pentium II + Linux
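The capacity figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1000 gigabytes per terabyte, as the slide's "1000 gigabytes = 1 Tbyte" implies) and one 25 GB disk per nanoserver; the function names are illustrative only.

```python
# Figures taken from the slide: 100 nanoservers, one 25 GB disk each.
NANOSERVERS = 100
DISK_GB = 25
GB_PER_TB = 1000  # assumption: decimal units, matching "1000 gigabytes (1 Tbyte)"


def total_capacity_tb(nodes: int, disk_gb: int) -> float:
    """Raw capacity in terabytes for a given number of nanoservers."""
    return nodes * disk_gb / GB_PER_TB


def nodes_needed(target_gb: int, disk_gb: int) -> int:
    """Nanoservers required to reach a target capacity (ceiling division)."""
    return -(-target_gb // disk_gb)


print(total_capacity_tb(NANOSERVERS, DISK_GB))  # capacity of VT-PetaPlex-1 in TB
print(nodes_needed(100_000_000, DISK_GB))       # nodes for the 100 Pbyte upper bound
```

With 25 GB disks, the upper end of the stated range would need on the order of millions of nanoservers, which is why the per-node cost matters so much in this design.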

PetaPlex Complex
[Diagram: Front End Machine (RS/6000, 1G RAM, 4 Proc.) connected to an array of Nanoservers and to Service Machines 1 through 4]

PetaPlex Service Machine Possibilities
Front-end provides handle/repository abstraction through hashing
Small object server
Large object server
– video on demand
– streaming audio
Information retrieval server
Proxy / cache server (e.g., 1 terabyte server of 1000 worldwide for Comsat/Intelsat)
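The slide says only that the front end maps handles to storage "through hashing". A minimal sketch of that idea follows; the choice of SHA-256, the modulo placement, the `FrontEnd` class, and the example handle are all assumptions for illustration, not the PetaPlex design.

```python
import hashlib


def nanoserver_for(handle: str, num_servers: int) -> int:
    """Map an object handle to a nanoserver index with a stable hash."""
    digest = hashlib.sha256(handle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


class FrontEnd:
    """Hypothetical front end: callers see handles, never server locations."""

    def __init__(self, num_servers: int = 100) -> None:
        # Each dict stands in for the object store on one nanoserver.
        self.stores = [{} for _ in range(num_servers)]

    def put(self, handle: str, data: bytes) -> int:
        index = nanoserver_for(handle, len(self.stores))
        self.stores[index][handle] = data
        return index

    def get(self, handle: str) -> bytes:
        index = nanoserver_for(handle, len(self.stores))
        return self.stores[index][handle]


front_end = FrontEnd()
front_end.put("hdl:example/object-1", b"payload")
print(front_end.get("hdl:example/object-1"))
```

Simple modulo placement moves most objects when the number of servers changes; a production system would more likely use consistent hashing or a directory, which this sketch does not attempt.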

PetaPlex Top View
[Diagram: 4 ft. per side]

PetaPlex Side View
[Diagram: 4 ft. wide, 8 ft. high, 15 shelves. Roles: support, cooling, power]

Comparison: Network of Workstations (NOW), Beowulf, PetaPlex

Architecture
– NOW: Cluster of general purpose workstation class machines using off-the-shelf network interconnect
– Beowulf: General purpose PCs, interconnected with a customized network
– PetaPlex: Special purpose architecture tuned for superstorage. Uses a mix of off-the-shelf PC components and specialized network interconnects.

Cost per node
– NOW: Workstation prices, $2000-$2500 per node
– Beowulf: Mid to low-end PC prices, $1200-$1800 per node
– PetaPlex: Mass produced components will reduce price to around $100/node

Target area
– NOW: Computation
– Beowulf: Computation
– PetaPlex: Storage; computation is a secondary function

Filesystem support
– NOW: UNIX flavors
– Beowulf: UNIX flavors
– PetaPlex: Replaces location dependent files with location independent fine-grained URN named objects

ENVISION
NSF “A User-Centered Database from the Computer Science Literature” (1991-93)
Collected bib/typesetter data, converted to SGML
Scanned thousands of page images
MARIAN search engine - can be made available (also applied to the Virginia Tech library catalog)
Used as part of a prototype object-based DL, with tailored visualization interface (L. Nowell dissertation)

Envision Results Window

MARIAN
Multiple Access Retrieval of Information with ANnotations (Musical: Marian the Librarian …)
Evolved from 1980’s CODER system to a distributed Online Public Access Catalog (OPAC), then DL backend, now becoming a full DL system
From C/C++ to Java by Jianxin Zhao
Future uses: NDLTD, NUDL, PetaPlex

MARIAN Layers
[Diagram: Database Layer; Search Engine Layer; User Information Layer; User Interface Layer; multiple Users]

MARIAN Parallelism
[Chart: Java part response time vs. query rate comparison (type 1 requests). X axis: query rate (#/min), 0 to 500. Y axis: response time (ms), 0 to 4000. Series: all modules in one machine; one "webgate"; two "webgate"s; four "webgate"s]

MARIAN Response Time
[Chart: Four "webgate"s, decomposed time delay vs. query rate. X axis: query rate (#/min), 0 to 500. Y axis: time delay (ms), 0 to 4000. Legend: system; after Java server]

France Dissertation
Key developer since CODER
Applying computational linguistics efforts with machine readable dictionaries
Applying opportunistic handling of term lists for ranking, usable displays (“to be or not to be, that is the”)
Developing and evaluating variety of interfaces

Network Research Group
NSF 3 year grant on WWW logging, characterization, and optimization: Abrams, Fox, Pollard (CNS)
Core member of Web Characterization Activity of World-Wide Web Consortium
Providing DL to support WCA (at http://www.w3c.org/WCA):
– logs
– tools
– publications

Example: NRG Tools

WebJamma: Artificial HTTP traffic generator

WebWatcher: HTTP traffic monitoring and logging system

CLFmunge: Anonymizes common log format

HTTPdump: Protocol decode for tcpdump

Caching proxy simulator

Splus programs

Log description and validation interface & routines
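To make the idea of anonymizing Common Log Format (CLF) entries concrete, here is a minimal sketch. It is not the CLFmunge tool itself: the regular expression, the salted SHA-256 token, the `host-` prefix, and the function name are assumptions chosen for illustration.

```python
import hashlib
import re

# CLF: host ident authuser [date] "request" status bytes
CLF_LINE = re.compile(
    r'^(\S+) (\S+) (\S+) (\[[^\]]+\]) ("[^"]*") (\d{3}) (\S+)$'
)


def anonymize_clf_line(line: str, salt: str) -> str:
    """Replace the client host with a salted hash and blank the user fields."""
    match = CLF_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a Common Log Format line")
    host, _ident, _user, when, request, status, size = match.groups()
    # The same host always maps to the same token, so per-client
    # access patterns survive while the address itself does not.
    token = hashlib.sha256((salt + host).encode("utf-8")).hexdigest()[:12]
    return f"host-{token} - - {when} {request} {status} {size}"


sample = '192.0.2.1 - alice [24/Feb/2000:10:00:00 -0500] "GET /index.html HTTP/1.0" 200 1043'
print(anonymize_clf_line(sample, salt="keep-this-secret"))
```

Hashing with a secret salt keeps repeated visits from one client linkable for characterization studies; whether URLs in the request field also need scrubbing depends on the log and is not handled here.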

How do universities and digital libraries relate?

Each U. will have its own digital library. Hence there will be large numbers (i.e., critical mass).

All students will learn how to use and how to “feed” digital libraries (and bring those habits to future work as needs and skills).

All digital library problems (esp. federation, flexibility, personalization) appear at U’s (so they are a good type of testbed, with willing collaborators in-place for developing solutions).

Start with NDLTD, extend to NUDL

SPIRE Visualization

Digital Libraries --- Virginia Tech
MARIAN (NLM)
CS DL Prototype - ENVISION (NSF, ACM)
TULIP (Elsevier, OCLC)
BEV History Base (NSF, Blacksburg)
DL for CS Education - EI (NSF, ACM)
WATERS, NCSTRL (NSF)
NDLTD (SURA, US Dept. of Education)
CSTC (NSF, ACM), CRIM (NSF, SIGMM)
WCA (Log) Repository (W3C)
VT-PetaPlex-1 (Knowledge Systems)

Digital Libraries --- Objectives
World Lit.: 24hr / 7day / from desktop
Integrated “super” information systems: 5S: streams, structures, spaces, scenarios, societies
Ubiquitous, Higher Quality, Lower Cost
Education, Knowledge Sharing, Discovery
Disintermediation -> Collaboration
Universities Reclaim Property
Interactive Courseware, Student Works
Scalable, Sustainable, Usable, Useful

DLs: Why of Global Interest?
National projects can preserve antiquities and heritage: cultural, historical, linguistic, scholarly
Knowledge and information are essential to economic and technological growth, education
DL - a domain for international collaboration
– wherein all can contribute and benefit
– which leverages investment in networking
– which provides useful content on Internet & WWW
– which will tie nations and peoples together more strongly and through deeper understanding

DL Challenges

Preservation - so people will trust DLs

Supporting infrastructure - networks, ...

Scalability, sustainability, interoperability

DL industry - critical mass by covering libraries, archives, museums, corporate info, govt info, personal info - “quality WWW” integrating IR, HT, MM, ...

– Need tools & methods to make them easier to build

Locating Digital Libraries in Computing and Communications Technology Space
[Diagram: axes are Computing (flops), Communications (bandwidth, connectivity), and Digital content (less to more). Digital Libraries technology trajectory: intellectual access to globally distributed information]

Digital Library Content
[Diagram: Content Types. Labels: Articles, Reports, Books; Text Documents; Speech, Music; Video; Audio; (Aerial) Photos; Geographic Information; Models, Simulations; Software, Programs; Genome (Human, animal, plant); Bio Information; 2D, 3D, VR, CAT; Images and Graphics]

Definition: Digital Libraries are complex systems that
help satisfy info needs of users (societies)
provide info services (scenarios)
organize info in usable ways (structures)
present info in usable ways (spaces)
communicate info with users (streams)

5S Layers

Societies

Scenarios

Spaces

Structures

Streams

Definition: 5S Framework
Societies: interacting people (and computers)
Scenarios: services, functions, operations, methods
Spaces: domains + constraints (e.g., distance, adjacency): 2D, vector, probability
Structures: relations, trees, nodes and arcs
Streams: sequences of items (text, audio, video, network traffic)
(5 Element System: Fire, Wood, Earth, Metal, Water)

5S: Components
Societies: roles, rituals, reasons, relationships, artifacts
Scenarios: acquire, index, consult, administer, preserve
Spaces: physical, temporal, functional, presentational, conceptual
Structures: architectures, taxonomies, schema, grammars, links, objects
Streams: granularities, protocols, paths, flows, turbulences

5S: Combinations
Societies + Scenarios = user model
Societies + Scenarios + Spaces = user interface
Streams + Structures = markup
Streams + Structures + Scenarios = object
Structures + Scenarios = DBMS

How to Build a Digital Library
Understand the problem (using the 5S Framework)
Solve the problem (using the Star Methodology)
– design, develop, evaluate,
– refine, operate

Neill Kipp Dissertation
Training interested groups about 5S and the Star Methodology, refining the Framework to have solid mathematical foundation
Case studies of projects at Virginia Tech or involving VT staff/students: CSTC, NDLTD, NARA (National Archives, with SAIC), Lexis, ...
Open also to study DL projects elsewhere
Focusing too on the design artifacts developed and related issues of efficient description and representation (esp. with markup, hypermedia)

DLs Shorten the Chain from
[Diagram: Editor, Publisher, A&I, Consolidator, Library, Reviewer]

DLs Shorten the Chain to
[Diagram: Editor, A&I, Digital Library, Reviewer]

DLs Shorten the Chain to
[Diagram: Author, Reader, Digital Library, Editor, Reviewer, Teacher, Learner, Librarian]

Enhancing Learning with DLs
[Diagram: Digital Libraries and Interactive Experiences lead to Enhancing Learning. Digital library labels: NDLTD (Networked DL of Theses & Dissertations); Student Portfolios, Self-Archiving; Gray Literature (Dept. of Educ.); W3C WCA Repository (Logs, Tools, Publications); CSTC (CS Teaching Center); CRIM (Curriculum Resources Inter. MM); Computer Science (with NSF and ACM)]

Enhancing Learning with DLs
[Diagram: Digital Libraries and Interactive Experiences lead to Enhancing Learning. Labels: Authoring (text, markup, hypermedia, cataloging - DC); Submitting Work (ETD) (Metadata, PDF, XML); Preserving (using stds, migrating, versioning); Adding to Digital Library (student); Discovering, Browsing, Searching, Retrieving; Annotating, Downloading, Installing, Feedback; 5S Framework: Societies, Scenarios, Streams, Spaces, Structures; Using Digital Library (direct) (info literacy); Indirectly Using Digital Library (embedded, by agent, ...); Using DL Contents (tools, datasets, env's, courseware, ...); Collaboration (in/around DL and its artifacts - distance educ.); Other Interactive Learning Activities]

NSF Education Innovation (EI)
NSF “Interactive Learning with a Digital Library in Computer Science” (1993-98)
45 online courses (esp. Internet, IR, MM, Professionalism, overall EI project pages): 100+K accesses/wk
Tools: SWAN (visualization), QUIZIT
Evaluation
– traditional
– network logging and analysis
– tools for visualization

Digital Library Courseware
http://ei.cs.vt.edu/~dlib/
WWW pages or large PDF copy files
Online quizzes based on book by Michael Lesk (Morgan Kaufmann Publishers)
Contents based on book, with several other popular topics added (e.g., agents)
Separate pages to supplement: Definitions, Resources (People, Projects), and References

CS -> CSTC -> CRIM
NSF and ACM Education Committee are funding a 2 year project “A Computer Science Teaching Center” - CSTC - http://www.cstc.org/
College of NJ, U. Ill. Springfield, Virginia Tech
Focus initially on labs, visualization, multimedia
Multimedia part is also supported by a 2nd grant to Virginia Tech and The George Washington University: http://www.cstc.org/~crim/ (with curricular guidelines also under development)

CS Teaching Center (CSTC)
Instead of building large, expensive multimedia packages that become obsolete and are difficult to re-use, concentrate on small knowledge units.

Learners benefit from having well-crafted modules that have been reviewed and tested.

Use digital libraries to build a powerful base of support for learners, upon which a variety of courses, self-study tutorials & reference resources can be built. [See NSF NSDL - National Science (math, engineering, technology education) Digital Library (formerly SMETE-lib) at http://www.dlib.org/smete/public/smete-public.html]

ACM Education Board and SIG support, new NSF grant with COLLEGIS Research Institute and others …

Browsing (1)

Browsing (2)

CRIM Rationale
MM field needs properly trained personnel
Support this with resources + curricula
Together these help us move toward a DL for Interactive MM -> CS -> NSDL
Benefits will go to teachers (who have more to build upon) and students (who will have a richer environment for learning)

CRIM Project Activities
Workshops, other ways to involve community
WWW site including DL in CSTC re MM

– Devised cataloging schema, designed interface

– Referring to all MM syllabi and curriculum

– Inviting learning resources for the CRIM DL, with reviews, reuse certifications

Publish report on MM curriculum through ACM and IEEE, after careful review

CSTC, CRIM will lead to ACM Journal of Educational Resources in Computing (JERiC)

Virginia Tech CRIM Related Courses
Art: Digital Art and Design course (Photoshop)
CS: 1604 Introduction to the Internet (1 cr.)
CS: 3604 Professionalism in Computing
CS: 4624 Multimedia, Hypertext and Information Access (3 cr.)
CS: 5604 Information Storage & Retrieval (3 cr.)
CS: 6604 Digital Libraries (3 cr.)

SMETE Library -> NSDL (from www.dlib.org to NSF DLI-2)

Context: Global movement toward Digital Libraries (see April 1998 CACM)

NSF effort: Science, Mathematics, Engineering, and Technology Education Digital Library (focussed on undergraduates)
– 3 workshops, yearly increasing funds / new calls

SMETE Library likely to operate as distributed federation, with separate parts for each key discipline, and to lead to a global effort

Open Archives Initiative History
xxx at LANL = Los Alamos National Laboratory (Ginsparg) for high-energy physics - 1991
CSTR + WATERS = NCSTRL (Lagoze) - 1994
xxx + NCSTRL = CoRR collaboration - 1998
UPS (Universal Preprint Service) – 1999 mtg
– Herbert Van de Sompel (U. Ghent, SFX) …
– Dublin Core (DC), XML
– Dienst protocol and software (Lagoze)

Renamed late 1999 as OAI

OAI Philosophy
Self-archiving = submission mechanism
Long-term storage system = archive
Open interface = harvesting mechanism
Data provider + service provider
Start with e-prints / pre-prints
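The split between data providers and service providers can be illustrated with a small in-memory sketch. It is not the Dienst protocol or any OAI interface: the class names, method names, and the metadata dictionary layout are hypothetical, chosen only to show an archive exposing metadata that a separate service harvests and indexes.

```python
from dataclasses import dataclass


@dataclass
class Record:
    identifier: str
    title: str
    full_text: str


class DataProvider:
    """An archive: accepts self-archived submissions and exposes metadata."""

    def __init__(self, archive_id: str) -> None:
        self.archive_id = archive_id
        self._records: dict[str, Record] = {}

    def submit(self, record: Record) -> None:
        # Self-archiving: the author deposits directly into the archive.
        self._records[record.identifier] = record

    def list_metadata(self) -> list[dict[str, str]]:
        # Open interface: only metadata is exposed to harvesters.
        return [
            {"archive": self.archive_id, "identifier": r.identifier, "title": r.title}
            for r in self._records.values()
        ]


class ServiceProvider:
    """A service built on harvested metadata, e.g. cross-archive search."""

    def __init__(self) -> None:
        self.index: list[dict[str, str]] = []

    def harvest(self, provider: DataProvider) -> None:
        self.index.extend(provider.list_metadata())

    def search(self, term: str) -> list[dict[str, str]]:
        return [m for m in self.index if term.lower() in m["title"].lower()]


archive = DataProvider("example-archive")
archive.submit(Record("0001", "Example preprint on digital libraries", "..."))
service = ServiceProvider()
service.harvest(archive)
print(service.search("digital"))
```

Keeping full content in the archive and moving only metadata lets many services be layered over the same archives without each one copying the documents.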

Open Archives (protoproto)
ArXiv & Los Alamos National Lab
CogPrints & U. Southampton
NACA & NASA (reports)
NCSTRL & Cornell U.
NDLTD & Virginia Tech
RePEc & U. Surrey
(Washington U. & EconWPA)

Open Archives Members
Original Participants in the Open Archives Initiative
– Caroline Arms, Library of Congress
– Leslie Carr, University of Southampton
– Mark Doyle, American Physical Society
– Dale Flecker, Harvard University
– Edward A. Fox, Virginia Tech
– Michael Friedman, HighWire Press, Stanford University
– Paul M. Gherman, Vanderbilt University
– Paul Ginsparg, Los Alamos National Laboratory & xxx
– Stevan Harnad, University of Southampton
– Thomas Krichel, University of Surrey & RePEc
– Carl Lagoze, Cornell University
– Rick Luce, Los Alamos National Laboratory
– Clifford Lynch, Coalition for Networked Information
– Kurt Maly, Old Dominion University
– Michael L. Nelson, NASA Langley Research Center
– John Ober, California Digital Library
– Bob Parks, Washington University & EconWPA
– Herbert Van de Sompel, University of Ghent
– Eric F. Van de Velde, California Institute of Technology
– Don Waters, The Andrew W. Mellon Foundation
– Ken Weiss, California Digital Library

Others Joining (selected)
– University of Virginia – Jim French, Worthy Martin, Thornton Staples
– NEC Research Institute - C. Lee Giles and Steve Lawrence
– Internet Archive - Kurt Bollacker, Marlita Kahn
– India - University of Mysore – Shalini Urs
– Mexico – University of Monterrey - David Garza Salazar

VT Open Archives – Initial Set
NDLTD – global (DC – listserv)
NDLTD – VT (MARC, DC)
CSTC (DC format, ACM format)
W3C WCA logs (XML, atomic)

Approaches to Open Archives
[Diagram: Build By Discipline; Build By Institution. Access by: Author, Category, Interdisciplinary, Year, Language, Query, …]

Institutions / Disciplines
Universities: part, all, sets of
Disciplines: buy in as in Germany
– Physics, Chemistry, Math, Sociology, Educ.
Basis for Federation:
– Language – German, Spanish, French, CJK

– Politics – OhioLink, National Library of Portugal, ISTEC for Latin America

– Economics – Developing Countries (UNESCO)

Open Archives Initiative (OAI)
www.openarchives.org
Santa Fe meeting, Oct. 21-22, 1999 and protoproto
Next mtg June 3, San Antonio, between HT’00 & DL’00
LANL, CNI, DLF, Mellon, …
Convention (see Feb. D-Lib Magazine)
Archives -> Open Archives

– Support unique archive identifiers

– Implement Open Archives Metadata Set (DC-based, using XML)

– Implement Dienst harvesting interface

– Register the archive

Build tools, layer other services: linking, searching, …
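As an illustration of a DC-based metadata record expressed in XML, the sketch below builds one with Python's standard library. It does not reproduce the Open Archives Metadata Set: the wrapping `record` element, the chosen fields, and the placeholder values are assumptions; only the Dublin Core element namespace is the standard one.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core element set namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"


def dc_record(identifier: str, title: str, creator: str, date: str) -> str:
    """Serialize a few Dublin Core fields as an XML string."""
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in (
        ("identifier", identifier),
        ("title", title),
        ("creator", creator),
        ("date", date),
    ):
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(record, encoding="unicode")


# Placeholder values only; the identifier combines an archive name
# with a local ID to mimic a unique archive identifier.
xml_text = dc_record("example-archive:0001", "Example thesis title",
                     "Example Author", "2000-02-24")
print(xml_text)
```

A harvester that understands only Dublin Core can read such records from any archive, which is the point of agreeing on a minimal shared metadata set.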

Tiered Model of Interoperability

Mediator services

Metadata harvesting

Document models

Repository of Digital Objects

Repository Access Protocol

handle

Digital object

terms and conditions

Figure 1. Layers Related to Open Archives Initiative

[Figure labels, in source order]

Services: Search/Browse; Authoring; Citation Checking; Submission; Metadata Creation; Editorial: Reviewing, Certification

Registry
– Archives: Name, ID, Description, Terms and Conditions, …
– Metadata Formats: Name, XML DTD, …
– Archive Formats: Name, Standard, Preservation Process, …
– Protocols
– Tools

Services: Copy-Edit / Add Value; Citation DB Updating; Authority Control; Preservation Conversion; Text/MM Editing; Gazetteer; Cataloging; Collaboration; Annotation; Summarization; Citation / Linking; SFX; CiteSeer

Repositories: Repository; NCSTRL Repository; EconWPA Repository; RePEc Repository

Repository for NDLTD
– Open Archives Harvesting Protocol
– Metadata Formats: OA Metadata Set, NDLTD Standard (DC-based) Set
– Transaction Log
– Training Resources
– VT Partition: Record (Metadata), Record (Full Content), …
– UVA Partition: Metadata, Content
– Caltech Partition: Metadata, Content

Interoperability for NDLTD
Naming
Data exchange: share MARC records
Performance, reliability: replication (mirroring)
Federated searching
– Query on content, metadata, links/relationships
Dynamic linking / extended services
Browsing, viz., working in concept space
Annotating/reviewing/certifying

Perspective/goals: removing barriers

Mechanisms

Sharing
– Join federation, run software
– Make metadata and archive available

Aggregating
– By discipline
– By institution
– By genre

Automating
– Workflow
– Harvesting and providing services
– Federated searching
– Dynamic linking
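Federated searching across member archives can be sketched as sending one query to every partition and merging the ranked hits. Everything here is an assumption for illustration: the partition names reuse those shown for the NDLTD repository, the titles are placeholders, and the term-count score is a stand-in for whatever ranking a real search engine would apply.

```python
import heapq


def federated_search(term: str, partitions: dict[str, list[dict[str, str]]],
                     limit: int = 10) -> list[tuple[int, str, str]]:
    """Query every partition and merge hits into one ranked list."""
    needle = term.lower()
    hits = []
    for name, records in partitions.items():
        for record in records:
            # Toy relevance score: how often the term occurs in the title.
            score = record["title"].lower().split().count(needle)
            if score:
                hits.append((score, name, record["title"]))
    # Highest score first; ties fall back to partition name and title.
    return heapq.nlargest(limit, hits)


partitions = {
    "VT": [{"title": "Placeholder thesis on library systems"}],
    "UVA": [{"title": "Library library placeholder dissertation"}],
    "Caltech": [{"title": "Placeholder thesis on physics"}],
}
for score, source, title in federated_search("library", partitions):
    print(score, source, title)
```

Merging raw scores only works when every partition ranks the same way; with independent search engines, scores usually need normalizing before they can be combined.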

OAI-Related Proposals

CNPQ – collaboration with PUC Rio

CONACyT – collaboration with UDLA and Monterrey (Mexico)

FIPSE preproposal – GSDI + OAI – with Caltech, U. Cincinnati (OhioLink), U. Kentucky, U. Iowa, USF (FL center for library automation)

Remember!

VT (education and technology)
PetaPlex, Envision, MARIAN, NRG
DL, 5S (to understand and build DLs)
CSTC, CRIM (add to, use) -> NSDL
OAI (convention, meetings, proposals)