Repositories COMP3016 Public, managed, web collections of knowledge

Upload: georgina-holland

Post on 13-Jan-2016


Page 1: Repositories COMP3016 Public, managed, web collections of knowledge

Repositories
COMP3016

Public, managed, web collections of knowledge


Repositories & Green OA

• Open Archives Initiative - October 1999
  – Agreed OAI-PMH for metadata sharing
  – (2008: OAI-ORE for data exchange)
• Among the participants
  – Paul Ginsparg (arXiv)
  – Carl Lagoze (NCSTRL)
  – Stevan Harnad (Cogprints)
• EPrints
  – proposed as a ‘build your own repository’ solution
  – enables institutions and groups to participate in the OAI metadata sharing initiative


Example Repository

http://eprints.ecs.soton.ac.uk/

A repository for a school of Electronics and Computer Science.

It achieves 80-100% full text self-deposit


Looking at the Differences between a Repository and a Website through a Whistlestop Tour of the ECS Repository

• Repository provides:
  – Different views
  – Different ways of exporting data
  – Metadata capture


EPrints Walkthrough: Browse

• Browse Views aka “Collections”
  – Subdivisions
  – Ordering


EPrints Walkthrough: Views

• View content lists as “tag clouds” or “communities of practice”


EPrints Walkthrough: Searches

Advanced search allows useful reports to be generated:
• journal articles funded by NIH published in 2007
• conference posters with a PowerPoint file in the Maths department
• refereed conference papers or journal articles with full text
• old journal articles that haven’t been cited
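
A report such as "refereed conference papers or journal articles with full text" can be pictured as a filter over repository records. The sketch below is only illustrative: the `Record` fields and the sample data are assumptions, not the EPrints schema or its search API.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical fields chosen for this example only.
    title: str
    type: str        # e.g. "article", "conference_paper", "poster"
    year: int
    refereed: bool
    full_text: bool


def refereed_with_full_text(records):
    """Refereed conference papers or journal articles that have full text."""
    wanted = {"article", "conference_paper"}
    return [r for r in records
            if r.type in wanted and r.refereed and r.full_text]


records = [
    Record("A", "article", 2007, True, True),
    Record("B", "poster", 2007, False, True),
    Record("C", "conference_paper", 2006, True, False),
    Record("D", "conference_paper", 2008, True, True),
]

report = refereed_with_full_text(records)
print([r.title for r in report])
```

Each of the other example reports would be a different predicate over the same records, which is why rich metadata capture at deposit time matters.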


EPrints Walkthrough: Search Results


EPrints Walkthrough: Exporting Search results

• The output from any search can be exported…

– as RSS feeds

– as METS, Dublin Core or other DL interoperability formats

– as BibTeX, refer, EndNote & other bibliography formats

– to Google Earth, SIMILE Timeline or other web services and mashups
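
As a rough illustration of what a bibliography export involves, the sketch below renders one record as a BibTeX entry. The function name `to_bibtex`, the record layout and the sample values are assumptions made for this example; it is not how EPrints implements its export plugins.

```python
def to_bibtex(key, record):
    """Render a record dict as a BibTeX @article entry."""
    fields = ["author", "title", "journal", "year"]
    lines = [f"@article{{{key},"]
    for name in fields:
        if name in record:
            lines.append(f"  {name} = {{{record[name]}}},")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical sample record.
record = {
    "author": "Doe, Jane",
    "title": "An Example Paper",
    "journal": "Journal of Examples",
    "year": 2007,
}
print(to_bibtex("doe2007", record))
```

The same record could be passed to other renderers (Dublin Core, RSS and so on), since every export format is a different view of the same captured metadata.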


EPrints Walkthrough: Infrastructure Exports


EPrints Walkthrough: Infrastructure Exports

Publication lists and data imported by and branded by other research group portals.


EPrints Walkthrough: Depositing a New Item


EPrints Walkthrough: Import Items from Various Sources


Reference Model for a Web Site

• A web site is very simple in its functionality; a repository (as we have seen) is more complex

[Diagram: UPLOAD, REQUEST, DOWNLOAD]


Reference Model for an Open Archival Information System (OAIS)

• SIP/DIP/AIP = Submission/Dissemination/Archival Information Package
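
The three package types can be sketched as data moving through ingest and dissemination. This is a simplified illustration, not the OAIS specification: the class fields, the SHA-256 fixity check and the function names `ingest` and `disseminate` are assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib


@dataclass
class SIP:                      # Submission Information Package
    content: bytes
    metadata: dict


@dataclass
class AIP:                      # Archival Information Package
    content: bytes
    metadata: dict
    checksum: str               # fixity information added at ingest
    ingested_at: str


@dataclass
class DIP:                      # Dissemination Information Package
    content: bytes
    metadata: dict


def ingest(sip: SIP) -> AIP:
    """Turn a submission into an archival package with preservation info."""
    return AIP(
        content=sip.content,
        metadata=dict(sip.metadata),
        checksum=hashlib.sha256(sip.content).hexdigest(),
        ingested_at=datetime.now(timezone.utc).isoformat(),
    )


def disseminate(aip: AIP) -> DIP:
    """Check fixity, then hand the consumer a copy without internal fields."""
    if hashlib.sha256(aip.content).hexdigest() != aip.checksum:
        raise ValueError("content failed fixity check")
    return DIP(content=aip.content, metadata=dict(aip.metadata))


aip = ingest(SIP(b"paper.pdf bytes", {"title": "Example"}))
dip = disseminate(aip)
print(dip.metadata["title"], aip.checksum[:8])
```

The point of the sketch is the contrast with a plain web site: what is stored (the AIP) is neither what was uploaded nor what is downloaded, because the archive adds and checks its own preservation information.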


What is a Repository?

• A repository is a platform that allows you to capture items in any format:
  – text,
  – video,
  – audio,
  – data.
• It distributes them over the web, mainly via Google.
• It indexes your work, so users can search and retrieve your items.
• It preserves your digital work over the long term.


What are the benefits of using a repository?

• Some example benefits:
  – Getting your research results out quickly, to a worldwide audience
  – Reaching a worldwide audience through exposure to search engines such as Google
  – Storing reusable teaching materials that you can use with course management systems
  – Archiving and distributing material you would currently put on your personal website
  – Storing examples of students’ projects (with the students’ permission)
  – Showcasing students’ theses (again with permission)
  – Keeping track of your own publications/bibliography
  – Having a persistent network identifier for your work, that never changes or breaks
  – No more page charges for images. You can point to your images’ persistent identifiers in your published articles.


What does a Repository look like?

http://www.dspace.org/images/stories/dspace-diagram.pdf


Application Architecture

• Repository systems are organised into three tiers, each consisting of a number of components
• Each layer only invokes the layer below it, i.e. the application layer may not use the storage layer directly


The Storage Layer

• The storage layer is responsible for physical storage of metadata and content

• Repositories use a relational database to store all information about the organization of content, metadata about the content, information about e-people and authorization, and the state of currently-running workflows.


The Business Logic Layer

• The business logic layer deals with managing the content of the archive, users of the archive (e-people), authorization, and workflow


The Application Layer

• The application layer contains components that communicate with the world outside of the individual repository, for example the Web user interface and the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) service
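
The rule that each tier only invokes the tier directly below it can be sketched with three small classes. This is a toy model under stated assumptions: the class and method names, the in-memory dict standing in for the relational database, and the single authorized depositor are all invented for illustration and do not reflect DSpace or EPrints code.

```python
class StorageLayer:
    """Physical storage of metadata and content (here: an in-memory dict)."""

    def __init__(self):
        self._items = {}

    def put(self, item_id, metadata, content):
        self._items[item_id] = (metadata, content)

    def get(self, item_id):
        return self._items[item_id]


class BusinessLogicLayer:
    """Content management and authorization; talks only to storage."""

    def __init__(self, storage):
        self._storage = storage
        self._depositors = {"alice"}   # hypothetical authorized e-people

    def deposit(self, user, item_id, metadata, content):
        if user not in self._depositors:
            raise PermissionError(f"{user} may not deposit")
        self._storage.put(item_id, metadata, content)

    def describe(self, item_id):
        metadata, _content = self._storage.get(item_id)
        return metadata


class ApplicationLayer:
    """Outward-facing components; talks only to the business logic layer."""

    def __init__(self, logic):
        self._logic = logic

    def handle_deposit(self, user, item_id, title, content):
        self._logic.deposit(user, item_id, {"title": title}, content)
        return f"deposited {item_id}"

    def handle_view(self, item_id):
        return self._logic.describe(item_id)["title"]


app = ApplicationLayer(BusinessLogicLayer(StorageLayer()))
print(app.handle_deposit("alice", "item-1", "Example", b"..."))
print(app.handle_view("item-1"))
```

Because `ApplicationLayer` holds no reference to `StorageLayer`, every request from the outside world has to pass through the authorization check in the middle tier.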


The Problem of Long-Term Data

• Researchers have hard disks which are just organised enough to support daily activity, but researchers’ careers last for forty years
  – Disk crashes
  – Stolen laptops
  – Software upgrades that go wrong
  – Backups that never quite get restored
  – Drawers and folders full of old stuff that eventually fall off the radar
• “Lost in some research assistant’s computer, the data are often irretrievable or an undecipherable string of digits”

Lost in a Sea of Science Data. S. Carlson, The Chronicle of Higher Education (23/06/2006)


Where Are My Files Now?


Preservation, Persistence and Sustainability

• Persistent URLs needed to last across many generations of organisation (e.g. CS Group, CS Dept, Dept of ECS, School of ECS)
  – PURLs, DOIs or Handles
  – Or just persistent policies for URL naming!
• Persistent storage across many generations of hardware (e.g. desktop vs cloud)
• Persistent readability across many generations of software
  – Format migration
  – WordPerfect – Word 5.1 – Office 2007
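
The idea behind PURLs, DOIs and Handles is a level of indirection: the published identifier stays fixed while the location it resolves to can change. The sketch below shows only that idea. The `Resolver` class, the `id:1234` identifier and the example.org URLs are hypothetical and do not describe any real resolver service.

```python
class Resolver:
    """Maps a persistent identifier to the item's current location."""

    def __init__(self):
        self._locations = {}

    def register(self, pid, url):
        self._locations[pid] = url

    def relocate(self, pid, new_url):
        # Only the mapping changes; the identifier itself never does.
        if pid not in self._locations:
            raise KeyError(pid)
        self._locations[pid] = new_url

    def resolve(self, pid):
        return self._locations[pid]


resolver = Resolver()
resolver.register("id:1234", "http://cs.example.org/papers/1234")

# The department is renamed and the server moves, but citations of
# "id:1234" keep working because only the resolver entry is updated.
resolver.relocate("id:1234", "http://ecs.example.org/id/eprint/1234")
print(resolver.resolve("id:1234"))
```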


Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

• A way of asking an archive about the stuff it’s got in it.
• Allows services to harvest metadata from many archives
  – Google harvests data, OAI-PMH harvests metadata
• Allows services to provide search and other functionality
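
An OAI-PMH request is an ordinary HTTP request whose query string carries a `verb` (such as `Identify` or `ListRecords`) plus arguments such as `metadataPrefix`, `set` and `from`. The sketch below only builds such URLs and does not contact any server. The helper name `oai_request` and the endpoint `http://repository.example.org/oai` are assumptions for illustration.

```python
from urllib.parse import urlencode


def oai_request(base_url, verb, **arguments):
    """Build an OAI-PMH request URL; 'from_' is sent as the 'from' argument."""
    params = {"verb": verb}
    for name, value in arguments.items():
        # 'from' is a Python keyword, so callers pass 'from_' instead.
        params[name.rstrip("_")] = value
    return f"{base_url}?{urlencode(params)}"


BASE = "http://repository.example.org/oai"   # hypothetical endpoint

# Ask the archive to describe itself.
print(oai_request(BASE, "Identify"))

# Ask for all records as unqualified Dublin Core.
print(oai_request(BASE, "ListRecords", metadataPrefix="oai_dc"))

# Ask only for records in one set that changed since a given date.
print(oai_request(BASE, "ListRecords", metadataPrefix="oai_dc",
                  set="physics", from_="2016-01-12"))
```

A real harvester would fetch these URLs, parse the XML responses and follow any resumption tokens for long result lists; those steps are left out here.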


[Diagram: archives exposing records, and harvesters aggregating them]

Archives:
• CogPrints (GNU EPrints): 1600 records
• www.orgprints.org (GNU EPrints): 264 records
• arXiv (custom software): 230,000 records
• D-Space @ MIT (D-Space software): 769 records

Harvesters:
• Harvester #1 (Psychology Service): 500 Cogprints, 169 D-Space
• Harvester #2 (Physics Aggregator): 150,000 arXiv, 162 D-Space
• Harvester #3 (General Service): 230,000 arXiv, 769 D-Space, 264 OrgPrints, 1600 CogPrints
  – 150,162 “improved” records from the physics aggregator


Day 1

• Archive Service A: 1403 records
• Harvester to Service A: “Give me everything!”
  – Service A: “OK! (1403 records)”
  – Harvester’s copy of Service A: 1403 records


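Day 2

• Archive Service A: 1501 records
• Archive Service B: 123 records
• Harvester to Service B: “Give me everything in set ‘physics’”
  – Service B: “OK! (15 records)”
  – Harvester’s copy of Service B: 15 records
• Harvester to Service A: “Give me all records which were added or changed since yesterday”
  – Service A: “OK! (102 new records, 4 deleted records, 23 changed records)”
  – Harvester’s copy of Service A: 1403 records, updated to 1501 records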


Day 3

• Archive Service A: 1490 records
• Archive Service B: 123 records
• Harvester to Service B: “Give me everything in set ‘physics’ which was added or changed since yesterday.”
  – Service B: “OK! (0 new records, 1 record changed)”
  – Harvester’s copy of Service B: 15 records
• Harvester to Service A: “Give me all records which were added or changed since yesterday”
  – Service A: “OK! (25 new records, 36 deleted records, 3 changed records)”
  – Harvester’s copy of Service A: 1501 records, updated to 1490 records
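
The record counts in the three-day walkthrough follow from applying each incremental response to the harvester's local copy: new records are added, changed records are overwritten, deleted records are dropped. The sketch below replays the Service A numbers. The function `apply_update`, the record identifiers and the version strings are assumptions made for illustration, not part of OAI-PMH.

```python
def apply_update(store, new, changed, deleted):
    """Apply one incremental harvest response to the harvester's local copy."""
    for record_id, record in new.items():
        store[record_id] = record
    for record_id, record in changed.items():
        store[record_id] = record          # overwrite the stale copy
    for record_id in deleted:
        store.pop(record_id, None)
    return len(store)


# Day 1: full harvest of Archive Service A (1403 records).
store = {f"a{i}": "v1" for i in range(1403)}

# Day 2: 102 new, 23 changed, 4 deleted.
new = {f"a{i}": "v1" for i in range(1403, 1505)}
changed = {f"a{i}": "v2" for i in range(23)}
deleted = [f"a{i}" for i in range(100, 104)]
print(apply_update(store, new, changed, deleted))   # 1403 + 102 - 4 = 1501

# Day 3: 25 new, 3 changed, 36 deleted.
new = {f"a{i}": "v1" for i in range(1505, 1530)}
changed = {f"a{i}": "v3" for i in range(3)}
deleted = [f"a{i}" for i in range(200, 236)]
print(apply_update(store, new, changed, deleted))   # 1501 + 25 - 36 = 1490
```

Changed records do not affect the count, which is why only the new and deleted figures appear in the arithmetic.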


Now, OAI-ORE (Object Reuse and Exchange)

• Repositories are being filled with complex sets of data and metadata.
• ORE is a protocol that allows repositories, agents, and services to use and reuse compound digital objects beyond the boundaries of the holding repositories:
  – to facilitate discovery of objects,
  – to reference (link to) objects (and their parts),
  – to obtain a variety of disseminations of objects,
  – to aggregate and disaggregate objects,
  – to harvest and deposit (register, put) objects,
  – to enable processing by automated agents.


ORE: Compound Information Objects

• Identified, bounded aggregations of distinct information units that when combined form a logical whole
  – Scholarly publication with an article and supporting information including dataset, video, etc.
  – Digitized book with multiple chapters, each chapter containing multiple scanned pages.
  – Archaeological assemblies of images, maps, charts, and find lists
  – Flickr ‘sets’, comments/annotations etc.


ORE: Publishing compound objects to the Web (1)

• Web graph without any explicit compound objects
• each information object identified with a URI
• and there are links between them


ORE: Publishing compound objects to the Web (2)

• Compound object and its parts are published to the Web with URIs
• Links indicate relationships but cannot show boundaries and true structure in a machine context


ORE: Publishing compound objects to the Web (3)

• This time … the added layer publishes the compound object and its parts, with relationships and boundary, as a ‘named graph’
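
The named-graph idea can be sketched with plain triples: a resource map (itself identified by a URI) describes an aggregation, and the aggregation lists its parts, so a machine can read off the boundary of the compound object. The `ore:describes` and `ore:aggregates` terms come from the ORE vocabulary; the helper functions and the example.org URIs are hypothetical, and the sketch uses tuples rather than an RDF library.

```python
ORE = "http://www.openarchives.org/ore/terms/"


def resource_map(rem_uri, aggregation_uri, part_uris):
    """Return the triples of a resource map describing one aggregation."""
    triples = [(rem_uri, ORE + "describes", aggregation_uri)]
    for part in part_uris:
        triples.append((aggregation_uri, ORE + "aggregates", part))
    return triples


def boundary(triples, aggregation_uri):
    """The parts that belong to the compound object, read from the graph."""
    return {o for s, p, o in triples
            if s == aggregation_uri and p == ORE + "aggregates"}


# Hypothetical URIs for an article with a dataset and a video.
rem = "http://example.org/rem/paper-1"
agg = "http://example.org/aggregation/paper-1"
parts = [
    "http://example.org/paper-1/article.pdf",
    "http://example.org/paper-1/dataset.csv",
    "http://example.org/paper-1/video.mp4",
]

graph = resource_map(rem, agg, parts)
for triple in graph:
    print(triple)
print(sorted(boundary(graph, agg)))
```

Ordinary hyperlinks between the three files could not say which of them belong together; the explicit `aggregates` statements inside the named graph are what carry that boundary.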


Summary

• A repository adds management services to the basic architectural model
  – ingest, dissemination
  – management
  – preservation