dspace oai-pmh
Harvesting Statistical Metadata from an Online Repository for Data Analysis and Visualization
Outline
Goal and Motivation
Theseus.fi
Dspace
Getting Data out from Dspace
Dspace OAI-PMH as a Data provider for Theseus
Request Types (Verbs)
Flow Control
Harvesting Data from Theseus’s Data provider
Project Result
Final thoughts
Goal
Harvest metadata of thesis documents from Theseus
author name, title, keywords, submission year....
Store the harvested data in a separate MySQL database.
Build a Web portal out of this stored data
Goal and Motivation
Why conduct this project?
Thesis data analysis and visualization of overall statistical facts.
Compare thesis documents
Compare universities and departments
Analyse trending keywords used by students every year
Theseus.fi
Digital libraries are now commonly used by academic institutions worldwide.
Theseus provides online access to theses and publications from Finnish universities of applied sciences.
End users can search, browse and upload thesis documents to Theseus.
...
Theseus also has an API that third-party organizations can use to work with thesis data.
Theseus is powered by Dspace, a pioneering open source digital asset management system.
Functionalities and features of Theseus are inherited from Dspace.
Dspace
Dspace is an open source software platform that provides stable, long-term storage, commonly for digital intellectual materials.
Many academic institutions worldwide use Dspace to offer their users easy access to their digital resources.
Dspace can be freely downloaded and used or even modified to store digital materials.
Abbreviations
OAI: Open Archives Initiative
PMH: Protocol for Metadata Harvesting
Getting Data out from Dspace
OAI-PMH is an HTTP-based protocol that defines methods for sharing, publishing and archiving metadata from Dspace repositories.
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is used to programmatically access data from Dspace.
Dspace OAI-PMH as a Data provider for Theseus
Dspace repositories have an 'OAI Base URL' in addition to the URL for human users.
OAI Base URL : http://publications.theseus.fi/oai/request?
URL for human users : https://www.theseus.fi/
The OAI Base URL is used for machine-to-machine communication between the data provider and data harvesters.
When a harvesting request is made using the OAI Base URL, Theseus’s data provider returns XML-formatted metadata of thesis documents.
…
Theseus OAI-PMH exposes thesis documents in twelve unique metadata formats.
KansalliKirjasto format:
<kk:field schema="dc" element="contributor" qualifier="author" language="none" value=" Denut, Nicolae "/>
OAI Dublin Core format : <dc:creator> Denut, Nicolae </dc:creator>
Any of these metadata formats can be requested when querying Theseus’s data provider.
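As a rough illustration of how the two formats above differ for a harvester, the following Python sketch extracts the author name from each sample line. The wrapping `<record>` element is made up for the example, and the `kk` namespace URI is a placeholder assumption (the real one is not given here); only the Dublin Core namespace is the standard one.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # standard Dublin Core element namespace
KK_NS = "urn:example:kk"                    # placeholder assumption, NOT the real kk namespace

# Sample fragments built from the two lines shown above; <record> is a made-up wrapper.
dc_sample = f'<record xmlns:dc="{DC_NS}"><dc:creator> Denut, Nicolae </dc:creator></record>'
kk_sample = (
    f'<record xmlns:kk="{KK_NS}">'
    '<kk:field schema="dc" element="contributor" qualifier="author" '
    'language="none" value=" Denut, Nicolae "/>'
    "</record>"
)

def authors_from_dc(xml_text):
    """Dublin Core: the author is the text content of dc:creator."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(f"{{{DC_NS}}}creator") if el.text]

def authors_from_kk(xml_text):
    """kk format: the author is the 'value' attribute of a contributor/author field."""
    root = ET.fromstring(xml_text)
    return [
        el.get("value", "").strip()
        for el in root.iter(f"{{{KK_NS}}}field")
        if el.get("element") == "contributor" and el.get("qualifier") == "author"
    ]

print(authors_from_dc(dc_sample))
print(authors_from_kk(kk_sample))
```

Both calls return the same author string; the difference is only where each format stores it (element text versus attribute).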
Request Types (Verbs)
OAI-PMH defines six request types (verbs) that can be appended to an OAI base URL to access different repository contents.
Theseus implements all six request types to provide thesis metadata to harvesters.
1. Identify: fetches information about the Theseus data provider itself
2. ListMetadataFormats: returns the list of metadata formats supported by the Theseus data provider
3. ListIdentifiers: lists thesis record identifiers
…
4. ListSets: retrieves the set structure (list of universities and departments)
5. ListRecords: gets the complete metadata of a list of thesis documents from Theseus
6. GetRecord: retrieves the metadata of an individual thesis document
By attaching any one of these request types to Theseus’s OAI base URL, a request URL can be formed.
OAI Base URL + Request type => Request URL
http://publications.theseus.fi/oai/request?verb=ListSets
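A minimal Python sketch of forming request URLs this way. The helper name `build_request_url` is hypothetical; `oai_dc` is the standard OAI-PMH prefix for the Dublin Core format, and the base URL is the one given above.

```python
from urllib.parse import urlencode

OAI_BASE_URL = "http://publications.theseus.fi/oai/request"

def build_request_url(verb, **arguments):
    """Append the verb and any extra OAI-PMH arguments to the base URL as a query string."""
    params = {"verb": verb}
    params.update(arguments)
    return OAI_BASE_URL + "?" + urlencode(params)

# Verb without further arguments
print(build_request_url("ListSets"))
# → http://publications.theseus.fi/oai/request?verb=ListSets

# Verb that also needs a metadata format
print(build_request_url("ListRecords", metadataPrefix="oai_dc"))
```

Verbs such as ListRecords, ListIdentifiers and GetRecord take additional arguments (for example `metadataPrefix`), which is why the helper accepts arbitrary keyword arguments.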
Flow control
The three request types ListIdentifiers, ListSets and ListRecords return large lists from Theseus.
In such cases, it is practical to partition them among a series of requests and responses.
Resumption tokens are an OAI-PMH mechanism that lets a data provider split a long list response into parts: each incomplete response carries a token that the harvester sends back to request the next part.
Resumption token work flow
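The resumption token flow can be sketched in Python roughly as follows. This is an illustration under assumptions, not the project's harvester: the function names are hypothetical, and the two canned pages (with made-up identifiers and token) stand in for real responses so the loop can be tried without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 response namespace
OAI_BASE_URL = "http://publications.theseus.fi/oai/request"

def fetch_xml(params):
    """Perform one HTTP GET against the OAI base URL and return the raw XML."""
    with urlopen(OAI_BASE_URL + "?" + urlencode(params)) as response:
        return response.read()

def harvest_identifiers(fetch=fetch_xml, metadata_prefix="oai_dc"):
    """Yield record identifiers, following resumption tokens until the list is complete."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(params))
        for header in root.iter(OAI_NS + "header"):
            identifier = header.find(OAI_NS + "identifier")
            if identifier is not None and identifier.text:
                yield identifier.text.strip()
        token = root.find(f".//{OAI_NS}resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # missing or empty token: this was the last part
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListIdentifiers", "resumptionToken": token.text.strip()}

# Offline demonstration with two canned pages (identifiers and token are made up).
PAGES = {
    None: '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
          "<header><identifier>oai:example:1</identifier></header>"
          "<resumptionToken>page2</resumptionToken></ListIdentifiers></OAI-PMH>",
    "page2": '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
             "<header><identifier>oai:example:2</identifier></header>"
             "<resumptionToken/></ListIdentifiers></OAI-PMH>",
}

def fake_fetch(params):
    return PAGES[params.get("resumptionToken")]

print(list(harvest_identifiers(fetch=fake_fetch)))
```

Passing the fetch function as a parameter keeps the token-following loop separate from the HTTP call, so the same loop can be run against canned pages or the live data provider.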
Harvesting Data from Theseus’s Data provider
Simple HTML DOM Parser is an open source parser library written in PHP that can read, modify, and return structured content from external data sources.
This parser library can create a Document Object Model by loading structured data from a URL.
To get nodes of the DOM object, this library provides a method called “find()”.
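The same load-then-find idea can be illustrated in Python with the standard library's ElementTree. This is an analogous sketch, not the Simple HTML DOM Parser API: `findall` here plays the role that “find()” plays in the PHP library, and the sample ListSets response (set identifiers and names) is made up.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 response namespace

# Made-up ListSets response; real setSpec values and names will differ.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>com_example_1</setSpec><setName>Example University</setName></set>
    <set><setSpec>col_example_2</setSpec><setName>Example Department</setName></set>
  </ListSets>
</OAI-PMH>"""

def parse_sets(xml_text):
    """Build a tree from the XML, then look up every <set> node and read its children."""
    root = ET.fromstring(xml_text)
    sets = []
    for node in root.findall(f".//{OAI_NS}set"):
        spec = node.findtext(OAI_NS + "setSpec", default="").strip()
        name = node.findtext(OAI_NS + "setName", default="").strip()
        sets.append((spec, name))
    return sets

for spec, name in parse_sets(SAMPLE):
    print(spec, "=>", name)
```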
Universities: identifier (setSpec), university name, ListSets request URLs, total number of papers
Departments: identifier (setSpec), department name, ListSets request URLs, total number of papers, university identifiers
Thesis documents: thesis identifier, author names, titles, GetRecord request URLs, department identifiers, university identifiers, keywords, subjects (official keywords), number of pages, year, language
Summary of gathered theses metadata
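One way to hold this metadata in relational tables is sketched below. It uses Python's built-in sqlite3 as a stand-in for the MySQL database, and every table and column name is an assumption made for illustration, not the project's actual schema.

```python
import sqlite3

# Hypothetical schema: one table per entity in the summary above.
SCHEMA = """
CREATE TABLE university (set_spec TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE department (
    set_spec        TEXT PRIMARY KEY,
    name            TEXT NOT NULL,
    university_spec TEXT NOT NULL REFERENCES university(set_spec)
);
CREATE TABLE thesis (
    identifier      TEXT PRIMARY KEY,
    title           TEXT,
    author          TEXT,
    year            INTEGER,
    language        TEXT,
    pages           INTEGER,
    department_spec TEXT REFERENCES department(set_spec)
);
"""

def create_store(path=":memory:"):
    """Open (or create) the database and set up the tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def save_thesis(conn, record):
    """Insert one harvested record; harvesting the same identifier again overwrites it."""
    conn.execute(
        "INSERT OR REPLACE INTO thesis VALUES "
        "(:identifier, :title, :author, :year, :language, :pages, :department_spec)",
        record,
    )

conn = create_store()
save_thesis(conn, {
    "identifier": "oai:example:1", "title": "Example thesis", "author": "Denut, Nicolae",
    "year": 2013, "language": "en", "pages": 40, "department_spec": None,
})
print(conn.execute("SELECT COUNT(*) FROM thesis").fetchone()[0])
```

Keying the thesis table on the OAI identifier means a repeated harvest updates rows instead of duplicating them.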
84,391
Project Result
• How many Thesis documents are in Theseus?
• Which school has what amount of papers in Theseus?
• How many papers is each school publishing every year?
• What departments are there in each school?
• How many papers belong to which department?
• How many pages does each paper have?
• In what language is the paper written?
• How many times has each paper been downloaded by Theseus visitors?
• What are the keywords of each thesis document?
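Once the metadata sits in a relational database, questions like these reduce to aggregate queries. The sketch below again uses sqlite3 in place of MySQL, with made-up table names and sample rows; it shows two of the questions (papers per school per year, and yearly use of one keyword).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified tables and sample rows for illustration only.
conn.executescript("""
CREATE TABLE thesis (identifier TEXT PRIMARY KEY, school TEXT, year INTEGER);
CREATE TABLE keyword (thesis_id TEXT, word TEXT);
""")
conn.executemany("INSERT INTO thesis VALUES (?, ?, ?)", [
    ("t1", "School A", 2012), ("t2", "School A", 2013), ("t3", "School B", 2013),
])
conn.executemany("INSERT INTO keyword VALUES (?, ?)", [
    ("t1", "android"), ("t2", "android"), ("t3", "android"), ("t3", "marketing"),
])

def papers_per_school_and_year(conn):
    """How many papers is each school publishing every year?"""
    return conn.execute(
        "SELECT school, year, COUNT(*) FROM thesis "
        "GROUP BY school, year ORDER BY school, year"
    ).fetchall()

def keyword_counts_by_year(conn, word):
    """How often is a given keyword used in each year?"""
    return conn.execute(
        "SELECT t.year, COUNT(*) FROM keyword k "
        "JOIN thesis t ON t.identifier = k.thesis_id "
        "WHERE k.word = ? GROUP BY t.year ORDER BY t.year",
        (word,),
    ).fetchall()

print(papers_per_school_and_year(conn))
print(keyword_counts_by_year(conn, "android"))
```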
The front page of the Web portal aims to give better insight into the contribution of each school to Theseus.
Web portal showing departments versus number of thesis documents in Metropolia UAS
Analysing Keywords is also easy
I want to analyse keywords => Fill out a form => See results
Keyword fetching form