how can repositories support the text-mining of their content and why?
TRANSCRIPT
![Page 1: How can repositories support the text-mining of their content and why?](https://reader036.vdocuments.site/reader036/viewer/2022081520/5882c81c1a28abb2478b73a3/html5/thumbnails/1.jpg)
How can repositories support the text-mining of their
content and why?
@openminted_eu
Dr. Petr Knoth and Dr. Nancy PontikaKnowledge Media institute, The Open UniversityUnited KingdomTwitter: @oacore
![Page 2: How can repositories support the text-mining of their content and why?](https://reader036.vdocuments.site/reader036/viewer/2022081520/5882c81c1a28abb2478b73a3/html5/thumbnails/2.jpg)
Why should repositories support TDM?
@openminted_eu
![Page 3: How can repositories support the text-mining of their content and why?](https://reader036.vdocuments.site/reader036/viewer/2022081520/5882c81c1a28abb2478b73a3/html5/thumbnails/3.jpg)
@openminted_eu
In the UK
![Page 4: How can repositories support the text-mining of their content and why?](https://reader036.vdocuments.site/reader036/viewer/2022081520/5882c81c1a28abb2478b73a3/html5/thumbnails/4.jpg)
Repositories and TDM
@openminted_eu
Institutional Repositories
Subject Repositories
Publishers/ OA journals
Other sources:
Research Networking
ServicesPrimary Research
Data...
Text Mining Services
![Page 5: How can repositories support the text-mining of their content and why?](https://reader036.vdocuments.site/reader036/viewer/2022081520/5882c81c1a28abb2478b73a3/html5/thumbnails/5.jpg)
![Page 6: How can repositories support the text-mining of their content and why?](https://reader036.vdocuments.site/reader036/viewer/2022081520/5882c81c1a28abb2478b73a3/html5/thumbnails/6.jpg)
TDM & Repositories Managers
@openminted_eu
• Established and maintain a close collaboration with researchers
• Extensive experience in advocacy, i.e. open ac-cess
• Knowledgeable about the repository’s collection• Participate in the Academic Institution’s Research
Committees • Knowledgeable of your repository’s collection• Familiarity with Copyright issues and Creative
Commons Licenses
![Page 7: How can repositories support the text-mining of their content and why?](https://reader036.vdocuments.site/reader036/viewer/2022081520/5882c81c1a28abb2478b73a3/html5/thumbnails/7.jpg)
How can repositories support TDM?TDM is all about processing text and data
at scale. The role of repositories is to facilitate the aggregation of research
papers at a full-text level (and beyond) effectively enabling TDM services to operate seamlessly on all available
research content.
7
![Page 8: How can repositories support the text-mining of their content and why?](https://reader036.vdocuments.site/reader036/viewer/2022081520/5882c81c1a28abb2478b73a3/html5/thumbnails/8.jpg)
What is the problem?
@openminted_eu
• A small study (Knoth, 2013)• 83 repositories - mainly Eprints with PDF
research outputs• 1,461,016 metadata records
metadata linked to content
content downloadabl
e
content machine readable
Mean 54.1% 34.4% 27.6%Median 39.5% 16.7% 13.0%Standard deviation
39.2% 34.2% 31.0%
![Page 9: How can repositories support the text-mining of their content and why?](https://reader036.vdocuments.site/reader036/viewer/2022081520/5882c81c1a28abb2478b73a3/html5/thumbnails/9.jpg)
How is content aggregated today?
@openminted_eu
• DC over OAI-PMH: vast majority of repositories, never intended to support content harvesting. The main problem: linking metadata with content.
“The nature of a resource identifier is outside the scope of the OAI-PMH. To facilitate access to the resource associated with harvested metadata, repositories should use an element in metadata records to establish a linkage between the record (and the identifier of its item) and the identifier (URL, URN, DOI, etc.) of the associated resource. The mandatory Dublin Core format provides the identifier element that should be used for this purpose.”
![Page 10: How can repositories support the text-mining of their content and why?](https://reader036.vdocuments.site/reader036/viewer/2022081520/5882c81c1a28abb2478b73a3/html5/thumbnails/10.jpg)
How is content aggregated today?
@openminted_eu
• RIOXX: Just one identifier, recommends the identifier points to the actual resource being described.
• OpenAIRE Guidelines: identifier links to either the resource or a jump-off page. Does allow multiple identifiers.
• ResourceSync
• CrossRef: comercial publishers/journals
![Page 11: How can repositories support the text-mining of their content and why?](https://reader036.vdocuments.site/reader036/viewer/2022081520/5882c81c1a28abb2478b73a3/html5/thumbnails/11.jpg)
The content referencing problem
@openminted_eu
![Page 12: How can repositories support the text-mining of their content and why?](https://reader036.vdocuments.site/reader036/viewer/2022081520/5882c81c1a28abb2478b73a3/html5/thumbnails/12.jpg)
Principle 1: content referencingRepositories should always establish a link from the metadata record to the item the metadata record describes using a dereferencable identifier pointing to the version held locally in the repository. The dereferencable identifier should be provided in the appropriate metadata element in the used metadata format (i.e. dc:identifier in the case of Dublin Core). If multiple identifiers are used, it is recommended listing the local dereferencable identifier first.
12
![Page 13: How can repositories support the text-mining of their content and why?](https://reader036.vdocuments.site/reader036/viewer/2022081520/5882c81c1a28abb2478b73a3/html5/thumbnails/13.jpg)
The accessibility of repositories to harvesting systems
@openminted_eu
![Page 14: How can repositories support the text-mining of their content and why?](https://reader036.vdocuments.site/reader036/viewer/2022081520/5882c81c1a28abb2478b73a3/html5/thumbnails/14.jpg)
Principle 2: Content accessibility to machinesRepositories must provide universal access to machines with the same level of access as humans have. It is the role of repositories to allow aggregators to harvest the entire content of the repository in a reasonable time to enable acquiring and maintain up-to-date information about the repository content.
14
![Page 15: How can repositories support the text-mining of their content and why?](https://reader036.vdocuments.site/reader036/viewer/2022081520/5882c81c1a28abb2478b73a3/html5/thumbnails/15.jpg)
What can repositories do?
@openminted_eu
• Ensure correct referencing of content from metadata:• Dereferencable link which resolves to content• Locally held (content under its control)• Using a standard repository platform can help
• Check robots.txt• Register your repository• Advocate for good pdf (media) quality of deposited
content• Use monitoring tools
• CORE Repository Dashboard• OpenAIRE Repository Manager Dashboard
• Machine readable licensing
![Page 16: How can repositories support the text-mining of their content and why?](https://reader036.vdocuments.site/reader036/viewer/2022081520/5882c81c1a28abb2478b73a3/html5/thumbnails/16.jpg)
beyond Open Access
MAKING SENSE OF LARGE VOLUMES OF SCIENTIFIC CONTENT
16
![Page 17: How can repositories support the text-mining of their content and why?](https://reader036.vdocuments.site/reader036/viewer/2022081520/5882c81c1a28abb2478b73a3/html5/thumbnails/17.jpg)
Interested in how to TDM research papers?
@openminted_eu
We have 3 moretalks tomorrow!
Developer track 1, 11:00Mining Open Access publications with CORE
![Page 18: How can repositories support the text-mining of their content and why?](https://reader036.vdocuments.site/reader036/viewer/2022081520/5882c81c1a28abb2478b73a3/html5/thumbnails/18.jpg)
Interested in how to TDM research papers?
@openminted_eu
We have 3 moretalks tomorrow!
Developer track 1, 11:20Oxford vs Cambridge Contest: Collecting Open Research Evaluation Metrics for University Ranking
![Page 19: How can repositories support the text-mining of their content and why?](https://reader036.vdocuments.site/reader036/viewer/2022081520/5882c81c1a28abb2478b73a3/html5/thumbnails/19.jpg)
Interested in how to TDM research papers?
@openminted_eu
We have 3 moretalks tomorrow!Papers 4, 4:00Exploring Semantometrics: full text-based research evaluation for open repositories
![Page 20: How can repositories support the text-mining of their content and why?](https://reader036.vdocuments.site/reader036/viewer/2022081520/5882c81c1a28abb2478b73a3/html5/thumbnails/20.jpg)
Thank youDr. Pert Knoth,, Research [email protected]. Nancy Pontika, Open Access Aggregation [email protected]
.
20