www.eudat.eu
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065
Building new knowledge from distributed scientific corpus. HERBADROP & EUROPEANA: two concrete case studies for exploring big archival data
2nd Computational Archival Science (CAS) workshop, Boston, USA, December 2017
Pascal Dugénie, Daan Broeder, Nuno Freire

Upload: nuno-freire

Post on 21-Jan-2018


Page 1: Building new knowledge from distributed scientific corpus: HERBADROP & EUROPEANA, two concrete case studies for exploring big archival data

www.eudat.eu
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065

Building new knowledge from distributed scientific corpus

HERBADROP & EUROPEANA: two concrete case studies for exploring big archival data

2nd Computational Archival Science (CAS) workshop
Boston, USA, December 2017

Pascal Dugénie, Daan Broeder, Nuno Freire

Page 2

Digital Infrastructures for Research

Opportunities for preserving valuable scientific heritage

[Diagram. Recoverable labels: massively distributed collections (natural heritage, cultural heritage); Collaborative Data Infrastructure (CDI); Trusted Digital Repositories (TDR): ISO 16363, ISO 14721 (OAIS); high-speed network infrastructures; HPC infrastructures; big data analysis tools; LONG-TERM PRESERVATION: monitoring, data storage, persistent ID, metadata, data curation and policies; sharing distributed corpora; extraction of text in images; knowledge building; visibility of data.]

Page 3

Page 4

EUDAT: A truly pan-European Infrastructure

EUDAT offers common data services to both research communities and individuals through a large network of European organisations.

EUDAT wants to enable European researchers from any discipline to preserve, find, access, and process data in a trusted environment, as part of a Collaborative Data Infrastructure.

European infrastructures

Technology Providers

Research Communities

Page 5

B2 Service Suite

https://www.eudat.eu/services

Covering both access and deposit, from informal data sharing to long-term archiving, and addressing identification, discoverability and computability of both long-tail and big data, EUDAT services seek to address the full lifecycle of research data.

Page 6

Building solutions with the communities

• Common Language Resources and Technology Infrastructure (CLARIN)
• European Network for Earth System Modelling (ENES)
• Distributed infrastructure for life-science information (ELIXIR)
• European Plate Observing System (EPOS) - Solid Earth sciences Research Infrastructure
• Integrated Carbon Observation System (ICOS) to quantify & understand greenhouse gas balance
• Long-Term Ecosystem Research (LTER) in Europe

EUDAT services are designed, built and implemented together with user communities.

Page 7

Page 8

Challenges and problems to be solved

Digitized images
• physical copies are fragile
• digital copies must be preserved

Exploitation of digital copies
• metadata description and classification are complex
• images contain a lot of information that should be extracted and made available

Page 9

Herbadrop rationale

• Millions of specimens in herbaria all over the world

• Global trend towards industrial-scale digitizing

• Data difficult to handle even for medium-size institutes

• Same challenges being faced by hundreds of herbaria in Europe

• Makes sense to work together to develop a solution

tiff: 180 MB, zip: 80 MB, jpg: 1 MB; Total: 161 MB

Page 10

Herbadrop in Europe

[Map of participating institutions. Recoverable label: MEISE, BE]

Page 11

Herbadrop objectives

Current scope
1. PRESERVATION: long-term preservation of herbarium specimen images
2. INFORMATION EXTRACTION: extracting information from images by using Optical Character Recognition (OCR) and basic image analysis techniques

Perspectives
3. KNOWLEDGE BUILDING: deep learning using OCR results, with access for the whole community for crowdsourcing

Page 12

HERBADROP/EUDAT Workflows

Stages: TRANSFER, STORAGE, OCR, ARCHIVING, ACCESS, MONITORING, CERTIFICATION

Steps shown in the workflow diagram:
• Transferring images using the B2SAFE service
• Controlling data quality (file format and integrity)
• Performing OCR analysis using HPC
• Ingesting OCR results in a full-text indexing engine
• Ingesting images and metadata for long-term archiving
• Surveying bit-stream integrity and data quality
• Harvesting and indexing metadata
• Offering open access to the full-text engine, images and metadata
• Monitoring data and processes status
• Producing regular statistical reports
• Implementing a DSA-based certification including appropriate SLA
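The integrity steps of the workflow (controlling file integrity and surveying bit-stream integrity) are commonly implemented with checksum manifests. The following is only a minimal sketch of that general idea, not the HERBADROP or B2SAFE implementation: the function names, the choice of SHA-256 and the file name `specimen_0001.jpg` are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Record a checksum for every regular file directly inside `folder`."""
    return {p.name: sha256_of(p) for p in sorted(folder.iterdir()) if p.is_file()}


def verify(folder: Path, manifest: dict[str, str]) -> list[str]:
    """Return the names of files that are missing or whose checksum changed."""
    problems = []
    for name, expected in manifest.items():
        path = folder / name
        if not path.is_file() or sha256_of(path) != expected:
            problems.append(name)
    return problems


# Usage with a throwaway directory and a hypothetical specimen file.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    image = folder / "specimen_0001.jpg"
    image.write_bytes(b"image bytes")
    manifest = build_manifest(folder)      # taken at ingest time
    print(verify(folder, manifest))        # [] while nothing has changed
    image.write_bytes(b"corrupted")        # simulate bit-level damage
    print(verify(folder, manifest))        # ['specimen_0001.jpg']
```

Running such a check on a schedule and counting the reported problems would be one simple input for the regular statistical reports mentioned in the workflow.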

Page 13

Page 14

Europeana: European Cultural Heritage on the Web

The main goal of Europeana is to provide access to cultural heritage and encourage people to engage with culture.

• And the main access point is the Web!

• Promoting the research use of heritage data resources is in its early stages of development

Perspectives on using Schema.org for publishing and harvesting metadata at Europeana (CC BY-SA)

Page 15

The Challenges (1/2)

The Generic Challenge

How to facilitate the re-use of Cultural Heritage language resources for research purposes … by exploiting the existing and emerging European research infrastructures

• How can the resources be discovered?
• How can the resources be shared in practical ways for researchers?
• How can advanced computation be applied to these Cultural Heritage datasets?
• How can the resources and datasets be cited and referenced in research?
• How can the Cultural Heritage institutions re-use the outcomes of research?

Page 16

The Challenges (2/2)

The Specific Challenges of the Pilot

• Identifying requirements for technical interoperability between the two infrastructures
• Creating best practice guidelines for the publication and citation of cultural heritage data
• Facilitating collaborative work between researchers, with a focus on:
  • Humanities
  • Social Sciences
  • Computer science

Page 17

Europeana Newspapers Corpus

The pilot aims to expose the full text aggregated in the Europeana Newspapers project.

This corpus contains over 11 million pages of full text of historic newspapers
• Mainly from the 19th century
• Aggregated from national and research libraries across Europe.

The pilot aims to expose and improve the text for more data-driven usage

…based on EUDAT Data services…

Page 18

EUDAT service uptake

The Europeana Newspapers Pilot relies on the following EUDAT services:

• Research data storage and sharing (B2SHARE): to undertake the enrichment of the datasets and, more generally, to expose them for re-use by other academics, particularly those outside the digital humanities

• Persistent Identification Service (B2HANDLE): persistent identification of the main objects of the full-text corpus: the newspaper titles and individual issues

• Multi-disciplinary joint metadata catalogue (B2FIND): so that scientists will be able to
  • obtain the full corpus for machine processing
  • select just a portion of the corpus, benefitting from the enrichment of article-level annotations with named entities and topics
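To make the last two points more concrete, the sketch below shows how a researcher might select a portion of a corpus by named entity and turn the persistent identifier of each issue into a resolvable link. It is an illustration under assumptions, not the pilot's data model or any EUDAT API: the `Article` record, its fields and the example handles (`12345/...`) are hypothetical. The only external fact relied on is that handles of the form prefix/suffix can be resolved through the public proxy at https://hdl.handle.net/.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """Hypothetical article-level record; field names are illustrative only."""
    issue_handle: str                      # PID of the newspaper issue, "prefix/suffix"
    text: str
    entities: set[str] = field(default_factory=set)


def resolver_url(handle: str) -> str:
    """Build a link to the public Handle proxy for a prefix/suffix handle."""
    prefix, sep, suffix = handle.partition("/")
    if not (prefix and sep and suffix):
        raise ValueError(f"not a prefix/suffix handle: {handle!r}")
    return f"https://hdl.handle.net/{handle}"


def select_by_entity(articles: list[Article], entity: str) -> list[Article]:
    """Keep only the articles annotated with the given named entity (case-insensitive)."""
    wanted = entity.casefold()
    return [a for a in articles if wanted in {e.casefold() for e in a.entities}]


# Usage with made-up records and placeholder handles.
corpus = [
    Article("12345/issue-0001", "News from Lisbon ...", {"Lisbon"}),
    Article("12345/issue-0002", "Market prices ...", {"Amsterdam"}),
]
for article in select_by_entity(corpus, "lisbon"):
    print(resolver_url(article.issue_handle))
```

Citing the resolver link of an issue, rather than a repository-specific URL, is what lets a selected subset remain referenceable if the data moves.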

Page 19

Conclusions & Perspectives

Page 20

Conclusions

• General conclusions:
  • A successful application of the EUDAT services was achieved
  • Heritage research data brought new requirements to EUDAT

• HERBADROP:
  • Application of EUDAT's computational capabilities is identifying new challenges:
    • How to address poor-quality OCR
    • The amount of data is large and may become a limitation for accurate and exhaustive analysis

• EUROPEANA:
  • Learned about the requirements of research usage
  • Some may have an impact on its data providers
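On the question of poor-quality OCR, one rough and widely used first step is to flag pages whose recognised words rarely appear in a reference word list, so that they can be re-processed or sent to manual review. The sketch below illustrates that heuristic only; it is not the method used in HERBADROP, and the tiny lexicon and the function name are assumptions made for the example.

```python
import re

# Runs of letters only; digits, underscores and punctuation split tokens.
TOKEN = re.compile(r"[^\W\d_]+")


def lexicon_hit_rate(text: str, lexicon: set[str]) -> float:
    """Fraction of alphabetic tokens found in the lexicon (0.0 if there are none)."""
    tokens = [t.casefold() for t in TOKEN.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in lexicon)
    return hits / len(tokens)


# Usage with a deliberately tiny, hypothetical lexicon.
lexicon = {"herbarium", "specimen", "collected", "in"}
clean = "Herbarium specimen collected in 1887"
noisy = "Herbar1um spec!men c0llected in 1887"
print(lexicon_hit_rate(clean, lexicon))   # every word is known
print(lexicon_hit_rate(noisy, lexicon))   # most fragments are unknown
```

A real lexicon would need to cover the languages, botanical Latin and place names found on labels, otherwise correct but rare words would be counted as errors.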

Page 21

HERBADROP and EUROPEANA: Some perspectives for data services

Improving discoverability of heritage research data resources
• Full-text based
• Metadata based

Additional heritage-specific metadata support in EUDAT
• Data formats support, and semantics
• Semantic annotations

Computational processing for heritage use cases:
• OCR
• Image analysis tools
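Full-text based discoverability typically rests on an inverted index that maps each word to the documents containing it. The sketch below is a toy version meant only to show the principle; it is not the indexing engine used in either pilot, and the document identifiers and sample texts are invented.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.casefold())


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map every token to the set of document ids in which it occurs."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the ids of documents containing all query terms (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Usage with invented OCR snippets.
docs = {
    "sheet-1": "Quercus robur collected near Meise",
    "sheet-2": "Quercus ilex collected in Portugal",
}
index = build_index(docs)
print(sorted(search(index, "quercus collected")))  # ['sheet-1', 'sheet-2']
print(sorted(search(index, "quercus meise")))      # ['sheet-1']
```

Metadata-based discovery can reuse the same structure by indexing field values instead of free text.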

Page 22

For additional information

http://www.eudat.eu/

Nuno Freire, Europeana DSI/INESC-ID

[email protected]

http://www.europeana.eu/