setting up a clarin centre · 22-01-2020  · persistent identifiers (pids) 11. persistent...

36
Setting up a CLARIN centre Dieter Van Uytvanck CLARIN workshop for newcomers 22 January 2020 Utrecht

Upload: others

Post on 22-Jul-2020

14 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Setting up a CLARIN centre

Dieter Van Uytvanck

CLARIN workshop for newcomers

22 January 2020

Utrecht

Page 2: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Overview

• The CLARIN technical infrastructure from a bird’s eye• Infrastructure pillars:

- Repositories- Metadata & VLO- Tools & LR Switchboard- Federated Login- Federated Content Search

• CLARIN centres:- Types of centres- B-centre assessment

• Requirements• Core Trust Seal• Procedure

2

Page 3: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

The CLARIN technical infrastructure from a bird’s

eye view

3

Page 4: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Architecture: Repositories

4

Repository at a CLARIN centre

Language Data Metadata Language

Tools

describes

single text or recording

!corpus

!lexicon

!wordnet

!grammar

!…

web application !

web service !

web service pipeline

!stand-alone application

!…

Page 5: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Architecture: Harvesting

5

Language Data Metadata Language

ToolsLanguage

Data Metadata Language Tools

Harvested Metadata

Language Data Metadata Language

ToolsLanguage

Data Metadata Language Tools

copy

Page 6: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Architecture: Processing

6

Page 7: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Architecture: Federated Content Search

7

Language Data Metadata Language

ToolsLanguage

Data Metadata Language Tools

(Federated) Content Search

(1) enter query

(4) show aggregated results

Language Data Metadata Language

ToolsLanguage

Data Metadata Language Tools

(2) perform local search

(3) retrieve results

Page 8: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Summing up

• Basic principles: - compatibility on protocol and format level- acknowledging the unique situation of each centre (history,

organization, technology, etc.)- focus on the strengths that a centre can contribute

• No obligation to use specific software stacks- Of course it can save resources to reuse existing solutions

• Striking the right balance between do-it-yourself and reuse is one of the most important steps in the process of becoming a CLARIN centre

8

Page 9: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Infrastructure pillars

9

Page 10: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Persistent Identifiers (PIDs)

10

Page 11: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Persistent Identifiers (PIDs)

11

Page 12: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Persistent Identifiers (PIDs)

12

Page 13: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Component Metadata (CMDI)

13

Page 14: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

14

Page 15: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

15

Interview Profile

ActorComponent

• First name: text • Last name: text • Birth Date: date • Role: interviewer |

interviewee

Sound recording Component

• Format: wave | mp3 • Length: number

General information Component

• Title: text • Creation Date: date

Concept Registry

definition of: • Title • Creation Date • Format • Length • First name • Last name • Birth date • Role • wave • mp3 • interviewer • interviewee • …

Page 16: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

CMDI: Basic ideas behind it

• Allowing for the flexibility needed: many different subcommunities have their own wishes to provide detailed metadata descriptions

• Stimulating re-use: most providers should be able to re-use an existing profile

• Provide a standard way to - Refer to

• digital objects (in the fixed header)• landing pages• search pages• search services

- Express hierarchies in metadata files

16

Page 17: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

CMDI: What we currently have in place

• 179 profiles, 1252 components at https://clarin.eu/componentregistry

• Over 1900 concepts registered at https://clarin.eu/ccr• Over 22 CLARIN centres providing native CMDI metadata• Important conversion workflows in place:

- Europeana Data Model (121,000 records)- OLAC & Dublin Core- MODS- TEI headers- under investigation & in preparation:

• DDI (social sciences)• DataCite metadata

• Catalogue of harvested metadata: https://vlo.clarin.eu

17

Page 18: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

CMDI: Must reads & myths

• CMDI best practice guide: https://www.clarin.eu/content/cmdi-best-practices-guide

• Metadata in CLARIN – the FAQ: https://www.clarin.eu/faq-page/267

• Myth: CMDI must be used as backend format in my repository.- Incorrect! The only requirement is to deliver it via OAI-PMH.

• Myth: CMDI records must be created manually with an editor- Incorrect!

• A simple and automatic conversion from your existing metadata format or database can be sufficient.

• For large amounts of metadata, manual editors are inconvenient.

18

Page 19: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Repositories

• https://www.clarin.eu/content/repositories• Off-the-shelf:

- LINDAT-Dspace- Dataverse (for a C-centre, not fully B-centre ready)

• But it is open source…

• Half-products- META-share repository + CLARIN-specific adjustments- Islandora (Fedora Commons + Drupal)

• As used at HZSK and FLAT (TLA, Meertens-HuC)

• From scratch- Based on Fedora Commons- Based on your existing home-built repository

19

Page 20: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Repositories: learn more

• https://www.clarin.eu/event/2017/clarin-plus-workshop-facilitating-creation-national-consortia-repositories

• https://www.clarin.eu/event/2016/clarin-workshop-dspace-digital-repository

• Lindat-DSpace tutorial: - https://www.youtube.com/playlist?list=PLlKmS5dTMgw3lJ4Tff

nBJOhprLd_-20jd

20

Page 21: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Tools & LR Switchboard

21

Language Data

Language Tool

Language Data

Language Tool

Language Data

Language Tool

Language Resource Switchboard

Language Data

Language Tool

Page 22: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Tools & LR Switchboard

22

Page 23: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Tools & LR Switchboard

23

Language Tool

Language Resource Switchboard

Language Data

CLARIN

gateway applications:

Virtual Language Observatory

(Virtual Collection Registry, …)

EUDAT

B2DROP

[cloud storage]

Parthenos

D4Science

[Virtual Research Environment]

Page 24: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Tools & LR Switchboard: learn more

• Connecting a web application:- https://switchboard.clarin.eu/help

• Use cases- https://office.clarin.eu/v/CE-2018-1196-

language_resource_switchboard_use_cases.pdf

• Demonstration case:- https://www.clarin.eu/showcase/eosc-portal-demonstration

24

Page 25: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Federated Login

25

Page 26: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Federated Login

• Reading material:- Background (good starting point)

• History and functioning of the Service Provider Federation: https://office.clarin.eu/v/CE-2017-1014-CLARINPLUS-D2_7.pdf

- Overview (e.g. to get the paperwork running):• https://www.clarin.eu/spf

- Technical instructions on creating and integration of a Service Provider:• https://www.clarin.eu/content/creating-and-testing-shibboleth-

sp

26

Page 27: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Federated Content Search

27

Language Data Metadata Language

ToolsLanguage

Data Metadata Language Tools

(Federated) Content Search

(1) enter query

(4) show aggregated results

Language Data Metadata Language

ToolsLanguage

Data Metadata Language Tools

(2) perform local search

(3) retrieve results

Page 28: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Federated Content Search

28

Page 29: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Federated Content Search

• Specifications and background information:- https://www.clarin.eu/content/federated-content-search-

clarin-fcs- https://trac.clarin.eu/wiki/FCS

• Endpoint libraries on the way:- Korp- Kontext- NoSketchEngine

• Possible extensions:- Some first notes on FCS for treebank searches:

• https://www.clarin.eu/blog/blog-post-jan-niestadt-mini-workshop-korp-strix-and-blacklab-gothenburg

- Some first ideas about extending FCS to lexical searches

29

Page 30: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

CLARIN centres

30

Page 31: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Centres: an introduction

• B-centres (Service Providing Centres aka Certified Centres)

• C-centres (Metadata Providing Centres, their metadata are integrated with CLARIN but they need not to offer any further services)

• K-centres (Knowledge Centres, part of the CLARIN Knowledge Sharing Infrastructure)

• E-centres (External Centres offering central services without being part of any national consortium)

31

Page 32: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

B-centres

• Assessment procedure description:- https://www.clarin.eu/node/3767

32

Page 33: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Centre registry

• Accessing information- https://centres.clarin.eu/

• Adding information:- https://www.clarin.eu/content/clarin-centres > Register a new

centre- Adding an entry requires a CLARIN account

• https://user.clarin.eu

33

Page 34: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Checklist

34

Page 35: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Misc links, developer resources

• clarin.eu/dev- Trac, especially Infrastructure Overview

• Slack: request access via [email protected]• Mailing lists: all-centers• Newsflash• Matomo• https://www.clarin.eu/applications

35

Page 36: Setting up a CLARIN centre · 22-01-2020  · Persistent Identifiers (PIDs) 11. Persistent Identifiers (PIDs) 12. Component Metadata (CMDI) 13. 14. 15 Interview Profile Actor Component

Thank you for your attention!

More information:• www.clarin.eu

Feel free to contact • our support addresses at

https://www.clarin.eu/content/support• me via [email protected]

36