Linguistic Linked Open Data (LLOD): Challenges, Approaches, Future Work

Sebastian Hellmann, TKE 2016

Sebastian Hellmann - AKSW/KILT Copenhagen TKE 2016

AKSW / KILT in Leipzig

Leipzig has become one of the largest Semantic Web centers

AKSW has 4 subgroups and 45 PhD students http://aksw.org/Team.html

Current position:

- Head of AKSW / KILT research group (8 PhD students)
- Knowledge Integration and Language Technology (KILT) http://aksw.org/Groups/KILT.html
- Project manager for 2 H2020 and 1 German research project (BMWi)
- http://freme-project.eu/ , http://aligned-project.eu/ , http://smartdataweb.de/
- Executive Director of the DBpedia Association http://dbpedia.org

Outline
● The vision behind Linked Data - a technological introduction
● Linguistic Linked Open Data
● Knowledge Modelling vs. Data Encoding
● LIDER
● Challenges and Approaches

Linked Data

Web of Data

WWW vs. GGG - https://en.wikipedia.org/wiki/Giant_Global_Graph

Data on the Web vs. the Web of Data vs. the Semantic Web

RDF - Entity Attribute Value - http://dbpedia.org/resource/Copenhagen

Three ways to publish RDF:

1. Linked Data: resource-level access via HTTP request (next slide)
2. SPARQL: query access via triplestore database
3. Dump: dataset-level access via bulk download
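
The three access paths can be mimicked on a toy in-memory triple set. This is a minimal sketch, assuming three hand-written triples about Copenhagen (illustration data, not fetched from DBpedia). The helper names `describe`, `match_pattern`, and `dump` are hypothetical and only stand in for an HTTP lookup, a single SPARQL triple pattern, and a bulk download.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (entity, attribute, value)

DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hand-written illustration data, not a copy of the live DBpedia record.
TRIPLES: List[Triple] = [
    (DBR + "Copenhagen", RDFS_LABEL, '"Copenhagen"@en'),
    (DBR + "Copenhagen", DBO + "country", DBR + "Denmark"),
    (DBR + "Denmark", RDFS_LABEL, '"Denmark"@en'),
]


def describe(uri: str) -> List[Triple]:
    """Resource-level access: everything stated about one URI."""
    return [t for t in TRIPLES if t[0] == uri]


def match_pattern(s: Optional[str], p: Optional[str], o: Optional[str]) -> List[Triple]:
    """Query access: one triple pattern, where None acts as a variable."""
    return [
        t for t in TRIPLES
        if all(want is None or want == got for want, got in zip((s, p, o), t))
    ]


def dump() -> str:
    """Dataset-level access: the whole set, one N-Triples-style line per triple."""
    def term(x: str) -> str:
        # Literals are stored already quoted; everything else is treated as a URI.
        return x if x.startswith('"') else f"<{x}>"
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in TRIPLES)


print(describe(DBR + "Copenhagen"))
print(match_pattern(None, DBO + "country", None))
print(dump())
```

The same data is served in all three cases; only the granularity of access differs.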

Linked Data

Four rules of https://www.w3.org/DesignIssues/LinkedData

1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
4. Include links to other URIs, so that they can discover more things.

https://en.wikipedia.org/wiki/Copenhagen vs. http://dbpedia.org/resource/Copenhagen
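
Rules 2 and 3 amount to an HTTP lookup that asks for data rather than a web page. The sketch below only prepares such a request with Python's standard library. It assumes the server performs content negotiation on the `Accept` header and offers Turtle (`text/turtle`); the function names are hypothetical, and `fetch_rdf` needs network access, so it is defined but not called here.

```python
import urllib.request


def build_rdf_request(uri: str, media_type: str = "text/turtle") -> urllib.request.Request:
    """Prepare (but do not send) an HTTP GET that asks for RDF instead of HTML."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


def fetch_rdf(uri: str, timeout: float = 10.0) -> str:
    """Send the request. Assumes network access and a UTF-8 encoded response."""
    with urllib.request.urlopen(build_rdf_request(uri), timeout=timeout) as response:
        return response.read().decode("utf-8")


req = build_rdf_request("http://dbpedia.org/resource/Copenhagen")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

The Wikipedia URL names a document about the city, whereas the DBpedia URI names the city itself and can return a machine-readable description when dereferenced.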

Source: https://www.w3.org/DesignIssues/LinkedData.html

Open Data != Open Data

Open Access vs Open License

Open Access means accessible like a web page (often unclear license)

http://opendefinition.org by OKFN:

“Knowledge is open if anyone is free to access, use, modify, and share it — subject, at most, to measures that preserve provenance and openness.”

http://lod-cloud.net/

How is the Linked Data Cloud built?

- Open Access as the basis
- 50 links between things required to receive a dataset link
- http://lov.okfn.org
- http://datahub.io
- Assessing Quantity and Quality of Links Between Linked Data Datasets by Ciro Baron Neto, Dimitris Kontokostas, Sebastian Hellmann, Kay Müller, and Martin Brümmer in LDOW 2016 http://events.linkeddata.org/ldow2016/papers/LDOW2016_paper_09.pdf
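
The 50-link rule can be read as a simple count over cross-dataset triples. The following is a rough sketch under stated assumptions: datasets are recognised by URI prefix, the namespace table and the generated `entry`/`Thing` URIs are made up for illustration, and this is not how the cloud diagram or the cited paper actually computes links.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

Triple = Tuple[str, str, str]
MIN_LINKS = 50  # threshold quoted on the slide
OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical namespace-to-dataset table.
NAMESPACES: Dict[str, str] = {
    "http://dbpedia.org/resource/": "dbpedia",
    "http://example.org/lexicon/": "example-lexicon",
}


def dataset_of(uri: str) -> Optional[str]:
    """Map a URI to a dataset name by prefix, or None if unknown."""
    for prefix, name in NAMESPACES.items():
        if uri.startswith(prefix):
            return name
    return None


def dataset_links(triples: Iterable[Triple]) -> Counter:
    """Count triples whose subject and object belong to different known datasets."""
    counts: Counter = Counter()
    for s, _p, o in triples:
        src, dst = dataset_of(s), dataset_of(o)
        if src and dst and src != dst:
            counts[(src, dst)] += 1
    return counts


def qualifying(counts: Counter, minimum: int = MIN_LINKS) -> Dict[Tuple[str, str], int]:
    """Keep only dataset pairs that reach the minimum number of links."""
    return {pair: n for pair, n in counts.items() if n >= minimum}


def make_links(n: int) -> list:
    """Generate n made-up sameAs links from the example lexicon to DBpedia-style URIs."""
    return [
        (f"http://example.org/lexicon/entry{i}", OWL_SAMEAS, f"http://dbpedia.org/resource/Thing{i}")
        for i in range(n)
    ]


print(qualifying(dataset_links(make_links(60))))
print(qualifying(dataset_links(make_links(49))))
```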

Linguistic Linked Open Data

Linguistic Linked Open Data
● Movement originated in the context of the Working Group for Open Data in Linguistics (OWLG) at Open Knowledge Foundation (OKFN)
● Open is supposed to mean Open license
● Join community mailing list at http://linguistics.okfn.org/
● Current information at http://linguistic-lod.org/ maintained by John McCrae -> Instructions on how to join the LLOD cloud

January 2011

February 2012

Linked Data in Linguistics. Representing Language Data and Metadata (http://www.springer.com/computer/ai/book/978-3-642-28248-5 ) Christian Chiarcos, Sebastian Nordhoff, and Sebastian Hellmann (Eds.). Springer, Heidelberg, (2012)

August 2012

Sept 2012: MLODE

Special Issue on Multilingual Linked Open Data (MLOD)
Editors: Sebastian Hellmann, Steven Moran, Martin Brümmer, and John McCrae, Semantic Web, vol. 6, no. 4, pp. 315-317, 2015

Jan 2013

Sep 2013

LIDER FP7 EU Project
Start: Nov 2013, Duration: 2 years
http://lider-project.eu/

May 2014

Nov 2014

May 2015

May 2016

Should we all use Linked Data?

When should we use linked data?

How should we use linked data?

When should we not use it?

Knowledge Modeling vs. Data Encoding

Entity Relationship Diagrams and UML

The Metadata Ecosystem of the DataId Ontology, Markus Freudenberg, submitted to MTSR Conf 2016

http://dataid.dbpedia.org

XML encoding variants

<same> should be symmetric, reflexive and transitive https://en.wikipedia.org/wiki/Equivalence_relation

Apples and oranges
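
What "symmetric, reflexive and transitive" demands can be checked mechanically. A minimal sketch, assuming the relation is given as a set of pairs over a small set of made-up items; the function names are hypothetical. The closure uses a union-find structure, which is one common way to compute the smallest equivalence relation containing the given pairs.

```python
from itertools import product
from typing import Dict, Hashable, Iterable, Set, Tuple

Pair = Tuple[Hashable, Hashable]


def check_equivalence(elements: Iterable[Hashable], pairs: Iterable[Pair]) -> Dict[str, bool]:
    """Report which of the three equivalence properties hold for the given pairs."""
    items, rel = set(elements), set(pairs)
    return {
        "reflexive": all((x, x) in rel for x in items),
        "symmetric": all((b, a) in rel for a, b in rel),
        "transitive": all(
            (a, d) in rel for (a, b), (c, d) in product(rel, rel) if b == c
        ),
    }


def equivalence_closure(elements: Iterable[Hashable], pairs: Iterable[Pair]) -> Set[Pair]:
    """Smallest equivalence relation containing the pairs (union-find based)."""
    parent = {x: x for x in elements}

    def find(x: Hashable) -> Hashable:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    return {(a, b) for a, b in product(list(parent), repeat=2) if find(a) == find(b)}


items = {"apple", "Apfel", "orange"}
stated = {("apple", "Apfel")}  # a single one-directional <same> statement
closed = equivalence_closure(items, stated)

print(check_equivalence(items, stated))
print(check_equivalence(items, closed))
print(("apple", "orange") in closed)
```

A lone `<same>` statement is neither reflexive nor symmetric on its own; a consumer has to apply the closure, and nothing in the XML itself says that it should.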

Who can you ask what XML tags and structure mean and what they are used for?

Internationalization Tag Set (ITS) 2.0
http://www.w3.org/TR/its20/

● W3C Recommendation since 29 October 2013
● Defines how to embed Machine Translation and Localisation annotations, so-called Data Categories, in (X)HTML and XML
● In addition to the human-readable document, two ontologies are referenced that capture the semantics of the standard.
● ITS Ontology as companion
● NLP Interchange Format (NIF) is the recommended format for RDF conversion of ITS 2.0 http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core

One of the most efficient and robust ways to annotate HTML in a standardized manner

NLP Interchange Format 2.0 (old example)

NIF 2.1 release pending
Join W3C Community Group: https://www.w3.org/community/ld4lt/

NIF is useful for:

● Adding semantics to NLP tool output and corpora
● Providing and publishing identifiers for text and annotations

NIF is compact and scalable (cf. http://wiki-link.nlp2rdf.org/ ):

● Google Wikilinks Corpus with 10.6 million webpages and 31.5 million Wikipedia links (about 3 per page) with a zipped size of 180 GB.
● 533 million triples (other formats 7-27% more)
● 79 GB (12 GB gzipped dumps) in Turtle format (original size 180 GB containing HTML markup)
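
How NIF gives text and annotations their own identifiers can be sketched with offset-based URIs. The example assumes the `#char=begin,end` fragment scheme and the term names `nif:Context`, `nif:RFC5147String`, `nif:isString`, `nif:referenceContext`, `nif:anchorOf`, `nif:beginIndex`, `nif:endIndex` and `itsrdf:taIdentRef` as recalled from NIF 2.0 and the ITS RDF vocabulary; they should be checked against the ontology before use. The document URI and the helper names are hypothetical, and literals are kept as plain strings for brevity.

```python
from typing import List, Tuple

NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
ITSRDF = "http://www.w3.org/2005/11/its/rdf#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

Triple = Tuple[str, str, str]


def char_uri(doc_uri: str, begin: int, end: int) -> str:
    """Identifier for the substring [begin, end) of the document text."""
    return f"{doc_uri}#char={begin},{end}"


def annotate(doc_uri: str, text: str, surface: str, entity_uri: str) -> List[Triple]:
    """Describe the first occurrence of `surface` in `text` as an entity mention."""
    begin = text.index(surface)  # raises ValueError if the surface form is absent
    end = begin + len(surface)
    context = char_uri(doc_uri, 0, len(text))
    span = char_uri(doc_uri, begin, end)
    return [
        (context, RDF_TYPE, NIF + "Context"),
        (context, NIF + "isString", text),
        (span, RDF_TYPE, NIF + "RFC5147String"),
        (span, NIF + "referenceContext", context),
        (span, NIF + "anchorOf", surface),
        (span, NIF + "beginIndex", str(begin)),
        (span, NIF + "endIndex", str(end)),
        (span, ITSRDF + "taIdentRef", entity_uri),
    ]


TEXT = "Copenhagen is the capital of Denmark."
triples = annotate(
    "http://example.org/doc1",  # hypothetical document URI
    TEXT,
    "Denmark",
    "http://dbpedia.org/resource/Denmark",
)
for triple in triples:
    print(triple)
```

Because the span URI encodes its own offsets, a second tool can attach further annotations to the same substring without coordinating identifiers with the first.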

LIDER Towards a linguistic linked data ecosystem

Website: http://lider-project.eu
Guidelines: http://lider-project.eu/?q=guidelines

NIF

LIDER - Deliverable 2.1.2

http://www.lider-project.eu/sites/default/files/D2.1.2-Phase-II.pdf

LIDER Reference Architecture Deliverable 3.1.2.

General:

lemon - developed by

http://www.lider-project.eu/sites/default/files/D3.1.2-v2.0.pdf

Challenges and Work in Progress

Identifier management
- Ideal identifiers are stable, i.e. the meaning behind the URI does not change
- Unrealistic for most use cases
- Easier for individuals, e.g. persons, organisations
- Non-trivial for terminology

Proposals:

1. Apply software development practices, e.g. versioning, update scripts http://vocol.org , http://github.org , http://aligned-project.eu
2. ??

Knowledge Fusion
- Linking is mostly done manually
- Linking 200 datasets pairwise requires maintenance of about 40000 mappings
- Adding one after the other depends on the merge order
- Ideally we would be able to structure all datasets into clusters before linking
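
The mapping count and the merge-order problem can both be made concrete. A small sketch: with n = 200, direct pairwise linking needs n(n-1) = 39800 directed mappings (19900 if one mapping covers both directions), which the slide rounds to 40000. The shared-pivot alternative and the "first source wins" conflict policy are assumptions chosen for illustration, and the two sources are made-up data.

```python
from typing import Dict, List


def pairwise_mappings(n: int, directed: bool = True) -> int:
    """Number of dataset-to-dataset mappings when every pair is linked directly."""
    return n * (n - 1) if directed else n * (n - 1) // 2


def hub_mappings(n: int) -> int:
    """Mappings needed if every dataset is instead linked once to a shared pivot."""
    return n


def merge_first_wins(sources: List[Dict[str, str]]) -> Dict[str, str]:
    """Fuse sources one after the other; on conflict the earlier source wins."""
    fused: Dict[str, str] = {}
    for source in sources:
        for key, value in source.items():
            fused.setdefault(key, value)
    return fused


print(pairwise_mappings(200), pairwise_mappings(200, directed=False), hub_mappings(200))

# Two hypothetical sources that disagree about the same key.
a = {"ex:Copenhagen": "city"}
b = {"ex:Copenhagen": "municipality"}
ab = merge_first_wins([a, b])
ba = merge_first_wins([b, a])
print(ab, ba, ab != ba)
```

Clustering datasets before linking, as the last bullet suggests, would replace the quadratic term with a count closer to the pivot case.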

Proposals:

1. Under discussion with: Erhard Rahm - The Case for Holistic Data Integration ADBIS 2016 Keynote: http://adbis2016.vsb.cz/keynote/ (to appear)

2. Apply software development processes: https://github.com/dbpedia/links

The Metadata Challenge
Where to publish metadata for your data?

- Barrier between data and dataset description
- Stale metadata
- Single point of truth missing
- Metadata too heterogeneous
- Download link missing
- No (sufficiently) complete view over the web of data possible, discovery failure

Proposals:

1. Build an index: http://linghub.lider-project.eu/ (Clarin, LRE Map, Metashare, Datahub)
2. Create a better schema: http://dataid.dbpedia.org and provide benefits for complying

MMoOn
- LIDER
- Lemon
- ODRL
- Olia
- NIF
- Morphology quite complex
- Specific to language and to the linguist
- http://mmoon.org

The Metadata Challenge 2
● RDF structure is too simple to keep additional metadata
  ○ Scope
  ○ Validity
  ○ Confidence
  ○ Technical metadata, e.g. collection time
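
The gap can be shown by comparing a bare triple with a record that carries the four kinds of metadata listed above. This is an illustrative data structure only, with made-up `ex:` terms and values; it is not an RDF mechanism, and RDF-side workarounds such as reification or named graphs are not modelled here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    obj: str
    scope: Optional[str] = None          # e.g. a subject field or register
    valid_from: Optional[date] = None    # validity interval, open-ended if None
    valid_until: Optional[date] = None
    confidence: Optional[float] = None   # 0.0 to 1.0
    collected_on: Optional[date] = None  # technical metadata

    def as_triple(self) -> Tuple[str, str, str]:
        """Project onto what a plain triple can hold."""
        return (self.subject, self.predicate, self.obj)

    def is_valid_on(self, day: date) -> bool:
        """True if `day` falls inside the (possibly open-ended) validity interval."""
        after_start = self.valid_from is None or self.valid_from <= day
        before_end = self.valid_until is None or day <= self.valid_until
        return after_start and before_end


# Hypothetical statements: the same triple in two different contexts.
legal = Statement(
    "ex:term1", "ex:definedAs", "ex:conceptA",
    scope="legal", valid_until=date(2015, 12, 31),
    confidence=0.9, collected_on=date(2016, 6, 1),
)
medical = Statement(
    "ex:term1", "ex:definedAs", "ex:conceptA",
    scope="medical", confidence=0.6, collected_on=date(2016, 6, 1),
)

print(legal.as_triple() == medical.as_triple())  # context is lost in the projection
print(legal == medical)
print(legal.is_valid_on(date(2016, 1, 1)), medical.is_valid_on(date(2016, 1, 1)))
```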

Contextualisation is probably already better researched in lexicography than in Semantic Web.

Future work and take home messages

Data quality and verification

● Data quality can be defined and measured with tools.
● Test-driven Evaluation of Linked Data Quality by Dimitris Kontokostas, Patrick Westphal, Sören Auer, Sebastian Hellmann, Jens Lehmann, Roland Cornelissen, and Amrapali J. Zaveri in Proceedings of the 23rd International Conference on World Wide Web http://svn.aksw.org/papers/2014/WWW_Databugger/public.pdf
● Current standard:
  ○ https://www.w3.org/TR/shacl/
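
The idea of testing data like software can be sketched in a few lines. The two checks below are plain-Python analogues of cardinality constraints (in the spirit of a minimum and a maximum count per property); they do not use the API of the cited test framework or of any SHACL engine, and the `ex:` vocabulary and sample triples are made up.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def missing_property(triples: List[Triple], cls: str, prop: str) -> List[str]:
    """Instances of `cls` that have no value for `prop`."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    having = {s for s, p, _o in triples if p == prop}
    return sorted(instances - having)


def multiple_values(triples: List[Triple], prop: str) -> List[str]:
    """Subjects that have more than one distinct value for `prop`."""
    values: Dict[str, Set[str]] = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            values[s].add(o)
    return sorted(s for s, objs in values.items() if len(objs) > 1)


def run_tests(triples: List[Triple]) -> Dict[str, List[str]]:
    """Run every check and return the violating subjects per test name."""
    return {
        "every ex:Term has an ex:label": missing_property(triples, "ex:Term", "ex:label"),
        "at most one ex:label per subject": multiple_values(triples, "ex:label"),
    }


DATA: List[Triple] = [
    ("ex:t1", RDF_TYPE, "ex:Term"), ("ex:t1", "ex:label", "alpha"),
    ("ex:t2", RDF_TYPE, "ex:Term"),
    ("ex:t3", RDF_TYPE, "ex:Term"), ("ex:t3", "ex:label", "gamma"), ("ex:t3", "ex:label", "Gamma"),
]

report = run_tests(DATA)
for name, violations in report.items():
    print("PASS" if not violations else "FAIL", name, violations)
```

An empty list means the test passes, so the report can gate a data release in the same way a failing unit test gates a software release.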

Open licenses in research

Are you willing to publish your data under an open license?

Can you make a product out of your data?

No

Yes

Start

Congratulations, your paper has been accepted

Yes

Good luck, we wish you all the best and a high profit

No

Entity Linking Verification - new translator job profile

● http://www.freme-project.eu/
● Business Case: Integrating semantic enrichment into multilingual content in translation and localisation
● In the future, translators and lexicographers might be asked to judge entity linking and verify data

Should I invest in publishing linked data?

Long-term data strategy, if you:

● Have many expected inbound links
● Persistent ids
● Long term hosting and curation

are no problem for you

-> yes (data value increases)

One time thing:

● Interest of externals only in the yellow zone

-> Publish under open license (let someone else do it)

DBpedia Association
DBpedia+

● Maintain identifier space
● Add open and member data to DBpedia+
● Add data following the LIDER guidelines
● Ability to add your backlinks

DBpedia Community meeting on the 15th of September in Leipzig

Events in 2016
● KEKI 2016 Workshop - Uses of Linguistic Linked Open Data http://keki2016.linguistic-lod.org/ Deadline is 1st of July, but might be extended
● http://2016.semantics.cc

Thank you