
Semantic Enrichment of Open Data
Or: How to build an Open Data Knowledge Graph

Sebastian Neumaier
https://sebneumaier.wordpress.com/

Advisor: Dr. Axel Polleres
Reviewers: Dr. Christian Bizer, Dr. Elena Simperl

Rigorosum, TU Wien, November 20, 2019
Slides: tiny.cc/sebneu

1


Open Data comes in various ways

2

CSV (3-star)

Excel (2-star)

PDF (1-star)

82 data portals 160K datasets

Unknown format (1-star)

RDF? Not significant

Tim Berners-Lee’s 5-star deployment scheme for Open Data

Available data is only partially structured and not linked

Umbrich, J., Neumaier, S., Polleres, A.: Quality assessment & evolution of open data portals. International Conference on Open and Big Data (2015)

3

Metadata vs. Data: Two dimensions of Open Data

E.g. data.gv.at:

Metadata: attribute-value pairs
Data: CSVs, spreadsheets, PDFs, etc.

4

10 years of open-data government initiatives

Some milestones:

● 2009: First governmental data portals (data.gov.uk & data.gov)
● 2011: Austrian portal (data.gv.at)
● 2012: EU data portal (data.europa.eu)
● 2015: EU harvesting portal (europeandataportal.eu)
● 2018: Google dataset search (toolbox.google.com/datasetsearch)

5

Open Data as a Global Trend

Data portals of the G8 countries:

Country         URL                Datasets
United States   data.gov           304k
Canada          open.canada.ca     81k
UK              data.gov.uk        46.5k
France          www.data.gouv.fr   36.4k
Japan           data.go.jp         22.4k
Russia          data.gov.ru        21.5k
Germany         govdata.de         21.5k
Italy           dati.gov.it        20.5k

6

What do we find on Open Data Portals?

7


8


9

Hypothesis

Given a corpus of tabular Open Data resources, the metadata descriptions can be enriched, and therefore the data quality can be increased, by semantically analyzing Open Data CSVs, assigning semantic labels, and integrating the extracted knowledge into a graph.

10

Approach

Metadata:
● Monitoring and analysis of Open Data portals: identify quality issues and build a corpus
● Evaluate applicability of existing techniques: find improvement strategies and (re-)publish as homogenized/standardized Linked Data

Data:
● Labeling and classification of numerical values: develop a method to find and rank candidates of semantic context descriptions for numerical values
● Extraction of spatial and temporal information: resolve entities in metadata descriptions and resources, and add links to the respective locations and time periods

11




Extraction of spatial and temporal information

identify metadata quality issues and build a corpus

12

Openness & Transparency Evaluations in the Literature

● [Reiche et al., 2014]: Small sample set of 10 portals; automated assessment of completeness, accuracy, readability, misspelling, …

● [Zaveri et al., 2015]: Survey on Linked Open Data quality assessment; transparency and openness aspects not covered

● Global Open Data Index: Manual expert evaluation of defined data categories/key datasets

● [Veljković et al., 2014]: A theoretical model for openness in e-government: primary, authenticity, understandability, ...

13

Quality Metrics

Sebastian Neumaier, Jürgen Umbrich, and Axel Polleres. Automated quality assessment of metadata across open data portals. ACM Journal of Data and Information Quality (JDIQ), 2016.

● 16 specific metrics along 4 dimensions
● automated and scalable assessment

14

Open Data Portal Watch

Evolution of Portals and Metrics

https://data.wu.ac.at/portalwatch/

15


Identified Challenges

● Metadata is heterogeneous and (partially) messy
➢ Software-specific metadata (CKAN vs Socrata vs …)
➢ Portal-specific metadata
➢ Missing metadata (file formats, API descriptions, …)

● Metadata not available as Linked Data
➢ Only partially in common vocabulary

● Poor discoverability of datasets
➢ No content information in metadata (e.g., CSV headers)
➢ Datasets’ metadata not optimized for search engines

16


Use of existing techniques



to improve and (re-)publish the metadata as homogenized/standardized Linked Data

17

Approach

1 Mapping to standard vocabularies

2 Enrich the datasets

3 Enable access

18

1 Mapping to standard vocabularies

● Mappings of metadata from CKAN, Socrata and OpenDataSoft portals
➢ Including mappings of most frequent (portal/domain specific) metadata fields

● Mapping and publishing of all descriptions as Schema.org:
➢ Enables the integration into knowledge graphs of major search engines
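
The mapping step on this slide can be illustrated with a minimal sketch. It assumes a CKAN-style package dictionary with the fields `title`, `notes`, `license_url` and `resources`, and maps it to a Schema.org `Dataset` in JSON-LD. The function name `ckan_to_schema_org` and the example record are hypothetical, and the actual mappings used in the thesis cover many more fields and portal software.

```python
import json


def ckan_to_schema_org(package: dict) -> dict:
    """Map a CKAN-style package dict to a Schema.org Dataset (JSON-LD).

    Simplified illustration: only title, description, license and
    resource URLs/formats are mapped.
    """
    dataset = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": package.get("title", ""),
        "description": package.get("notes", ""),
        "distribution": [
            {
                "@type": "DataDownload",
                "contentUrl": res["url"],
                "encodingFormat": res.get("format", ""),
            }
            for res in package.get("resources", [])
            if res.get("url")  # skip resources without a download URL
        ],
    }
    if package.get("license_url"):
        dataset["license"] = package["license_url"]
    return dataset


# Hypothetical example record (not taken from a real portal).
example_package = {
    "title": "Population by district",
    "notes": "Population counts per district, year and sex.",
    "license_url": "https://creativecommons.org/licenses/by/4.0/",
    "resources": [{"url": "https://example.org/population.csv", "format": "CSV"}],
}

mapped = ckan_to_schema_org(example_package)
print(json.dumps(mapped, indent=2))
```

Embedding such a JSON-LD object in a dataset's HTML page is what allows search engines to pick up the description.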

19



[Figure labels: CSV file; CSV metadata; Mapping to standard vocabularies]

20

● Portal Watch quality dimensions:
➢ Measurements per dataset

[Figure labels: CSV file; CSV metadata; Quality Measures]

21


● Metadata for tables:
➢ CSV dialect, headers, column types, ...

[Figure labels: CSV file; CSV metadata]
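
As an illustration of the table metadata listed on this slide (CSV dialect, headers, column types), the following sketch derives them with Python's standard `csv` module. It assumes that the first row is the header and uses a deliberately simple type guess; the helper names `infer_type` and `describe_csv` are hypothetical and not part of the presented system.

```python
import csv
import io


def infer_type(values):
    """Guess 'integer', 'number' or 'string' for a list of cell strings."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if values and all_parse(int):
        return "integer"
    if values and all_parse(float):
        return "number"
    return "string"


def describe_csv(text: str) -> dict:
    """Detect the delimiter and describe each column (assumes a header row)."""
    dialect = csv.Sniffer().sniff(text, delimiters=";,\t|")
    rows = [row for row in csv.reader(io.StringIO(text), dialect) if row]
    header, body = rows[0], rows[1:]
    columns = [
        {"name": name, "datatype": infer_type([row[i] for row in body])}
        for i, name in enumerate(header)
    ]
    return {"delimiter": dialect.delimiter, "columns": columns}


# Rows taken from the example table in the slides.
sample = (
    "NUTS2;LAU2_NAME;YEAR;SEX;AGE_TOTAL\n"
    "AT31;Linz;2013;1;98157\n"
    "AT31;Steyr;2013;1;18763\n"
    "AT31;Wels;2013;1;29730\n"
)

csv_metadata = describe_csv(sample)
print(csv_metadata)
```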

22


● Metadata for tables:
➢ CSV dialect, headers, column types, ...

● Record provenance:
➢ Versioning, track modifications

[Figure labels: CSV file; CSV metadata]

23

3 Enable access

● SPARQL endpoint:
➢ Versions (snapshots) stored as named graphs
➢ ~180 million triples each

● Access & query historical data:
➢ Timestamp-based interfaces

● Schema.org via sitemap.xml:
➢ Publishing of all datasets as HTML-embedded Schema.org
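
To make the "snapshots as named graphs" idea concrete, here is a small sketch that builds a SPARQL query restricted to one snapshot graph. The graph IRI scheme (`http://example.org/portalwatch/snapshot/<date>`) and the function names are assumptions for illustration only, as is the choice of DCAT terms; the real endpoint may name its graphs and describe its datasets differently.

```python
from datetime import date

# Assumed IRI prefix for snapshot graphs; purely illustrative.
SNAPSHOT_BASE = "http://example.org/portalwatch/snapshot/"


def snapshot_graph_iri(snapshot: date, base: str = SNAPSHOT_BASE) -> str:
    """Build the (hypothetical) named-graph IRI for one snapshot date."""
    return f"{base}{snapshot.isoformat()}"


def datasets_in_snapshot_query(snapshot: date, limit: int = 10) -> str:
    """Return a SPARQL query listing dataset titles within one snapshot."""
    graph = snapshot_graph_iri(snapshot)
    return (
        "PREFIX dcat: <http://www.w3.org/ns/dcat#>\n"
        "PREFIX dct: <http://purl.org/dc/terms/>\n"
        "SELECT ?dataset ?title\n"
        f"WHERE {{ GRAPH <{graph}> {{\n"
        "  ?dataset a dcat:Dataset ; dct:title ?title .\n"
        "} }\n"
        f"LIMIT {limit}"
    )


query = datasets_in_snapshot_query(date(2017, 4, 3))
print(query)
```

Because each version lives in its own graph, querying historical metadata only requires swapping the graph IRI for another timestamp.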

Sebastian Neumaier, Jürgen Umbrich, and Axel Polleres. Lifting data portals to the Web of Data. In WWW ’17 Workshop on Linked Data on the Web (LDOW2017), Perth, Australia, April 2017.

24

Example Table

federal state   district   year   sex    population
Upper Austria   Linz       2013   male   98157
Upper Austria   Steyr      2013   male   18763
Upper Austria   Wels       2013   male   29730
…               …          …      …      …

25

Open Data CSVs look more like this

Source: https://www.data.gv.at/katalog/dataset/e108dcc3-1304-4076-8619-f2185c37ef81

NUTS2   LAU2_NAME   YEAR   SEX   AGE_TOTAL
AT31    Linz        2013   1     98157
AT31    Steyr       2013   1     18763
AT31    Wels        2013   1     29730
…       …           …      …     …

26






develop a method to find and rank candidates of semantic context descriptions for numerical values

27

Use the numeric values in the tables

▪ Identifying the most likely semantic label for a bag of numerical values

▪ Deliberately ignore surroundings


AT31 Linz 2013 1 98157

AT31 Steyr 2013 1 18763

AT31 Wels 2013 1 29730

… … … …

28




98157

18763

29730

…

29


population (a district) (country Austria)

98157

18763

29730

…



30

Background Knowledge Graph

▪ Find properties with numerical range

▪ Hierarchical clustering approach

▪ Two hierarchical layers:

▪ Type hierarchy (using OWL classes)

▪ Property-object hierarchy (shared property-object pairs)

31

Label based on Nearest Neighbors
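
The nearest-neighbour idea named in this slide title can be sketched as follows. A bag of numbers is summarised by simple features (min, max, mean, standard deviation), compared to labelled bags from a background knowledge base by Euclidean distance, and labelled by majority vote over the k nearest bags. The background bags below are invented toy data, except the population values from the example table, and the flat list of bags stands in for the hierarchical background knowledge graph of the actual approach.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def features(values):
    """Summarise a bag of numbers as (min, max, mean, stddev)."""
    return (min(values), max(values), mean(values), pstdev(values))


def euclidean(a, b):
    """Euclidean distance between two feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_label(values, background, k=3):
    """Label a bag by majority vote over its k nearest background bags."""
    target = features(values)
    ranked = sorted(
        background, key=lambda item: euclidean(target, features(item[1]))
    )
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy background knowledge: (label, bag of values). Illustrative only.
background = [
    ("population", [98157, 18763, 29730]),
    ("population", [120000, 45000, 31000]),
    ("population", [75000, 22000, 15000]),
    ("elevation", [266, 310, 317]),
    ("elevation", [171, 494, 574]),
    ("year", [2011, 2012, 2013]),
]

predicted = knn_label([101000, 19500, 30100], background, k=3)
print(predicted)
```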


32

Labelling Results

33

populationTotal (a Settlement) populationDensity (a City)


Lessons Learned

● We can assign fine-grained semantic labels
➢ If there is enough evidence in the background knowledge (BK)

● However: Missing domain knowledge for labelling Open Data (OD)

Conclusions:

● Complementary to existing approaches (column header labeling, entity linking and relation extraction)
➢ Combined approaches may improve results

● Focusing on core dimensions of specific domains, e.g. city data, may be more promising than “general” value labelling

Sebastian Neumaier, Jürgen Umbrich, Josiane Xavier Parreira, and Axel Polleres. Multi-level semantic labelling of numerical values. In Proceedings of the 15th International Semantic Web Conference (ISWC 2016), Kobe, Japan, October 2016. Nominated for best student paper award.

34


AT31 Linz 2013 1 98157

AT31 Steyr 2013 1 18763

AT31 Wels 2013 1 29730

… … … …

Focus on specific dimensions:

▪ Particularly temporal and geospatial queries require better support [2]

What else can we do/use?

[2] Emilia Kacprzak, et al.: Characterising dataset search: An analysis of search logs and data requests. Journal of Web Semantics (2019)

35





resolve entities in metadata descriptions and resources, and add links to the respective locations and time periods

36

Available Geospatial Knowledge Bases

37

Geo-Knowledge Graph Construction

[Figure labels: Wikidata links; European Classification of Territorial Units; Wikidata, GeoNames; Mapping OSM entities to GeoNames regions; Extracting OSM streets and places]

38

Available Temporal Knowledge

39


Temporal Knowledge Graph Construction

● Named events and their labels

● Links to parent periods

● Links to the spatial coverage

● Temporal extent: a single start and end date
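
The four properties listed above (label, parent period, spatial coverage, single start and end date) can be pictured as a small data structure. This is only a sketch; the class name `Period` and its fields are hypothetical and do not reflect the vocabulary actually used in the knowledge graph.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Period:
    """A named period or event with one start date and one end date."""
    label: str
    start: date
    end: date
    parent: Optional["Period"] = None       # link to the parent period
    spatial_coverage: Optional[str] = None  # link to a spatial entity

    def contains(self, day: date) -> bool:
        """True if the given day falls inside the temporal extent."""
        return self.start <= day <= self.end

    def lineage(self) -> List[str]:
        """Labels from this period up through its parent periods."""
        labels, node = [], self
        while node is not None:
            labels.append(node.label)
            node = node.parent
        return labels


year_2014 = Period("2014", date(2014, 1, 1), date(2014, 12, 31))
world_cup = Period(
    "2014 FIFA World Cup",
    date(2014, 6, 12),
    date(2014, 7, 13),
    parent=year_2014,
    spatial_coverage="Brazil",
)

print(world_cup.contains(date(2014, 7, 1)), world_cup.lineage())
```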

40

Spatio-temporal labelling

Table cell value disambiguation

▪ Row context:

▪ Filter candidates by potential parents (if available)

▪ Column context:

▪ Least common ancestor of the spatial entities
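
The column-context heuristic above can be sketched with a simple parent map. The least common ancestor of the unambiguous entities in a column is computed first, and an ambiguous cell is then resolved to the candidate that lies below that ancestor. The tiny hierarchy and the function names are illustrative assumptions, not the geo-knowledge graph itself.

```python
def ancestors(entity, parent):
    """Path from an entity up to the root, including the entity itself."""
    path = [entity]
    while entity in parent:
        entity = parent[entity]
        path.append(entity)
    return path


def least_common_ancestor(entities, parent):
    """Deepest node shared by the ancestor paths of all entities."""
    paths = [ancestors(e, parent) for e in entities]
    common = set(paths[0]).intersection(*map(set, paths[1:]))
    # Paths run bottom-up, so the first shared node is the deepest one.
    return next((node for node in paths[0] if node in common), None)


def disambiguate(candidates, column_entities, parent):
    """Pick the candidate located below the column's least common ancestor."""
    lca = least_common_ancestor(column_entities, parent)
    matching = [c for c in candidates if lca in ancestors(c, parent)]
    return matching[0] if matching else None


# Toy hierarchy (child -> parent), for illustration only.
parent = {
    "Linz (AT)": "Upper Austria",
    "Steyr": "Upper Austria",
    "Wels": "Upper Austria",
    "Upper Austria": "Austria",
    "Linz am Rhein": "Rhineland-Palatinate",
    "Rhineland-Palatinate": "Germany",
    "Austria": "Europe",
    "Germany": "Europe",
}

column_lca = least_common_ancestor(["Steyr", "Wels"], parent)
resolved = disambiguate(["Linz am Rhein", "Linz (AT)"], ["Steyr", "Wels"], parent)
print(column_lca, resolved)
```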

Metadata descriptions

▪ Restrict annotation to origin country

▪ Temporal tagging using the Heideltime framework [3]

[3] Strötgen, Gertz: Multilingual and Cross-domain Temporal Tagging. Language Resources and Evaluation, 2013.

41

Evaluations

Sample evaluation on record level:

● 11 portals
● 10 random CSVs
● 10 random rows per dataset
● i.e. 1100 inspected values

Discussion:

● Partially incomplete knowledge
● Incomplete mapping of OSM
● Heuristics for portal-specifics would be required

Sebastian Neumaier and Axel Polleres. Enabling Spatio-Temporal Search in Open Data. Journal of Web Semantics (JWS), 2018.

42

Demo: Geo-entity Search “Leopoldstadt”

http://data.wu.ac.at/odgraphsearch/

43


The Portal Watch focuses on continuous metadata quality and archiving. A scalable, continuous profiling and archiving of the actual data is still missing.



(Re-)publish the improved dataset descriptions as Linked Data


Conclusions & Critical Discussions

44







We developed the methods, and publish mapped and enriched metadata. Ideally, however, the users find the LD endpoints and rich descriptions already at the data portals.

45








We can assign fine-grained semantic labels if there is enough evidence in BK. However, there is missing domain knowledge for labelling OD.

46








We annotate CSV tables and metadata at scale, with links to spatial and temporal entities, and a search and query interface.

It is still open if our approach is generalisable for other entities such as categories, (governmental) organisations, etc.


47

Impact

● PhD builds on top of my master thesis (winner of the OCG-Förderpreis by the Austrian Computer Society)

● Publications:
➢ Automated quality assessment of metadata [OBD 2015] (best paper award), [JDIQ 2016]
➢ Measures for assessing the data freshness/up-to-dateness [OBD 2016]
➢ Comparison of metadata quality [GIQ 2018]
➢ Lifting data portals to the Web of Data [LDOW 2017]
➢ Labelling of numerical values [ISWC 2016] (nominated for best student paper)
➢ Geo-semantic labelling of open data [SEMANTiCS 2018]
➢ Enabling Spatio-Temporal Search in Open Data [JWS 2019]

● Project & Community Work
➢ FFG Projects on Open Data: ADEQUATE, Communidata
➢ W3C working groups: CSV on the Web, Dataset Exchange (DXWG)
➢ Open Source projects: github.com/sebneu

● Integration of dataset assessments & improvements in data.gv.at
● Re-published corpus gets harvested by Google Dataset Search

48

http://data.gv.at

https://toolbox.google.com/datasetsearch/search?query=site%3Adata.wu.ac.at&docid=8069xaowV5gVk1xpAAAAAA%3D%3D

Backup Slides

49

Given a corpus of tabular Open Data resources, the metadata descriptions can be enriched, and therefore the data quality can be increased, by semantically analyzing Open Data CSVs, assigning semantic labels, and integrating the extracted knowledge into a graph.

How can we use existing Semantic Web technologies?

➢ Report and analysis of current OD in order to select/filter methods

How to best describe and publish datasets?

➢ Standardized W3C vocabularies and interfaces to publish Linked Data and enable integration

How to best find and assign semantic labels to datasets?

➢ Labeling of numeric data, extracting spatial & temporal information

Research Question & Hypothesis

50


Datasets and resources of the monitored portals

51


Numerical Labelling

53

Evaluation Setup

• Data
  • DBpedia 3.9
  • 50 most frequent numerical properties

• Distance functions
  • Euclidean distance (min, max, mean, stddev)
  • distribution similarity (Kolmogorov-Smirnov (KS) distance)
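
For reference, the two-sample Kolmogorov-Smirnov distance named above is the largest gap between the empirical distribution functions of two bags of values. A minimal pure-Python sketch (the function name `ks_distance` is chosen here for illustration):

```python
from bisect import bisect_right


def ks_distance(sample_a, sample_b):
    """Two-sample KS statistic: max |F_a(x) - F_b(x)| over all data points."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_values, x):
        # Fraction of values that are <= x.
        return bisect_right(sorted_values, x) / len(sorted_values)

    # The supremum of the gap is always reached at one of the sample points.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)


same = ks_distance([98157, 18763, 29730], [98157, 18763, 29730])
disjoint = ks_distance([1, 2, 3], [10, 20, 30])
print(same, disjoint)
```

A distance of 0 means identical empirical distributions, and 1 means the two bags do not overlap at all. Unlike the feature-based Euclidean distance, this compares the whole shape of the value distribution.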

54


• Aggregation function
  • majority vote and average distance

• Aggregation levels
  • property
  • exact type

• 30 GB RAM
• 3 different knowledge bases

Evaluation Setup

55

• train/test split: 80/20
• 20% of the subjects for each property as test data
• test context graph: similar to background construction, however, without constraints
• randomly select leaf nodes
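
The per-property 80/20 split described above could look roughly like the following sketch. The function name, the fixed random seed and the subject identifiers are assumptions for illustration; the construction of the test context graph is not covered here.

```python
import random


def split_subjects(subjects_by_property, test_fraction=0.2, seed=42):
    """Hold out a fraction of the subjects of each property as test data."""
    rng = random.Random(seed)  # fixed seed only to keep the sketch repeatable
    train, test = {}, {}
    for prop, subjects in subjects_by_property.items():
        shuffled = list(subjects)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * test_fraction)
        test[prop] = shuffled[:cut]
        train[prop] = shuffled[cut:]
    return train, test


# Hypothetical subject identifiers for one numerical property.
subjects = {"populationTotal": [f"subject_{i}" for i in range(10)]}
train_set, test_set = split_subjects(subjects)
print(len(train_set["populationTotal"]), len(test_set["populationTotal"]))
```

Splitting per property rather than over the whole corpus keeps every property represented in both the background knowledge and the test data.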

Test/Training Data

56

• Best: Kolmogorov-Smirnov (KS) distance

exact = correct property, type and p-o

prop = correct property

type = correct type

stype = correct super type

Evaluation: Distance Measure

57

• 9% of test nodes are contained 1-1 in knowledge graph !!

• aggregation

• majority and average vote

• different neighbours

• majority vote slightly better

• more neighbours also better

Evaluation: Large-Scale (33657 Test Nodes)

58

● labelling numerical columns
● manual inspection of top 100 tables (based on distance)

Findings

• Dealing with timeline data: values for different time points -> not in DBpedia

• missing domain knowledge: reports about spending, election results, tourism

• Aggregation of column scores: especially for type detection (majority vote over column types)

• Combine with complementary approaches

Evaluation: Open Data Tables

59

● [Nguyen et al., 2019]: EmbNum+: Effective, Efficient, and Robust Semantic Labeling for Numerical Values
➢ neural embedding for learning representations and similarity metric from numerical columns

● [Kacprzak et al., 2018]: Making Sense of Numerical Data - Semantic Labelling of Web Tables

● [Alobaid et al., 2018]: Fuzzy Semantic Labeling of Semi-structured Numerical Datasets

● [Alobaid et al., 2019]: Typology-based Semantic Labeling of Numeric Tabular Data
➢ taking into account different kinds of numeric values

60

Related follow-up work on labeling numerical values in OD

Spatio-temporal Labelling

61

Open Data is about Locations

62

Faceted query interface:

▪ Full-text queries

▪ Geo-entity queries

▪ Timespan & Time pattern

▪ SPARQL endpoint

Back end:

▪ MongoDB for efficient key look-ups

▪ ElasticSearch for indexing and full-text queries

▪ Virtuoso as a triple store

Interface

63
