Semantic Enrichment of Open Data
Or: How to build an Open Data Knowledge Graph
Sebastian Neumaier https://sebneumaier.wordpress.com/
Advisor: Dr. Axel Polleres
Reviewers: Dr. Christian Bizer, Dr. Elena Simperl
Rigorosum, TU Wien, November 20, 2019 Slides: tiny.cc/sebneu
Open Data comes in various ways
82 data portals, 160K datasets:
CSV (3-star)
Excel (2-star)
PDF (1-star)
Unknown format (1-star)
RDF? Not significant
Tim Berners-Lee’s 5-star deployment scheme for Open Data
Available data is only partially structured and not linked
Umbrich, J., Neumaier, S., Polleres, A.: Quality assessment & evolution of open data portals. International Conference on Open and Big Data (2015)
Metadata vs. Data: Two dimensions of Open Data
E.g. data.gv.at:
Metadata: attribute-value pairs
Data: CSVs, spreadsheets, PDFs, etc.
10 years of open-data government initiatives
Some milestones:
2009: First governmental data portals (data.gov.uk & data.gov)
2011: Austrian portal (data.gv.at)
2012: EU data portal (data.europa.eu)
2015: EU harvesting portal (europeandataportal.eu)
2018: Google dataset search (toolbox.google.com/datasetsearch)
Open Data as a Global Trend
Data portals of the G8 countries:
Country         URL               Datasets
United States   data.gov          304k
Canada          open.canada.ca    81k
UK              data.gov.uk       46.5k
France          www.data.gouv.fr  36.4k
Japan           data.go.jp        22.4k
Russia          data.gov.ru       21.5k
Germany         govdata.de        21.5k
Italy           dati.gov.it       20.5k
What do we find on Open Data Portals?
Hypothesis
Given a corpus of tabular Open Data resources, the metadata descriptions can be enriched, and therefore the data quality can be increased, by semantically analyzing Open Data CSVs, assigning semantic labels, and integrating the extracted knowledge into a graph.
Approach
Metadata
● Monitoring and analysis of Open Data portals: identify quality issues and build a corpus
● Evaluate applicability of existing techniques: find improvement strategies and (re-)publish as homogenized/standardized Linked Data
Data
● Labeling and classification of numerical values: develop a method to find and rank candidates of semantic context descriptions for numerical values
● Extraction of spatial and temporal information: resolve entities in metadata descriptions and resources, and add links to the respective locations and time periods
Monitoring and analysis of Open Data portals: identify metadata quality issues and build a corpus
Openness & Transparency Evaluations in the Literature
● [Reiche et al., 2014]: Small sample set of 10 portals; automated assessment of completeness, accuracy, readability, misspelling, …
● [Zaveri et al., 2015]: Survey on Linked Open Data quality assessment; transparency and openness aspects not covered
● Global Open Data Index: Manual expert evaluation of defined data categories/key datasets
● [Veljković et al., 2014]: A theoretical model for openness in e-government: primary, authenticity, understandability, ...
Quality Metrics
● 16 specific metrics along 4 dimensions
● automated and scalable assessment
Sebastian Neumaier, Jürgen Umbrich, and Axel Polleres. Automated quality assessment of metadata across open data portals. ACM Journal of Data and Information Quality (JDIQ), 2016.
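To give an idea of what an automated, scalable metadata metric can look like, here is a minimal sketch of a completeness-style score. The expected keys and the scoring rule are assumptions made for illustration; they are not the 16 metrics from the paper.

```python
# Hypothetical set of metadata keys we expect a dataset description to fill in.
EXPECTED_KEYS = ("title", "description", "license", "author", "format")


def completeness(metadata, expected=EXPECTED_KEYS):
    """Fraction of expected keys that carry a non-empty value."""
    filled = sum(1 for key in expected if str(metadata.get(key) or "").strip())
    return filled / len(expected)


def portal_score(datasets):
    """Average the per-dataset scores to obtain one value per portal."""
    if not datasets:
        return 0.0
    return sum(completeness(d) for d in datasets) / len(datasets)


# Invented example description: "description" is empty and "author" is missing.
example = {
    "title": "Population 2013",
    "description": "",
    "license": "CC-BY",
    "format": "CSV",
}
print(completeness(example))
```

Because the score only needs the metadata record itself, it can be computed per dataset and aggregated per portal without manual inspection.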
Open Data Portal Watch
Evolution of Portals and Metrics
https://data.wu.ac.at/portalwatch/
Identified Challenges
● Metadata is heterogeneous and (partially) messy
➢ Software-specific metadata (CKAN vs Socrata vs …)
➢ Portal-specific metadata
➢ Missing metadata (file formats, API descriptions, …)
● Metadata not available as Linked Data
➢ Only partially in common vocabulary
● Poor discoverability of datasets
➢ No content information in metadata (e.g., CSV headers)
➢ Datasets’ metadata not optimized for search engines
Use of existing techniques to improve and (re-)publish the metadata as homogenized/standardized Linked Data
Approach
1 Mapping to standard vocabularies
2 Enrich the datasets
3 Enable access
1 Mapping to standard vocabularies
● Mappings of metadata from CKAN, Socrata and OpenDataSoft portals
➢ Including mappings of most frequent (portal/domain specific) metadata fields
● Mapping and publishing of all descriptions as Schema.org:
➢ Enables the integration into knowledge graphs of major search engines
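The mapping step can be sketched as a small function that turns a CKAN-style package description into a Schema.org Dataset in JSON-LD. This is an illustrative assumption about the shape of such a mapping, not the mapping used in Portal Watch; the sample package and its example.org URL are invented.

```python
import json


def ckan_to_schema_org(package):
    """Map a CKAN-style package description to a Schema.org Dataset (JSON-LD)."""
    dataset = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": package.get("title"),
        "description": package.get("notes"),
        # Prefer a license URL; fall back to the human-readable license title.
        "license": package.get("license_url") or package.get("license_title"),
        "dateModified": package.get("metadata_modified"),
        "keywords": [tag["name"] for tag in package.get("tags", [])],
        "distribution": [
            {
                "@type": "DataDownload",
                "contentUrl": res.get("url"),
                "encodingFormat": res.get("format"),
            }
            for res in package.get("resources", [])
        ],
    }
    # Drop empty values so the output only states what the portal provided.
    return {k: v for k, v in dataset.items() if v not in (None, "", [])}


# Invented sample record using common CKAN field names.
sample_package = {
    "title": "Population by district",
    "notes": "",
    "license_title": "CC-BY",
    "metadata_modified": "2019-11-20T00:00:00",
    "tags": [{"name": "population"}],
    "resources": [{"url": "https://example.org/pop.csv", "format": "CSV"}],
}
print(json.dumps(ckan_to_schema_org(sample_package), indent=2))
```

Socrata and OpenDataSoft records would need their own field mappings onto the same target structure.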
2 Enrich the datasets
● Portal Watch quality dimensions:
➢ Measurements per dataset
● Metadata for tables:
➢ CSV dialect, headers, column types, ...
● Record provenance:
➢ Versioning, track modifications
[Figure labels: Open Data Portal Watch, CSV file, CSV metadata, Mapping to standard vocabularies, Quality Measures]
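A minimal sketch of how table metadata such as dialect, header and column types could be derived with the Python standard library. The semicolon-separated sample, the rule that the first row is the header, and the three type names are assumptions for illustration.

```python
import csv
import io


def infer_type(values):
    """Return 'integer', 'number' or 'string' for a list of cell strings."""
    for name, cast in (("integer", int), ("number", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue
    return "string"


def profile_csv(text):
    """Detect the delimiter, then report header and per-column types."""
    dialect = csv.Sniffer().sniff(text, delimiters=";,\t")
    rows = list(csv.reader(io.StringIO(text), dialect))
    header, body = rows[0], rows[1:]  # assumption: first row is the header
    columns = list(zip(*body))
    return {
        "delimiter": dialect.delimiter,
        "header": header,
        "types": {name: infer_type(col) for name, col in zip(header, columns)},
    }


# Sample rows taken from the slides; the ';' delimiter is an assumption.
SAMPLE = (
    "NUTS2;LAU2_NAME;YEAR;SEX;AGE_TOTAL\n"
    "AT31;Linz;2013;1;98157\n"
    "AT31;Steyr;2013;1;18763\n"
    "AT31;Wels;2013;1;29730\n"
)
profile = profile_csv(SAMPLE)
print(profile)
```

The resulting dictionary is the kind of per-resource description that can be attached to a dataset's metadata.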
3 Enable access
● SPARQL endpoint:
➢ Versions (snapshots) stored as named graphs
➢ ~180 million triples each
● Access & query historical data:
➢ Timestamp-based interfaces
● Schema.org via sitemap.xml:
➢ Publishing of all datasets as HTML-embedded Schema.org
Sebastian Neumaier, Jürgen Umbrich, and Axel Polleres. Lifting data portals to the Web of Data. In WWW ’17 Workshop on Linked Data on the Web (LDOW2017), Perth, Australia, April 2017.
Example Table
federal state   district   year   sex    population
Upper Austria   Linz       2013   male   98157
Upper Austria   Steyr      2013   male   18763
Upper Austria   Wels       2013   male   29730
…               …          …      …      …
Open Data CSVs look more like this
Source: https://www.data.gv.at/katalog/dataset/e108dcc3-1304-4076-8619-f2185c37ef81
NUTS2 LAU2_NAME YEAR SEX AGE_TOTAL
AT31 Linz 2013 1 98157
AT31 Steyr 2013 1 18763
AT31 Wels 2013 1 29730
… … … …
Labeling and classification of numerical values: develop a method to find and rank candidates of semantic context descriptions for numerical values
Use the numeric values in the tables
▪ Identifying the most likely semantic label for a bag of numerical values
▪ Deliberately ignore surroundings
NUTS2 LAU2_NAME YEAR SEX AGE_TOTAL
AT31 Linz 2013 1 98157
AT31 Steyr 2013 1 18763
AT31 Wels 2013 1 29730
… … … … …
Bag of values: 98157, 18763, 29730, …
Resulting label: population (a district) (country Austria)
Background Knowledge Graph
▪ Find properties with numerical range
▪ Hierarchical clustering approach
▪ Two hierarchical layers:
▪ Type hierarchy (using OWL classes)
▪ Property-object hierarchy (shared property-object pairs)
Label based on Nearest Neighbors
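The nearest-neighbour idea can be sketched as follows: compare an unlabelled bag of numbers against labelled bags from a background knowledge graph and return the closest labels. The sketch uses the two-sample Kolmogorov-Smirnov statistic as the distance; the background bags and their values are invented for illustration and the function names are hypothetical.

```python
from bisect import bisect_right


def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Invented background knowledge: semantic label -> bag of known values.
BACKGROUND = {
    "population (a district)": [98157, 18763, 29730, 56000, 120000],
    "year": [1999, 2005, 2013, 2018],
    "elevation in m (a mountain)": [2995, 3798, 2277, 3312],
}


def nearest_labels(values, background, k=2):
    """Rank labels by KS distance to the input bag and keep the k closest."""
    ranked = sorted(background, key=lambda label: ks_distance(values, background[label]))
    return ranked[:k]


query = [101000, 17500, 31000]  # an unlabelled numerical column
print(nearest_labels(query, BACKGROUND, k=1))
```

Only the value distribution is used here, in line with deliberately ignoring headers and surrounding cells.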
Labelling Results
populationTotal (a Settlement), populationDensity (a City)
Lessons Learned
● We can assign fine-grained semantic labels
➢ If there is enough evidence in the background knowledge (BK)
● However: Missing domain knowledge for labelling Open Data (OD)
Conclusions:
● Complementary to existing approaches (column header labeling, entity linking and relation extraction)
➢ Combined approaches may improve results
● Focusing on core dimensions of specific domains, e.g. city data, may be more promising than “general” value labelling
Sebastian Neumaier, Jürgen Umbrich, Josiane Xavier Parreira, and Axel Polleres. Multi-level semantic labelling of numerical values. In Proceedings of the 15th International Semantic Web Conference (ISWC 2016), Kobe, Japan, October 2016. Nominated for best student paper award.
What else can we do/use?
Focus on specific dimensions:
▪ Particularly temporal and geospatial queries require better support [2]
[2] Emilia Kacprzak, et al.: Characterising dataset search: An analysis of search logs and data requests. Journal of Web Semantics (2019)
Extraction of spatial and temporal information: resolve entities in metadata descriptions and resources, and add links to the respective locations and time periods
Available Geospatial Knowledge Bases
Geo-Knowledge Graph Construction
▪ Wikidata links
▪ European Classification of Territorial Units
▪ Wikidata, GeoNames
▪ Mapping OSM entities to GeoNames regions
▪ Extracting OSM streets and places
Available Temporal Knowledge
Temporal Knowledge Graph Construction
● Named events and their labels
● Links to parent periods
● Links to the spatial coverage
● Temporal extent: a single start and end date
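The four ingredients listed above suggest a simple record per temporal entity. The sketch below is one possible representation, assuming labels are used as identifiers; the class name, field names and the two sample periods are illustrative choices, not the schema of the actual graph.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Period:
    label: str                              # named event or period
    start: date                             # temporal extent: single start date
    end: date                               # temporal extent: single end date
    parent: Optional[str] = None            # label of the parent period
    spatial_coverage: Optional[str] = None  # label of a geo-entity


PERIODS = [
    Period("1960s", date(1960, 1, 1), date(1969, 12, 31)),
    Period(
        "1964 Winter Olympics",
        date(1964, 1, 29),
        date(1964, 2, 9),
        parent="1960s",
        spatial_coverage="Innsbruck",
    ),
]


def periods_covering(day, periods):
    """Return labels of all periods whose [start, end] interval contains the day."""
    return [p.label for p in periods if p.start <= day <= p.end]


print(periods_covering(date(1964, 2, 1), PERIODS))
```

With a start and end date per entity, a date found in a table cell or description can be linked to every period that contains it.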
Spatio-temporal labelling
Table cell value disambiguation
▪ Row context:
▪ Filter candidates by potential parents (if available)
▪ Column context:
▪ Least common ancestor of the spatial entities
Metadata descriptions
▪ Restrict annotation to origin country
▪ Temporal tagging using the Heideltime framework [3]
[3] Strötgen, Gertz: Multilingual and Cross-domain Temporal Tagging. Language Resources and Evaluation, 2013.
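The row and column context heuristics can be sketched over a child-to-parent hierarchy of geo-entities. The hierarchy below is a tiny hand-written example, and the function names and the fallback behaviour are assumptions; the real graph is built from GeoNames, Wikidata, NUTS and OSM.

```python
# Hypothetical excerpt of a spatial hierarchy: entity -> parent entity.
PARENT = {
    "Linz (AT)": "Upper Austria",
    "Steyr": "Upper Austria",
    "Wels": "Upper Austria",
    "Upper Austria": "Austria",
    "Austria": "Europe",
    "Linz am Rhein (DE)": "Rhineland-Palatinate",
    "Rhineland-Palatinate": "Germany",
    "Germany": "Europe",
}


def ancestors(entity, parent=PARENT):
    """Return the entity followed by all of its ancestors, nearest first."""
    chain = [entity]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def filter_by_row_context(candidates, row_entities, parent=PARENT):
    """Row context: keep candidates that have another entity of the row as an ancestor."""
    kept = [c for c in candidates if set(ancestors(c, parent)[1:]) & set(row_entities)]
    return kept or candidates  # no parent available in the row: keep all candidates


def least_common_ancestor(entities, parent=PARENT):
    """Column context: deepest entity that is an ancestor of (or equal to) every entity."""
    chains = [ancestors(e, parent) for e in entities]
    for candidate in chains[0]:
        if all(candidate in chain for chain in chains[1:]):
            return candidate
    return None


print(filter_by_row_context(["Linz (AT)", "Linz am Rhein (DE)"], ["Upper Austria"]))
print(least_common_ancestor(["Linz (AT)", "Steyr", "Wels"]))
```

In this sketch the row context resolves the ambiguous cell "Linz" through the neighbouring federal state, while the column-level ancestor gives a region that could serve as a label for the whole column.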
Evaluations
Sample evaluation on record level:
● 11 portals
● 10 random CSVs per portal
● 10 random rows per dataset
● i.e. 1100 inspected values
Discussion:
● Partially incomplete knowledge
● Incomplete mapping of OSM
● Heuristics for portal-specifics would be required
Sebastian Neumaier and Axel Polleres. Enabling Spatio-Temporal Search in Open Data. Journal of Web Semantics (JWS), 2018.
Demo: Geo-entity Search “Leopoldstadt”
http://data.wu.ac.at/odgraphsearch/
Conclusions & Critical Discussions
Monitoring and analysis of Open Data portals:
The Portal Watch focuses on continuous metadata quality and archiving. A scalable, continuous profiling and archiving of the actual data is still missing.
(Re-)publish the improved dataset descriptions as Linked Data:
We developed the methods, and publish mapped and enriched metadata. Ideally, however, the users find the LD endpoints and rich descriptions already at the data portals.
Labeling and classification of numerical values:
We can assign fine-grained semantic labels if there is enough evidence in BK. However, there is missing domain knowledge for labelling OD.
Extraction of spatial and temporal information:
We annotate CSV tables and metadata at scale, with links to spatial and temporal entities, and a search and query interface. It is still open whether our approach generalises to other entities such as categories, (governmental) organisations, etc.
Impact
● PhD builds on top of my master thesis (winner of the OCG-Förderpreis by the Austrian Computer Society)
● Publications:
Monitoring and analysis of Open Data portals
➢ Automated quality assessment of metadata [OBD 2015] (best paper award), [JDIQ 2016]
➢ Measures for assessing the data freshness/up-to-dateness [OBD 2016]
➢ Comparison of metadata quality [GIQ 2018]
Evaluate applicability of existing techniques
➢ Lifting data portals to the Web of Data [LDOW 2017]
Labeling and classification of numerical values
➢ Labelling of numerical values [ISWC 2016] (nominated for best student paper)
Extraction of spatial and temporal information
➢ Geo-semantic labelling of open data [SEMANTiCS 2018]
➢ Enabling Spatio-Temporal Search in Open Data [JWS 2019]
● Project & Community Work
➢ FFG Projects on Open Data: ADEQUATE, Communidata
➢ W3C working groups: CSV on the Web, Dataset Exchange (DXWG)
➢ Open Source projects: github.com/sebneu
● Integration of dataset assessments & improvements in data.gv.at
● Re-published corpus gets harvested by Google Dataset Search
Backup Slides
Research Question & Hypothesis
Given a corpus of tabular Open Data resources, the metadata descriptions can be enriched, and therefore the data quality can be increased, by semantically analyzing Open Data CSVs, assigning semantic labels, and integrating the extracted knowledge into a graph.
How can we use existing Semantic Web technologies?
➢ Report and analysis of current OD in order to select/filter methods
How to best describe and publish datasets?
➢ Standardized W3C vocabularies and interfaces to publish Linked Data and enable integration
How to best find and assign semantic labels to datasets?
➢ Labeling of numeric data, extracting spatial & temporal information
Open Data Portal Watch
Datasets and resources of the monitored portals
Numerical Labelling
Evaluation Setup
• Data
  • DBpedia 3.9
  • 50 most frequent numerical properties
• Distance functions
  • Euclidean distance (min, max, mean, stddev)
  • distribution similarity (Kolmogorov-Smirnov (KS) distance)
• Aggregation function
  • majority vote and average distance
• Aggregation levels
  • property
  • exact type
• 30 GB RAM
• 3 different knowledge bases
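The feature-based distance and the two aggregation functions from this setup can be sketched as follows. The use of the population standard deviation, the function names and the sample neighbour list are assumptions for illustration.

```python
import math
import statistics
from collections import Counter, defaultdict


def features(values):
    """Summarise a bag of numbers as (min, max, mean, stddev)."""
    return (min(values), max(values), statistics.mean(values), statistics.pstdev(values))


def euclidean(a, b):
    """Euclidean distance between the feature vectors of two bags."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(features(a), features(b))))


def majority_vote(neighbours):
    """Pick the label that occurs most often among the nearest neighbours."""
    return Counter(label for label, _ in neighbours).most_common(1)[0][0]


def average_distance(neighbours):
    """Pick the label whose neighbours are closest on average."""
    by_label = defaultdict(list)
    for label, dist in neighbours:
        by_label[label].append(dist)
    return min(by_label, key=lambda label: sum(by_label[label]) / len(by_label[label]))


# Invented (label, distance) pairs for the three nearest neighbours of one column.
neighbours = [("population", 0.2), ("population", 0.5), ("area", 0.1)]
print(majority_vote(neighbours), average_distance(neighbours))
```

The example shows that the two aggregations can disagree: the majority vote favours the label that appears most often, whereas the average distance favours a single very close neighbour.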
Test/Training Data
• train/test split: 80/20
• 20% of the subjects for each property as test data
• test context graph: similar as background construction, however, without constraints
• randomly select leaf nodes
Evaluation: Distance Measure
• Best: Kolmogorov-Smirnov (KS) distance
• exact = correct property, type and p-o (property-object pair)
• prop = correct property
• type = correct type
• stype = correct super type
Evaluation: Large-Scale (33657 Test Nodes)
• 9% of test nodes are contained 1-1 in the knowledge graph
• aggregation
  • majority and average vote
  • different neighbours
• majority vote slightly better
• more neighbours also better
Evaluation: Open Data Tables
● labelling numerical columns
● manual inspection of top 100 tables (based on distance)
Findings
• Dealing with timeline data: values for different time points -> not in DBpedia
• Missing domain knowledge: reports about spendings, election results, tourism
• Aggregation of column scores: especially for type detection (majority vote over column types)
• Combine with complementary approaches
Related follow-up work on labeling numerical values in OD
● [Nguyen et al., 2019]: EmbNum+: Effective, Efficient, and Robust Semantic Labeling for Numerical Values (neural embedding for learning representations and similarity metric from numerical columns)
● [Kacprzak et al., 2018]: Making Sense of Numerical Data - Semantic Labelling of Web Tables
● [Alobaid et al., 2018]: Fuzzy Semantic Labeling of Semi-structured Numerical Datasets
● [Alobaid et al., 2019]: Typology-based Semantic Labeling of Numeric Tabular Data (taking into account different kinds of numeric values)
Spatio-temporal Labelling
Open Data is about Locations
Interface
Faceted query interface:
▪ Full-text queries
▪ Geo-entity queries
▪ Timespan & Time pattern
▪ SPARQL endpoint
Back end:
▪ MongoDB for efficient key look-ups
▪ ElasticSearch for indexing and full-text queries
▪ Virtuoso as a triple store