discovering related data sources in data portals
DESCRIPTION
Slides from my presentation at the 1st International Workshop on Semantic Statistics, Sydney, Oct 22, 2013
TRANSCRIPT
Discovering Related Data Sources in Data Portals
Andreas Wagner, Peter Haase, Achim Rettinger, Holger Lamm
1st International Workshop on Semantic Statistics
Sydney, Oct 22, 2013
WORLD BANK
Potential of Open (Statistics) Data
fluidOps Open Data Portal
• Data collection
  • Integration of major open data catalogs
  • Automated provisioning of 10,000s of data sets
• Portal for search and exploration of data sets
  • Rich metadata based on open standards
  • Both descriptive and structural metadata
• Integrated querying across interlinked data sets
  • Easy to use queries against multiple data sets
  • Using federation technologies
• Self-service UI
  • Custom queries and visualizations
  • Widgets, dashboarding, etc.
Finding Related Data Sets
• Many information needs require analysis of multiple data sets
  • Example: Compare and correlate GDP, population and public debt of countries over time
• Task of finding related data sets
  • Identify data sets that are similar, but complementary
  • To support queries across multiple data sets, e.g. in the form of joins and unions
• Inspiration: Finding related tables
  • Entity complement: same attributes, complementing entities
  • Schema complement: same entities, complementing attributes
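The two notions of relatedness can be illustrated with a toy sketch. It assumes a table can be reduced to a set of entities plus a set of attributes, which is a simplification made here and not the portal's data model. The `Table` class, the helper names and the 0.5 overlap threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    """Toy table: the entities (rows) and attributes (columns) it covers."""
    entities: frozenset
    attributes: frozenset


def jaccard(a, b):
    """Overlap of two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def is_entity_complement(base, other, threshold=0.5):
    """Similar attributes, and `other` contributes entities `base` lacks."""
    same_schema = jaccard(base.attributes, other.attributes) >= threshold
    return same_schema and bool(other.entities - base.entities)


def is_schema_complement(base, other, threshold=0.5):
    """Similar entities, and `other` contributes attributes `base` lacks."""
    same_entities = jaccard(base.entities, other.entities) >= threshold
    return same_entities and bool(other.attributes - base.attributes)


# Hypothetical tables, described only by what they cover (no values).
gdp_europe = Table(frozenset({"Germany", "France"}), frozenset({"year", "gdp"}))
gdp_asia = Table(frozenset({"Japan", "India"}), frozenset({"year", "gdp"}))
debt_europe = Table(frozenset({"Germany", "France"}),
                    frozenset({"year", "public_debt"}))

print(is_entity_complement(gdp_europe, gdp_asia))     # True
print(is_schema_complement(gdp_europe, debt_europe))  # True
print(is_schema_complement(gdp_europe, gdp_asia))     # False
```

Under this reading, an entity complement is a natural candidate for a union (more rows for the same columns), while a schema complement is a candidate for a join (more columns for the same rows).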
Finding Related Data Sources via Related Entities
• Data Model: Data source is a set of multiple RDF graphs
• Intuition: if data sources contain similar entities, they are somehow related
• Approach:
  1. Entity Extraction
  2. Entity Similarity
  3. Entity Clustering
[Figure: entities from Source 1, Source 2 and Source 3 grouped into Cluster 1 and Cluster 2. Are the sources related?]
Related Entities (2)
1. Entity Extraction
   – Sample over entities in data graphs in D
   – For each entity crawl its surrounding sub-graph [1]
2. Entity Similarity
   – Define dissimilarity measure between two entities based on kernel functions
   – Compare entity structure and literals via different kernels [2,3]
3. Entity Clustering
   – Apply k-means clustering to discover similar entities [4]
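Steps 2 and 3 can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each extracted entity is flattened to a set of (predicate, value) pairs, a simple set-intersection kernel stands in for the RDF graph kernels of [2], and a plain kernel k-means loop with a hand-picked starting assignment stands in for the large-scale scheme of [4]. All names and sample entities are hypothetical.

```python
import math


def kernel(x, y):
    """Toy kernel: number of shared (predicate, value) pairs.

    This is the dot product of the two indicator vectors, so it is a valid
    (positive semi-definite) kernel.
    """
    return len(x & y)


def dissimilarity(x, y):
    """Distance induced by the kernel in its feature space."""
    squared = kernel(x, x) - 2 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(squared, 0))


def kernel_kmeans(items, assignment, k, max_iter=20):
    """Kernel k-means, starting from a given cluster index per item."""
    n = len(items)
    gram = [[kernel(a, b) for b in items] for a in items]
    for _ in range(max_iter):
        clusters = [[i for i in range(n) if assignment[i] == c] for c in range(k)]
        new_assignment = []
        for i in range(n):
            best, best_dist = assignment[i], float("inf")
            for c, members in enumerate(clusters):
                if not members:
                    continue  # an empty cluster has no centroid
                m = len(members)
                # Squared feature-space distance from item i to the centroid of c.
                dist = (gram[i][i]
                        - 2 * sum(gram[i][j] for j in members) / m
                        + sum(gram[j][l] for j in members for l in members) / m ** 2)
                if dist < best_dist:
                    best, best_dist = c, dist
            new_assignment.append(best)
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
    return assignment


# Hypothetical entities, each with the source it was sampled from.
entities = [
    frozenset({("type", "Country"), ("indicator", "GDP"), ("unit", "USD")}),
    frozenset({("type", "Country"), ("indicator", "GDP"), ("unit", "EUR")}),
    frozenset({("type", "City"), ("indicator", "Population"), ("unit", "persons")}),
    frozenset({("type", "City"), ("indicator", "Population"), ("unit", "thousands")}),
]
source_of = ["Source 1", "Source 2", "Source 2", "Source 3"]

labels = kernel_kmeans(entities, assignment=[0, 0, 0, 1], k=2)

# Sources whose entities fall into the same cluster are candidates for "related".
sources_per_cluster = {}
for label, source in zip(labels, source_of):
    sources_per_cluster.setdefault(label, set()).add(source)

print(labels)  # [0, 0, 1, 1]
```

Like ordinary k-means, the loop only finds a local optimum, so the result depends on the starting assignment; a real setup would need a more careful initialization.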
Contextualization Score
• Contextualization score for data source D’’ given D’: ec(D’’|D’) and sc(D’’|D’)
• Entity complement score
• Schema complement score
Search for Gross Domestic Product
Querying the Data Set
Visualizing the Results
Queries Across Related Data Sets
• Query for GDP of Germany
• Union of results from
  • Worldbank: GDP (current US$) (up to 2010)
  • Eurostat: GDP at Market Prices (including projected values until 2014)
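A union of this kind can be sketched as follows. The function name is hypothetical and the numbers are placeholders chosen for illustration, not real GDP figures. The sketch assumes each source returns a mapping from year to value, and it keeps the source name on every row so that overlapping years stay distinguishable.

```python
def union_series(*named_series):
    """Union observations from several sources, keeping provenance.

    Each argument is a (source_name, {year: value}) pair. Years covered by
    more than one source yield one row per source instead of being merged.
    """
    rows = []
    for source, series in named_series:
        for year, value in series.items():
            rows.append((year, source, value))
    return sorted(rows)


# Placeholder values only, NOT real statistics.
worldbank = {2009: 1.0, 2010: 1.1}             # series ends in 2010
eurostat = {2010: 1.2, 2011: 1.3, 2014: 1.5}   # includes projected years

rows = union_series(("Worldbank", worldbank), ("Eurostat", eurostat))
years = sorted({year for year, _, _ in rows})

for row in rows:
    print(row)
```

Keeping one row per source and year, rather than picking a single value, leaves it to the visualization to decide how to present overlaps. The two indicators may also be reported in different units, which a real query would have to account for.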
Queries Across Related Data Sets
[Chart: data from Eurostat and data from Worldbank]
Summary and Outlook
• Techniques for finding related data sets
  – Based on finding related entities
• Implementation available in open data portal
• Outlook
  – Finding relevant related data sources for a given information need
  – End user interfaces for formulating queries across data sets (see Optique project)
  – Operators for combining data cubes
  – Interactive visualization and exploration of combined data cubes (see OpenCube project)
References
[1] G. A. Grimnes, P. Edwards, and A. Preece. Instance based clustering of semantic web resources. In ESWC, 2008.
[2] U. Lösch, S. Bloehdorn, and A. Rettinger. Graph kernels for RDF data. In ESWC, 2012.
[3] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. 2004.
[4] R. Zhang and A. Rudnicky. A large scale clustering scheme for kernel k-means. In Pattern Recognition, 2002.