Discovering Related Data Sources in Data Portals

Discovering Related Data Sources in Data Portals. Andreas Wagner, Peter Haase, Achim Rettinger, Holger Lamm. 1st International Workshop on Semantic Statistics, Sydney, Oct 22, 2013

Upload: peter-haase

Post on 27-Jan-2015


DESCRIPTION

Slides from my presentation at the 1st International Workshop on Semantic Statistics, Sydney, Oct 22, 2013.

TRANSCRIPT

Page 1: Discovering Related Data Sources in Data Portals

Discovering Related Data Sources in Data Portals

Andreas Wagner, Peter Haase, Achim Rettinger, Holger Lamm

1st International Workshop on Semantic Statistics

Sydney, Oct 22, 2013

Page 2: Discovering Related Data Sources in Data Portals

WORLD BANK

Potential of Open (Statistics) Data

Page 3: Discovering Related Data Sources in Data Portals

WORLD BANK

fluidOps Open Data Portal

• Data collection
  – Integration of major open data catalogs
  – Automated provisioning of tens of thousands of data sets

• Portal for search and exploration of data sets
  – Rich metadata based on open standards
  – Both descriptive and structural metadata

• Integrated querying across interlinked data sets
  – Easy-to-use queries against multiple data sets
  – Using federation technologies

• Self-service UI
  – Custom queries and visualizations
  – Widgets, dashboarding, etc.

Page 4: Discovering Related Data Sources in Data Portals
Page 5: Discovering Related Data Sources in Data Portals

Finding Related Data Sets

• Many information needs require analysis of multiple data sets

• Example: Compare and correlate GDP, population and public debt of countries over time

• Task of finding related data sets
  – Identify data sets that are similar, but complementary
  – To support queries across multiple data sets, e.g. in the form of joins and unions

• Inspiration: Finding related tables
  – Entity complement: same attributes, complementing entities
  – Schema complement: same entities, complementing attributes
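The two kinds of complement can be illustrated with a small sketch. The tables, keys and values below are hypothetical placeholders, not data from the portal. An entity complement shares the attributes and is combined by a union. A schema complement shares the entities and is combined by a join on the entity key.

```python
# Hypothetical toy tables keyed by entity; all values are placeholders.
gdp_europe = {"DE": {"gdp": 1.0}, "FR": {"gdp": 2.0}}
# Entity complement of gdp_europe: same attribute, other entities.
gdp_asia = {"JP": {"gdp": 3.0}, "IN": {"gdp": 4.0}}
# Schema complement of gdp_europe: same entities, other attribute.
population_europe = {"DE": {"population": 10}, "FR": {"population": 20}}


def union_tables(a, b):
    """Entity complement: attributes match, so the rows are unioned."""
    merged = dict(a)
    merged.update(b)
    return merged


def join_tables(a, b):
    """Schema complement: entities match, so rows are joined on the key."""
    return {key: {**a[key], **b[key]} for key in a.keys() & b.keys()}


all_gdp = union_tables(gdp_europe, gdp_asia)
europe_profile = join_tables(gdp_europe, population_europe)
```

Here `all_gdp` covers four entities with one attribute, while `europe_profile` covers two entities with two attributes each.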

Page 6: Discovering Related Data Sources in Data Portals

Finding Related Data Sources via Related Entities

• Data Model: Data source is a set of multiple RDF graphs

• Intuition: if data sources contain similar entities, they are somehow related

• Approach:
  1. Entity Extraction
  2. Entity Similarity
  3. Entity Clustering

[Figure: entities from Source 1, Source 2 and Source 3 grouped into Cluster 1 and Cluster 2; label "Related?!"]

Page 7: Discovering Related Data Sources in Data Portals

Related Entities (2)

1. Entity Extraction
  – Sample over entities in data graphs in D
  – For each entity crawl its surrounding sub-graph [1]

2. Entity Similarity
  – Define dissimilarity measure between two entities based on kernel functions
  – Compare entity structure and literals via different kernels [2,3]

3. Entity Clustering
  – Apply k-means clustering to discover similar entities [4]
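The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind the slides: the triples are hypothetical, the sub-graph crawl is limited to depth 1, the two set-intersection kernels are simple stand-ins for the graph kernels of [2], and the clustering is plain kernel k-means rather than the large-scale scheme of [4].

```python
from collections import defaultdict

# Hypothetical toy triples (subject, predicate, object); not portal data.
TRIPLES = [
    ("wb:gdp_de", "ex:indicator", "gross domestic product"),
    ("wb:gdp_de", "ex:country", "germany"),
    ("es:gdp_de", "ex:indicator", "gross domestic product at market prices"),
    ("es:gdp_de", "ex:country", "germany"),
    ("wb:pop_fr", "ex:measure", "total population"),
    ("wb:pop_fr", "ex:area", "france"),
]


def extract_entities(triples):
    """Step 1: describe each subject by its depth-1 neighbourhood."""
    entities = defaultdict(lambda: {"predicates": set(), "tokens": set()})
    for subject, predicate, obj in triples:
        entities[subject]["predicates"].add(predicate)
        entities[subject]["tokens"].update(obj.split())
    return dict(entities)


def kernel(a, b):
    """Step 2: structure kernel plus literal kernel (set intersections)."""
    return len(a["predicates"] & b["predicates"]) + len(a["tokens"] & b["tokens"])


def kernel_kmeans(entities, k, iterations=10):
    """Step 3: plain kernel k-means on precomputed kernel values."""
    names = sorted(entities)
    gram = {(x, y): kernel(entities[x], entities[y]) for x in names for y in names}
    # Deterministic initialisation: round-robin assignment to clusters.
    assignment = {name: i % k for i, name in enumerate(names)}
    for _ in range(iterations):
        clusters = defaultdict(list)
        for name, c in assignment.items():
            clusters[c].append(name)

        def distance(x, members):
            # Squared feature-space distance between x and the cluster mean.
            n = len(members)
            cross = sum(gram[x, y] for y in members) / n
            within = sum(gram[y, z] for y in members for z in members) / (n * n)
            return gram[x, x] - 2 * cross + within

        new_assignment = {
            x: min(clusters, key=lambda c: (distance(x, clusters[c]), c))
            for x in names
        }
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment


assignment = kernel_kmeans(extract_entities(TRIPLES), k=2)
```

On this toy input the two GDP entities, which come from different (hypothetical) sources, land in the same cluster, while the population entity ends up in the other one. Entities from different sources sharing a cluster is the signal that the sources may be related.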

Page 8: Discovering Related Data Sources in Data Portals

Contextualization Score

• Contextualization score for data source D’’ given D’: ec(D’’|D’) and sc(D’’|D’)

• Entity complement score

• Schema complement score

Page 9: Discovering Related Data Sources in Data Portals
Page 10: Discovering Related Data Sources in Data Portals

Search for Gross Domestic Product

Page 11: Discovering Related Data Sources in Data Portals
Page 12: Discovering Related Data Sources in Data Portals

Querying the Data Set

Page 13: Discovering Related Data Sources in Data Portals

Visualizing the Results

Page 14: Discovering Related Data Sources in Data Portals

Queries Across Related Data Sets

• Query for GDP of Germany

• Union of results from
  – Worldbank: GDP (current US$) (up to 2010)
  – Eurostat: GDP at Market Prices (including projected values until 2014)
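A union over two sources whose time ranges overlap might look roughly like the sketch below. The observations are dummy placeholders, not real GDP figures, and `union_observations` is a hypothetical helper. Each row keeps its source, because the two indicators are defined differently and should not be silently merged for years covered by both.

```python
# Placeholder observations (year -> value); dummy numbers, not real GDP data.
worldbank_gdp = {2009: 1.0, 2010: 2.0}
eurostat_gdp = {2010: 3.0, 2011: 4.0, 2014: 5.0}


def union_observations(**sources):
    """Union observations from several sources, tagging each row with its source."""
    rows = [
        (year, value, source)
        for source, series in sources.items()
        for year, value in series.items()
    ]
    # Order by year, then by source name, so overlapping years sit side by side.
    return sorted(rows, key=lambda row: (row[0], row[2]))


rows = union_observations(worldbank=worldbank_gdp, eurostat=eurostat_gdp)
```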

Page 15: Discovering Related Data Sources in Data Portals

Queries Across Related Data Sets

[Figure labels: Data from Eurostat; Data from Worldbank]

Page 16: Discovering Related Data Sources in Data Portals

Summary and Outlook

• Techniques for finding related data sets
  – Based on finding related entities

• Implementation available in open data portal

• Outlook
  – Finding relevant related data sources for a given information need
  – End user interfaces for formulating queries across data sets (see Optique project)
  – Operators for combining data cubes
  – Interactive visualization and exploration of combined data cubes (see OpenCube project)

Page 17: Discovering Related Data Sources in Data Portals

References

[1] G. A. Grimnes, P. Edwards, and A. Preece. Instance based clustering of semantic web resources. In ESWC, 2008.

[2] U. Lösch, S. Bloehdorn, and A. Rettinger. Graph kernels for RDF data. In ESWC, 2012.

[3] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. 2004.

[4] R. Zhang and A. Rudnicky. A large scale clustering scheme for kernel k-means. In Pattern Recognition, 2002.