Albert Meroño-Peñuela: Understanding Change in Versioned Web-Knowledge Organisation Systems (KOS)



Page 1: Albert Merono-Penuela: Understanding Change in Versioned Web-Knowledge Organisation Systems (KOS)

Understanding Change in Versioned KOS on the Web

Albert Meroño-Peñuela, Christophe Guéret, Stefan Schlobach

@albertmeronyo

Evolution and variation of classification systems – KnoweScape workshop

04-03-2015

Page 2

CEDAR: Harmonizing Historical Census Data in the Semantic Web

Page 3

CEDAR: Harmonizing Historical Census Data in the Semantic Web

Page 4

CEDAR: Source Historical Data

Dutch Historical Censuses (1795-1971) [Public Historical Statistical Data]

Page 5

From scans to spreadsheets

Page 6

Uniform queries on the Web

1795 1830 1840 1849 1859 1869 1879 1889 1899 1909 1919 1920 1930 1947 1956 1960 1971

(through ~3K heterogeneous tables)

Page 7

RDF Data Cube

“There are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that they can be linked to related data sets and concepts.”

Page 8

Page 9

Page 10

RDF Data Cube vocabulary (QB)

•  SDMX compatible
•  Defines cubes as a set of observations that consist of dimensions, measures and attributes
   – Dimensions: time period, region, sex (qb:DimensionProperty)
   – Measure: population life expectancy (qb:MeasureProperty)
   – Attribute: unit of measure = years, metadata status = measured (qb:AttributeProperty)

Observation: “the measured life expectancy of males in Newport in the period 2004-2006 is 76.7 years”
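The observation quoted above can be written down as data. The sketch below is only an illustration in plain Python, not the CEDAR implementation: it emits Turtle text for one qb:Observation. The qb: namespace is the one defined by the RDF Data Cube vocabulary; the ex: namespace, the property names (ex:refPeriod, ex:refArea, ex:sex, ex:lifeExpectancy, ex:unitMeasure) and the helper observation_to_turtle are hypothetical names chosen for this example.

```python
# Minimal sketch: serialise one RDF Data Cube observation as Turtle text.
# Only the qb: namespace is real; everything under ex: is a made-up example.

QB = "http://purl.org/linked-data/cube#"
EX = "http://example.org/ns#"  # hypothetical namespace


def observation_to_turtle(obs_id, dimensions, measures, attributes):
    """Return Turtle for a qb:Observation with the given components."""
    lines = [
        f"@prefix qb: <{QB}> .",
        f"@prefix ex: <{EX}> .",
        "",
        f"ex:{obs_id} a qb:Observation ;",
    ]
    # Dimensions, measures and attributes are all attached as plain properties.
    pairs = list(dimensions.items()) + list(measures.items()) + list(attributes.items())
    for i, (prop, value) in enumerate(pairs):
        end = " ." if i == len(pairs) - 1 else " ;"
        lines.append(f"    ex:{prop} {value}{end}")
    return "\n".join(lines)


turtle = observation_to_turtle(
    "obs1",
    dimensions={"refPeriod": '"2004-2006"', "refArea": "ex:Newport", "sex": "ex:male"},
    measures={"lifeExpectancy": "76.7"},
    attributes={"unitMeasure": "ex:years"},
)
print(turtle)
```

In a real cube each ex: property would additionally be declared as a qb:DimensionProperty, qb:MeasureProperty or qb:AttributeProperty in the data structure definition; that part is left out here to keep the sketch short.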

Page 11

Dynamic Classifications

•  Gemeentegeschiedenis.nl

Page 12

Dynamic Classifications

http://lod.cedar-project.nl/maps/ (kudos to Richard Zijdeman)

Page 13

Dynamic Classifications

•  HISCO

http://historyofwork.iisg.nl/

Page 14

LSD Dimensions

http://lsd-dimensions.org/
https://github.com/albertmeronyo/LSD-Dimensions

Daily JSON-LD dumps

Page 15

http://lsd-dimensions.org/

Page 16

Concept Drift

Census classification of occupations as of 1859

•  Root node is void
•  Depth 1: occupation groups
•  Leaves: actual occupations

Page 17

Concept Drift

Census classification of occupations as of 1889

•  Root node is void
•  Depth 1: occupation groups
•  Leaves: actual occupations

Page 18

Concept Drift

Census classification of occupations as of 1899

•  Root node is void
•  Depth 1: occupation groups
•  Leaves: actual occupations

Page 19

Concept Drift

Upper ontologies (HISCO, AC)

Year-dependent ontologies

1859 1869 1879

Page 20

Concept Drift

Upper ontologies (HISCO, AC)

Year-dependent ontologies

Page 21

Concept Drift

Upper ontologies (HISCO, AC)

Year-dependent ontologies

? ?

Page 22

Predicting Change

•  KOS version chains: subsequent unique version identifiers to unique states of KOS
•  Problematic for
   – Data publishers (KOS maintainability)
   – Data users/linkers (link validity)

A. Meroño-Peñuela, C. Guéret, S. Schlobach. Predicting Change in Versioned Knowledge Organisation Systems on the Web. IJCAI 2015 (under review)

Page 23

Predicting Change

•  Proposal: generic approach to predict when and where a Web KOS of any domain will change
   – Using supervised learning on past versions of KOS
•  SotA¹: prediction of class extension in
   – 1 OBO/OWL version chain (Gene Ontology)
   – using few classifiers
•  Contribution²: prediction of concept drift in
   – 150 Web KOS version chains
   – using all (21) SotA classifiers (WEKA API)

¹ C. Pesquita, F.M. Couto. “Predicting the extension of biomedical ontologies”. PLoS Computational Biology 8 (9), e1002630

² A. Meroño-Peñuela, C. Guéret, S. Schlobach. “Predicting Change in Versioned Knowledge Organisation Systems on the Web”. IJCAI 2015 (under review)

Page 24

Concept Drift

•  Proxy for change of meaning over time¹
   – Intension drift: occurs when there is a difference in the properties or attributes of two variants of the same concept
   – Extension drift: occurs when there is a difference in the individuals that belong to two variants of the same concept
   – Label drift: occurs when there is a difference in the labels of two variants of the same concept

¹ S. Wang, S. Schlobach, M. Klein. “What Is Concept Drift and How to Measure It?”. EKAW 2010.
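The three drift definitions above can be illustrated with a small comparison of two variants of one concept. This is a simplified sketch: any difference at all counts as drift here, whereas a real measure would more likely use a similarity threshold. The class ConceptVariant, the function drift_types and the sample data are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ConceptVariant:
    """One variant (version) of a concept; field names are illustrative."""
    labels: frozenset = field(default_factory=frozenset)
    properties: frozenset = field(default_factory=frozenset)   # intension
    individuals: frozenset = field(default_factory=frozenset)  # extension


def drift_types(old, new):
    """Return the kinds of drift observed between two variants."""
    kinds = set()
    if old.properties != new.properties:
        kinds.add("intension")
    if old.individuals != new.individuals:
        kinds.add("extension")
    if old.labels != new.labels:
        kinds.add("label")
    return kinds


# Made-up example: same label and properties, different members.
v1859 = ConceptVariant(
    labels=frozenset({"blacksmith"}),
    properties=frozenset({"broader: metal workers"}),
    individuals=frozenset({"person1", "person2"}),
)
v1889 = ConceptVariant(
    labels=frozenset({"blacksmith"}),
    properties=frozenset({"broader: metal workers"}),
    individuals=frozenset({"person1", "person3"}),
)
print(drift_types(v1859, v1889))
```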

Page 25

Input Datasets

KOS version chains from
•  HISCO/CEDAR (1 version chain)
•  DBpedia (2 version chains)
•  Linked Open Vocabularies¹ (134 version chains)
•  *Ontology chains from 637 SPARQL endpoints² (6 version chains)

¹ http://lov.okfn.org/
² https://github.com/albertmeronyo/ConceptDrift-data/tree/master/src

Page 26

Features

•  From which data characteristics (related to change) should we learn?
•  SotA in Ontology Change [Stojanovic 2004]
   – Structure-driven (rdfs:subClassOf, skos:broader): maxDepth, children, parents, siblings
   – Data-driven (rdf:type): members, childMembers, parentMembers, siblingMembers
   – Usage-driven: incExtLinks (on the Web!)
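A rough sketch of how the structure-driven and data-driven features listed above could be computed for one concept follows. The slides do not define the features precisely, so the readings used here are assumptions: maxDepth is taken as the longest path from the concept down to a leaf, and the *Members features count direct instances only. The function names and the toy classification are hypothetical; incExtLinks is omitted because it needs link data from the Web.

```python
from collections import defaultdict


def build_children(broader):
    """Invert a concept -> parents mapping into parent -> children."""
    children = defaultdict(set)
    for concept, parents in broader.items():
        for parent in parents:
            children[parent].add(concept)
    return children


def max_depth(concept, children):
    """Longest path (in edges) from concept down to a leaf; assumes a tree."""
    kids = children.get(concept, ())
    return 0 if not kids else 1 + max(max_depth(k, children) for k in kids)


def features(concept, broader, members):
    """Structure- and data-driven features for one concept."""
    children = build_children(broader)
    kids = children.get(concept, set())
    parents = broader.get(concept, set())
    siblings = set().union(*(children[p] for p in parents)) - {concept}

    def n_members(concepts):
        return sum(len(members.get(c, ())) for c in concepts)

    return {
        "maxDepth": max_depth(concept, children),
        "children": len(kids),
        "parents": len(parents),
        "siblings": len(siblings),
        "members": n_members([concept]),
        "childMembers": n_members(kids),
        "parentMembers": n_members(parents),
        "siblingMembers": n_members(siblings),
    }


# Toy classification: void root, occupation groups, occupations as leaves.
broader = {
    "agriculture": {"root"}, "industry": {"root"},
    "farmer": {"agriculture"}, "shepherd": {"agriculture"}, "smith": {"industry"},
}
members = {"farmer": {"a", "b", "c"}, "shepherd": {"d"}, "smith": {"e", "f"}}
print(features("agriculture", broader, members))
```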

Page 27

Pipeline

https://github.com/albertmeronyo/ConceptDrift

Page 28

Evaluation

•  Use a subset of past versions for learning (Vt)
•  Check whether change happened by observing Vr, Ve
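One possible reading of this evaluation setup is sketched below; the slide does not spell out the roles of Vr and Ve, so this is an assumption. The first versions of a chain form the training set Vt, the next version Vr supplies the "changed" labels for training, and the version after that, Ve, supplies the labels used for testing. The functions split_chain and changed_concepts and the sample chain are hypothetical, and "change" is reduced to a difference in a concept's member set.

```python
def changed_concepts(version_a, version_b):
    """Concepts of version_a whose member set differs (or is gone) in version_b."""
    return {c for c, m in version_a.items() if version_b.get(c) != m}


def split_chain(chain, n_train):
    """Split an ordered version chain into (Vt, Vr, Ve)."""
    if len(chain) < n_train + 2:
        raise ValueError("need at least n_train + 2 versions")
    return chain[:n_train], chain[n_train], chain[n_train + 1]


# Made-up chain of four versions; each maps concept -> set of members.
chain = [
    {"farmer": {"a"}, "smith": {"b"}},
    {"farmer": {"a"}, "smith": {"b"}},
    {"farmer": {"a", "c"}, "smith": {"b"}},
    {"farmer": {"a", "c"}, "smith": {"b", "d"}},
]

vt, vr, ve = split_chain(chain, n_train=2)
train_labels = changed_concepts(vt[-1], vr)  # positives for learning
eval_labels = changed_concepts(vr, ve)       # positives for testing
print(train_labels, eval_labels)
```

A classifier trained on features of the Vt concepts with train_labels as targets would then be scored against eval_labels.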

Page 29

Results – classifier performance

CEDAR/HISCO classification performance over time

DBpedia ontology classification performance over time

Page 30

Results – understanding performance

Relationship between characteristics of input version chains and selected classifiers / performance?
•  totalSize
•  nSnapshots
•  avgGap
•  avgTreeDepth
•  ratioInstances
•  ratioStructural
•  ratioInserts
•  ratioDeletes
•  ratioComm

f(xi)?
•  roc
•  classifier

Page 31

Page 32

Page 33

Classifier Selection

Table 1:

Dependent variable:

                  functions  rules      trees      functions  rules      trees      functions  rules      trees
                  (1)        (2)        (3)        (4)        (5)        (6)        (7)        (8)        (9)

log(nSnapshots)   -0.291     -0.257     1.975      -0.180     -0.239     1.745      -0.193     -0.212     1.838
                  (0.656)    (0.765)    (1.503)    (0.680)    (0.790)    (1.512)    (0.667)    (0.777)    (1.497)

log(avgGap)       0.238      0.145      1.385*     0.266      0.173      1.269*     0.248      0.161      1.351
                  (0.242)    (0.271)    (0.734)    (0.240)    (0.269)    (0.703)    (0.240)    (0.270)    (0.729)

log(totalSize)    0.669***   0.539*     -0.052     0.636**    0.531*     -0.010     0.641***   0.524*     -0.025
                  (0.249)    (0.278)    (0.563)    (0.251)    (0.282)    (0.555)    (0.249)    (0.279)    (0.557)

avgTreeDepth      -0.399     -0.334     0.534      -0.393     -0.336     0.564      -0.385     -0.323     0.553
                  (0.302)    (0.330)    (0.719)    (0.304)    (0.334)    (0.728)    (0.303)    (0.332)    (0.728)

ratioInstances    1.378      2.463      3.090      1.071      2.246      3.394      1.269      2.330      3.221
                  (3.485)    (4.021)    (6.654)    (3.455)    (3.981)    (6.629)    (3.476)    (4.005)    (6.649)

ratioStructural   -9.054     1.357      -9.539     -9.039     1.674      -10.799    -9.594     1.116      -10.030
                  (6.040)    (6.135)    (13.505)   (6.142)    (6.353)    (13.945)   (6.136)    (6.267)    (13.827)

ratioInserts      3.006      2.376      -3.540
                  (1.906)    (2.210)    (4.401)

ratioDeletes                                       1.918      0.929      -2.341
                                                   (1.907)    (2.154)    (4.058)

ratioComm                                                                           -1.440     -0.945     1.615
                                                                                    (1.028)    (1.170)    (2.219)

Constant          -5.610**   -5.580**   -12.702**  -5.288**   -5.259**   -12.402**  -4.059*    -4.494*    -14.266**
                  (2.248)    (2.511)    (5.954)    (2.210)    (2.494)    (5.759)    (2.265)    (2.585)    (6.511)

Akaike Inf. Crit. 313.543    313.543    313.543    316.179    316.179    316.179    314.605    314.605    314.605

Note: *p<0.1; **p<0.05; ***p<0.01
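The significance stars in Table 1 can be sanity-checked from each coefficient and the standard error printed in parentheses under it. The sketch below assumes the stars come from a two-sided Wald z-test under a normal approximation; the slides do not state which test was used. The function names are hypothetical, and the inputs are the rounded values from the table.

```python
import math


def two_sided_p(coef, se):
    """Two-sided p-value of a Wald z-test under a normal approximation."""
    z = abs(coef / se)
    return math.erfc(z / math.sqrt(2))


def stars(p):
    """Map a p-value to the star notation in the table note."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.1:
        return "*"
    return ""


# (row, column, coefficient, standard error) copied from Table 1
cells = [
    ("log(totalSize)", 1, 0.669, 0.249),
    ("Constant", 1, -5.610, 2.248),
    ("log(avgGap)", 3, 1.385, 0.734),
    ("log(nSnapshots)", 1, -0.291, 0.656),
]
for name, col, coef, se in cells:
    p = two_sided_p(coef, se)
    print(f"{name} ({col}): p = {p:.4f} {stars(p)}")
```

Because the table values are rounded to three decimals, cells whose p-value sits very close to a threshold may not reproduce exactly.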

Page 34

Simulation of avgGap vs. Classifier Family Selection

Page 35

Conclusions

•  Semantic technology for Social History
   – It saved work!
•  Historical datasets as an observatory of dynamic KOS
   – Logging usage of KOS in Linked Statistical Data
•  Modeling change in Web KOS
   – Version chains are scarce (beware of bias)
   – Chain recipe: nSnapshots, avgTreeDepth, ratioStructural, ratioInserts, ratioComm
   – Classifier dependence: avgGap, totalSize

Page 36

Thank you

Questions, suggestions, comments most welcome

@albertmeronyo

https://github.com/albertmeronyo/ConceptDrift

http://www.cedar-project.nl http://krr.cs.vu.nl/

http://easy.dans.knaw.nl/ http://lsd-dimensions.org/

Page 37

Me in 6 tweets
http://www.albertmeronyo.org

•  Background: Computer Science, Web hacker, AI & Law
•  PhD candidate at the VU University Amsterdam, DANS, and eHumanities group (KNAW)
•  Topic: Semantic Web for the Humanities
•  CEDAR project (2012-2015): harmonized historical Dutch censuses in the Semantic Web
•  Problem: statistical data publishing, concept drift and dynamics of meaning
•  Last paper: What is Linked Historical Data? (EKAW 2014)