spimbench: a scalable, schema-aware instance matching benchmark for the semantic publishing domain

SPIMBENCH: A Scalable, Schema-Aware Instance Matching Benchmark for the Semantic Publishing Domain

T. Saveta (1), E. Daskalaki (1), G. Flouris (1), I. Fundulaki (1), M. Herschel (2), A.-C. Ngonga Ngomo (3)

(1) FORTH-ICS, (2) University of Stuttgart, (3) University of Leipzig

Semantic Publishing Instance Matching Benchmark (SPIMBENCH)

Instance Matching in Linked Data

Data acquisition

Data evolution

Data integration

Open/social data

How can we automatically recognize multiple mentions of the same entity across or within sources?

Instance Matching


Benchmarking

Instance matching research has led to the development of various systems and algorithms.

How to compare these?

How can we assess their performance?

How can we push the systems to get better?

These systems need to be benchmarked


SPIMBENCH

• Based on Semantic Publishing Benchmark (SPB) of Linked Data Benchmark Council (LDBC)

• Synthetic benchmark for the Semantic Publishing Domain

• Value-based, structure-based and semantics-aware transformations [FMN+11, FLM08]

• Deterministic, scalable data generation on the order of billions of triples

• Weighted gold standard


Instance Matching Benchmark Ingredients [FLM08]

Benchmark

Datasets

Gold Standard

Test Cases

Metrics
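The slides do not say which metrics are used. A common choice for instance matching is precision, recall and F-measure against the gold standard. A minimal sketch under that assumption, with matches represented as unordered pairs of URIs (the function name and the ex: URIs are hypothetical):

```python
def precision_recall_f1(found, gold):
    """Compare reported matches against a gold standard.

    Both arguments are iterables of (uri_a, uri_b) pairs; pair order is ignored.
    """
    found_set = {frozenset(pair) for pair in found}
    gold_set = {frozenset(pair) for pair in gold}
    true_positives = len(found_set & gold_set)
    precision = true_positives / len(found_set) if found_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example data: one correct match, one spurious, one missed.
gold = [("ex:a1", "ex:b1"), ("ex:a2", "ex:b2")]
found = [("ex:b1", "ex:a1"), ("ex:a3", "ex:b3")]
print(precision_recall_f1(found, gold))
```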


SPIMBENCH Model


Value & Structure Based Transformations

Value: Mainly typographical errors and the use of different data formats [FMN+11].

– Blank Character Addition/Deletion

– Random Character Addition/Deletion/Modification

– Token Addition/Deletion/Shuffle

– Multi-linguality (65 supported languages)

– Date Format

– Change Number

– Synonym/Antonym

– Abbreviation

– Stem of a Word

Structure: Changes that occur to the properties.

– Property Addition/Deletion

– Property Aggregation/Extraction
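A few of the value-based transformations listed above can be sketched as small string functions. This is an illustration only, not the SPIMBENCH generator: the function names are hypothetical, and a seeded random generator stands in for the deterministic generation mentioned earlier.

```python
import random


def add_blank_characters(value, rng, count=1):
    """Insert `count` extra spaces at random positions."""
    chars = list(value)
    for _ in range(count):
        chars.insert(rng.randint(0, len(chars)), " ")
    return "".join(chars)


def delete_random_character(value, rng):
    """Remove one character chosen at random (no-op on an empty string)."""
    if not value:
        return value
    i = rng.randrange(len(value))
    return value[:i] + value[i + 1:]


def shuffle_tokens(value, rng):
    """Reorder whitespace-separated tokens."""
    tokens = value.split()
    rng.shuffle(tokens)
    return " ".join(tokens)


rng = random.Random(42)  # fixed seed, so repeated runs give the same output
title = "Semantic Publishing Benchmark"
print(add_blank_characters(title, rng))
print(delete_random_character(title, rng))
print(shuffle_tokens(title, rng))
```

Passing the generator explicitly keeps each transformation reproducible for a given seed.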


Semantics-Aware Transformations

Test if matching systems consider schema information to discover instance matches.

• Instance (in)equality constructs: owl:sameAs, owl:differentFrom

• Equivalence classes, properties: owl:equivalentClass, owl:equivalentProperty

• Disjointness classes, properties: owl:disjointWith, owl:propertyDisjointWith

• RDFS hierarchies: rdfs:subClassOf, rdfs:subPropertyOf

• Property constraints: owl:FunctionalProperty, owl:InverseFunctionalProperty

• Complex class definitions: owl:unionOf, owl:intersectionOf
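To illustrate why schema information matters: under OWL semantics, two instances that share a value for an owl:InverseFunctionalProperty denote the same individual, even if their other values differ. A minimal sketch with triples as plain tuples. The ex: names and the function are hypothetical, and this is not how SPIMBENCH or any particular matcher implements reasoning.

```python
from collections import defaultdict
from itertools import combinations


def matches_from_inverse_functional(triples, inverse_functional_props):
    """Return pairs of subjects sharing a value for an inverse functional property."""
    subjects_by_key = defaultdict(set)
    for subject, prop, value in triples:
        if prop in inverse_functional_props:
            subjects_by_key[(prop, value)].add(subject)
    matches = set()
    for subjects in subjects_by_key.values():
        # Every pair of subjects behind the same (property, value) key is a match.
        for a, b in combinations(sorted(subjects), 2):
            matches.add((a, b))
    return matches


# Hypothetical data: ex:isbn is declared inverse functional, ex:title is not.
triples = [
    ("ex:work1", "ex:isbn", "123"),
    ("ex:work2", "ex:isbn", "123"),
    ("ex:work3", "ex:isbn", "456"),
    ("ex:work1", "ex:title", "A title"),
    ("ex:work3", "ex:title", "A title"),
]
print(matches_from_inverse_functional(triples, {"ex:isbn"}))
```

Here ex:work1 and ex:work3 share a title but are not reported, because only the inverse functional property licenses the inference.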


SPIMBENCH Model


Weighted Gold Standard

• Detailed GS for debugging purposes

• Final GS: contains only the URIs that we consider a match, together with their similarity

[Figure: schema of the weighted gold standard. Labels recoverable from the diagram:]

Classes: owl:Thing, spimbench:Match, spimbench:Transformation, spimbench:ValueTransf, spimbench:StructureTransf, spimbench:SemanticsAwareTransf

Transformation nodes: spimbench:VT1 … spimbench:VTi, spimbench:ST1 … spimbench:STi, spimbench:SAT1 … spimbench:SATi

Properties: spimbench:source, spimbench:target, spimbench:weight (xsd:string), spimbench:onProperty (rdf:Property), spimbench:transformation

Edge types: rdfs:subPropertyOf, rdfs:subClassOf, rdf:type
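The relationship between the detailed and the final gold standard can be sketched as a record type plus a projection. This is a sketch under assumptions: the class, field and function names are hypothetical, the weight is assumed to play the role of the similarity, and the example values are placeholders rather than real SPIMBENCH output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetailedMatch:
    source: str          # URI of the original instance
    target: str          # URI of the transformed instance
    weight: str          # kept as a string, following xsd:string in the figure
    transformation: str  # placeholder identifier such as "spimbench:VT1"
    on_property: str     # property the transformation was applied to


def final_gold_standard(detailed):
    """Keep only the matched URIs and their weight, dropping debugging details."""
    return [(m.source, m.target, m.weight) for m in detailed]


# Placeholder entries for illustration.
detailed = [
    DetailedMatch("ex:a1", "ex:b1", "0.9", "spimbench:VT1", "ex:title"),
    DetailedMatch("ex:a2", "ex:b2", "0.4", "spimbench:ST1", "ex:date"),
]
print(final_gold_standard(detailed))
```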


Scalability Experiments (1/2)

• Scalability experiments for datasets up to 500M triples

• 1000 triples ~ 36 entities

• Data generation together with data transformation is linear in the number of triples

• Transformation overhead is negligible for value-based, structure-based, semantics-aware and simple combinations

• Overhead for complex combinations is one order of magnitude higher
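As a rough worked example of the stated ratio (1000 triples ~ 36 entities), the number of entities in the largest experiment can be estimated as follows. The function name is hypothetical, and the estimate assumes the ratio stays constant at every scale, which the slides do not confirm.

```python
ENTITIES_PER_1000_TRIPLES = 36  # ratio quoted on the slide


def estimate_entities(triples):
    """Estimate the entity count for a dataset of `triples` triples."""
    return triples * ENTITIES_PER_1000_TRIPLES // 1000


print(estimate_entities(500_000_000))  # 18000000
```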


Scalability Experiments (2/2)


Performance of LogMap [JG11]

Performance of LogMap for 10K triples

Performance of LogMap for 25K triples

Performance of LogMap for 50K triples


Conclusions

• Schema-aware variations

– Complex class definitions

– Property constraints

– Equivalence, Disjointness, etc.

• Combination of transformations

• Scalable data generation on the order of billions of triples

– Uses sampling

• Weighted gold standard

– Final gold standard

– Detailed gold standard for debugging purposes


Future Work

• SPIMBENCH will be used as one of the Ontology Alignment Evaluation Initiative [OAEI] benchmarks for 2015.

• Domain-independent instance matching test case generator.

• Definition of more sophisticated metrics that take into account the difficulty (weight).


Acknowledgments

This work was partially supported by the ongoing FP7 European Project LDBC (Linked Data Benchmark Council) (317548) and was done in collaboration with I. Fundulaki, M. Herschel (University of Stuttgart), G. Flouris, E. Daskalaki and A.-C. Ngonga Ngomo (University of Leipzig).


References

[FLM08] A. Ferrara, D. Lorusso, S. Montanelli, G. Varese. Towards a Benchmark for Instance Matching. In OM, 2008.

[FMN+11] A. Ferrara, S. Montanelli, J. Noessner, H. Stuckenschmidt. Benchmarking Matching Applications on the Semantic Web. In ESWC, 2011.

[NV13] M. Nickel, V. Tresp. Tensor Factorization for Multi-relational Learning. Machine Learning and Knowledge Discovery in Databases. Springer Berlin Heidelberg, 2013. 617-621.

[J11] J. M. Joyce. Kullback-Leibler Divergence. International Encyclopedia of Statistical Science. Springer Berlin Heidelberg, 2011. 720-722.

[JG11] E. Jimenez-Ruiz, B. C. Grau. LogMap: Logic-based and scalable ontology matching. In ISWC, 2011.

[FT04] B. Fuglede, F. Topsoe. Jensen-Shannon divergence and Hilbert space embedding. In IEEE International Symposium on Information Theory, 2004.

[OAEI] Ontology Alignment Evaluation Initiative, http://oaei.ontologymatching.org/

Thank you! Questions?
