Revealing the Conceptual Schemas of RDF Datasets
Subhi Issa, Pierre-Henri Paris, Fayçal Hamdi, Samira Si-Said Cherfi
CEDRIC Lab, Conservatoire National des Arts et Métiers, Paris, France
June 6th, 2019
Introduction Approach Prototype Conclusion
Context
Several datasets are published according to the Linked Data principles:
  DBpedia
  YAGO
  Wikidata
Data comes from a variety of sources and is of varying quality, and is thus not easy to trust
There is a real need for suitable methods and techniques to better exploit web data
Conceptual modeling is widely recognized as a means of abstraction and of highlighting semantics
S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 2 / 25
Conceptual modeling and Linked Open Data
Conceptual Modeling
  Initiated by business needs
  Controlled by modelers
  Explicit "user" requirements
  Quality is controlled by the design process
Linked Open Data context
  Initiated by users' moods
  Driven by data
  Nonexistent or hidden intentions
  Quality is not a subject
Motivating Example
Example
SELECT * WHERE {
  ?actor rdf:type dbo:Scientist .
  ?actor foaf:name ?name .
  ?actor dbo:birthDate ?birthDate .
  ?actor dbo:birthPlace ?birthPlace .
}
Needs to know the properties of Scientist (schema)
Returns only scientists having values for all properties (completeness)
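The second point can be illustrated with a minimal Python sketch. It only mimics the conjunctive behaviour of the query on toy data; the instance names and property sets are hypothetical and are not taken from DBpedia.

```python
# Toy data (hypothetical): the properties that have a value for each instance.
instances = {
    "Scientist_A": {"name", "birthDate", "birthPlace"},
    "Scientist_B": {"name", "birthDate"},   # no birthPlace value
    "Scientist_C": {"name"},                # only a name
}

# Properties requested by the query pattern.
required = {"name", "birthDate", "birthPlace"}


def conjunctive_match(instances, required):
    """Return the instances that have a value for every required property."""
    return sorted(name for name, props in instances.items() if required <= props)


matched = conjunctive_match(instances, required)
print(matched)
```

Only the instance carrying all three properties is returned; the two incomplete descriptions are silently dropped, which is why completeness matters when writing such queries.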
What is completeness?
In relational databases: percentage of non-null values
Consequently, we need a reference schema
A reference scientist schema (ontology) could be:
Scientist_Schema = {Properties on Scientist} ∪ {Properties on Person} ∪ {Properties on Agent} ∪ {Properties on Thing}
such that: Scientist ⊑ Person ⊑ Agent ⊑ Thing
Comp(Albert_Einstein) = |Properties on Albert_Einstein| / |Scientist_Schema|
=21664
= 4.21%
Is this schema relevant?
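As a rough illustration of this ontology-based measure, the sketch below builds a reference schema as the union of the properties along a class chain and computes the ratio for one instance. The property names and counts are hypothetical toy values, not the DBpedia figures, and counting only instance properties that belong to the schema is an assumption of this sketch.

```python
# Hypothetical class chain, from most specific to most general.
hierarchy = ["Scientist", "Person", "Agent", "Thing"]

# Hypothetical properties declared on each class.
declared = {
    "Scientist": {"doctoralAdvisor", "field"},
    "Person": {"birthDate", "birthPlace"},
    "Agent": {"name"},
    "Thing": {"label"},
}


def reference_schema(hierarchy, declared):
    """Union of the properties declared on a class and on all its ancestors."""
    schema = set()
    for cls in hierarchy:
        schema |= declared[cls]
    return schema


def comp(instance_props, schema):
    """Share of the schema properties that the instance actually uses."""
    return len(instance_props & schema) / len(schema)


schema = reference_schema(hierarchy, declared)
ratio = comp({"name", "birthDate", "field"}, schema)
print(len(schema), ratio)
```

Because the schema grows with every ancestor class, the denominator quickly becomes large, which is one way to read the low value obtained above.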
Approach overview: like a scratch card
Goal
An approach guiding the mining of a schema that meets user requirements
  A completeness-based exploration
  An incremental process
  Considering user requirements (the human in the loop)
Conceptual schemas derivation
Completeness Specification
D is a triple (C, I_C, P), where:
  C is the set of categories (e.g., Actor, City)
  I_C is the set of instances for categories in C
  P = {p1, p2, ..., pn} is the set of properties (e.g., residence(Person, Place))
T = {t1, t2, ..., tm} is a set of transactions with:
  ∀k, 1 ≤ k ≤ m : tk ⊆ P, the vector of transactions over P
  E(tk) is the set of items in transaction tk
  Each transaction is the set of properties used in the description of an instance of the subset I′ = {i1, i2, ..., im} with I′ ⊆ I_C
Completeness
We consider CP, the completeness of I′ against the properties used in the description of each of its instances
Example
Subject        Predicate      Object
The_Godfather  director       Francis_Ford_Coppola
The_Godfather  musicComposer  Nino_Rota
Goodfellas     director       Martin_Scorsese
Goodfellas     editing        Thelma_Schoonmaker
True_Lies      director       James_Cameron
True_Lies      editing        Conrad_Buff_IV
True_Lies      musicComposer  Brad_Fiedel

Instance       Transaction
The_Godfather  director, musicComposer
Goodfellas     director, editing
True_Lies      director, editing, musicComposer
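The passage from triples to transactions can be sketched in a few lines of Python. This is only an illustration on the example data; the function name build_transactions is a hypothetical helper, not part of the prototype.

```python
from collections import defaultdict

# The example triples (subject, predicate, object).
triples = [
    ("The_Godfather", "director", "Francis_Ford_Coppola"),
    ("The_Godfather", "musicComposer", "Nino_Rota"),
    ("Goodfellas", "director", "Martin_Scorsese"),
    ("Goodfellas", "editing", "Thelma_Schoonmaker"),
    ("True_Lies", "director", "James_Cameron"),
    ("True_Lies", "editing", "Conrad_Buff_IV"),
    ("True_Lies", "musicComposer", "Brad_Fiedel"),
]


def build_transactions(triples):
    """Group the predicates by subject: one transaction (set of properties) per instance."""
    transactions = defaultdict(set)
    for subject, predicate, _obj in triples:
        transactions[subject].add(predicate)
    return dict(transactions)


transactions = build_transactions(triples)
for instance, props in transactions.items():
    print(instance, sorted(props))
```

Objects are ignored on purpose: a transaction records only which properties describe an instance, not their values.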
The Mining-based Approach
The Mining-based Approach includes two steps:
1 Properties mining: apply the well-known FP-growth algorithm for mining the maximal frequent itemsets MFP
2 Completeness calculation: use the frequency of occurrence of items (properties) in MFP to give each of them a weight, and calculate the completeness of each transaction (regarding the presence or absence of properties)
Properties mining
Given D(C, I_C, P) and I′ a subset of instances with I′ ⊆ I_C:
1 Initialize T = ∅, MFP = ∅
2 For each i ∈ I′, generate a transaction t
3 Compute the set of frequent patterns FP from the transaction vector T
Definition (Pattern)
Let T be a set of transactions. A pattern P̂ is a sequence of properties shared by one or several transactions t in T.
4 Use the FP-growth algorithm to generate the frequent patterns FP
5 Reduce the set FP by generating a subset containing only "maximal" patterns
Definition (MFP)
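Let P̂ be a frequent pattern. P̂ is maximal if none of its proper supersets is frequent. We define the set of Maximal Frequent Patterns MFP as:
\[ \mathit{MFP} = \left\{ \hat{P} \in \mathit{FP} \;\middle|\; \forall \hat{P}' \supsetneq \hat{P} : \frac{|T(\hat{P}')|}{|T|} < \xi \right\} \]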
Example
Instance       Transaction
The_Godfather  director, musicComposer
Goodfellas     director, editing
True_Lies      director, editing, musicComposer

Let ξ = 60% and the set of frequent patterns
FP = {{director}, {musicComposer}, {editing}, {director, musicComposer}, {director, editing}}
The MFP set would be:
MFP = {{director, musicComposer}, {director, editing}}
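The example can be checked with a small Python sketch. Note that it does not implement FP-growth: it enumerates all candidate itemsets by brute force, which is only viable for a handful of properties, and the function names are hypothetical. A pattern is taken to be frequent when its support is at least ξ.

```python
from itertools import combinations

# The three example transactions.
transactions = [
    {"director", "musicComposer"},
    {"director", "editing"},
    {"director", "editing", "musicComposer"},
]


def frequent_patterns(transactions, min_support):
    """Brute-force enumeration of itemsets with support >= min_support."""
    items = sorted(set().union(*transactions))
    frequent = []
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            pattern = frozenset(candidate)
            support = sum(pattern <= t for t in transactions) / len(transactions)
            if support >= min_support:
                frequent.append(pattern)
    return frequent


def maximal_patterns(frequent):
    """Keep the patterns that have no frequent proper superset."""
    return [p for p in frequent if not any(p < q for q in frequent)]


fp = frequent_patterns(transactions, 0.6)
mfp = maximal_patterns(fp)
print([sorted(p) for p in mfp])
```

The pair {editing, musicComposer} and the full triple appear in only one transaction out of three, so they fall below the threshold and the two remaining pairs are maximal.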
Completeness calculation
Given the MFP set:
Definition (Completeness CP)
Let I′ be a subset of instances, T the set of transactions constructed from I′, and MFP a set of maximal frequent patterns. The completeness of I′ corresponds to the completeness of its transaction vector T, obtained by averaging the completeness of T with regard to each pattern in MFP. Therefore, we define the completeness CP of a subset of instances I′ as follows:
\[ CP(I') = \frac{1}{|T|} \sum_{k=1}^{|T|} \sum_{j=1}^{|\mathit{MFP}|} \frac{\delta(E(t_k), \hat{P}_j)}{|\mathit{MFP}|} \qquad (1) \]
such that: \hat{P}_j \in \mathit{MFP}, and
\[ \delta(E(t_k), \hat{P}_j) = \begin{cases} 1 & \text{if } \hat{P}_j \subseteq E(t_k) \\ 0 & \text{otherwise} \end{cases} \]
Example
Let MFP = {{director, musicComposer}, {director, editing}}, where both itemsets meet the support threshold ξ = 60%. The completeness would be:
CP(I′) = (2 × (1/2) + (2/2)) / 3 = 0.67
  We have 3 transactions (The_Godfather, Goodfellas and True_Lies) and 2 maximal frequent patterns
  More frequent properties have a higher weight (director)
  Only co-occurring properties are considered
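The arithmetic of this example can be reproduced with a minimal Python sketch of equation 1. The function name completeness is hypothetical, and pattern inclusion is treated as non-strict, since The_Godfather is counted as covering {director, musicComposer} in the example.

```python
# Example transactions and maximal frequent patterns.
transactions = {
    "The_Godfather": {"director", "musicComposer"},
    "Goodfellas": {"director", "editing"},
    "True_Lies": {"director", "editing", "musicComposer"},
}
mfp = [{"director", "musicComposer"}, {"director", "editing"}]


def completeness(transactions, mfp):
    """Average, over all transactions, of the share of MFP patterns each one contains."""
    total = 0.0
    for props in transactions.values():
        covered = sum(1 for pattern in mfp if pattern <= props)
        total += covered / len(mfp)
    return total / len(transactions)


cp = completeness(transactions, mfp)
print(round(cp, 2))  # 0.67
```

The_Godfather and Goodfellas each contain one of the two patterns (1/2), while True_Lies contains both (2/2), giving (1/2 + 1/2 + 1) / 3.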
Algorithm 1 Completeness calculation
Input: D, I′, ξ
Output: CP(I′)
for each i ∈ I′ do
    t_i = |p1 p2 ... pn|
    T = T + t_i
▷ Properties mining
MFP = Maximal(FP-growth(T, ξ))
▷ Using equation 1
return CP(I′) = CalculateCompleteness(I′, T, MFP)
LOD-CM web service
Browse a dataset without examining the data in detail
Choose the dataset that will be most suitable for the intended use
Facilitate data browsing (based on user requirements):
  Inheritance relationships
  Relations between classes
  Completeness value of each property
Experimental setup
DBpedia version 2016-10
  English edition
  1.1 billion RDF triples
  468 classes
  1378 properties
Data: HDT dumps
Implemented in Java using the Jena library
PlantUML tool to visualize diagrams
LOD-CM
User interface: http://cedric.cnam.fr/lod-cm
LOD-CM
Example: Class name: Film, Completeness: 50%
LOD-CM
Example: Class name: Film, Completeness: 50% - First iteration
Properties and relationships with Completeness > 50%
For the next step: zoom in on Artist
LOD-CM
Example: Class name: Artist, Completeness: 50%
Conclusion
Reveals conceptual schemas from RDF data sources
Considers a user-specified threshold and exploration requirements
Provides enriched schemas with completeness values
Perspectives
  Investigate the effectiveness of the approach on additional Linked Open Data datasets such as YAGO, Wikidata, etc.
  Allow the user to compare conceptual schemas from different datasets
  Extend user requirements: extra quality criteria, user queries...
Thank You!
Questions?