sampld, structural properties as proxy for semantic relevance
DESCRIPTION
Presentation at the International Semantic Web Conference (ISWC) 2014. See http://presentations.laurensrietveld.nl/now for interactive and better looking version. Abstract: The Linked Data cloud has grown to become the largest knowledge base ever constructed. Its size is now turning into a major bottleneck for many applications. In order to facilitate access to this structured information, this paper proposes an automatic sampling method targeted at maximizing answer coverage for applications using SPARQL querying. The approach presented in this paper is novel: no similar RDF sampling approach exist. Additionally, the concept of creating a sample aimed at maximizing SPARQL answer coverage, is unique. We empirically show that the relevance of triples for sampling (a semantic notion) is influenced by the topology of the graph (purely structural), and can be determined without prior knowledge of the queries. Experiments show a significantly higher recall of topology based sampling methods over random and naive baseline approaches (e.g. up to 90% for Open-BioMed at a sample size of 6%).TRANSCRIPT
SAMPLDSTRUCTURAL PROPERTIES AS PROXY FOR
SEMANTIC RELEVANCELAURENS RIETVELD
HTTP://PRESENTATIONS.LAURENSRIETVELD.NL/NOW
QUANTITY OVER QUALITYDatasets become too large to run on commodity hardwareWe use only a small portionCan't we extract the part we are interested in?
Dataset #triples #queries coverage
DBpedia 459M 1640 0.003%
Linked Geo Data 289M 891 1.917%
MetaLex 204M 4933 0.016%
Open-BioMed 79M 931 0.011%
BIO2RDF (KEGG) 50M 1297 2.013%
Semantic Web Dog
Food
0.24M 193 62.4%
RELEVANCE BASED SAMPLINGFind the smallest possible RDF subgraph, that covers themaximum number of potential queries
How can we determine which triples are relevant, and whichare not?Can we implement a scalable sampling pipeline?Can we evaluate the results in a scalable fashion?
HOW TO DETERMINE RELEVANCE OF TRIPLESINFORMED SAMPLINGWe know exactlywhich queries will beaskedExtract those triplesneeded to answer thequeriesProblem: only a limitednumber of queriesknown
UNINFORMED SAMPLINGWe do not know which queries willbe askedUse information contained in thegraph to determine relevanceRank triples by relevance, and selectthe k best triples (0 < k < size ofgraph)
APPROACHUse the topology of the graph to determine relevance(network analysis)Evaluate the relevance of our samples against the queries thatwe do knowIs network structure a good predictor for query answerability?
NETWORK ANALYSISExample: Explain real-world phenomenonsFind central parts of the graphBetweenness CentralityGoogle PageRank
We applyIn DegreeOut DegreePageRank
RANKED LIST OF TRIPLESRewrite
Apply network analysisNodes Weights → Triples Weights
Weighted list of triples → Sample
W(triple) = max(W(sub), W(obj))
EVALUATION
Sample sizes: 1% - 99%Baselines:
Random Sample (10x)Resource Frequency
NAIVE EVALUATION DOES NOT SCALE
( ) = ⋅ method ⋅ method ⋅te td ∑i=1
99i
100 ss sb td
Over 15.000 datasets, and over 1.4 trillion triplesRequirements
Fast loading of samplesPowerful hardware
Not Scalable: load all triples, execute queries, and calculaterecall
SCALABLE APPROACHRetrieve which triples are used by a queryUse a hadoop cluster to find the weights of these triplesAnalyze whether these triples would have been included in thesample
Scalable. Only execute each query once
EXAMPLESELECT ?person ?country WHERE { ?person :bornIn ?city . ?city :capitalOf ?country .}
1234
?person ?country:Laurens :NL
:Stefan :Germany
QUERY
DATASETSubject Predicate Object Weight:Amsterdam :capitalOf :NL 0.1
:Berlin :capitalOf :Germany 0.5
:Laurens :bornIn :Amsterdam 0.6
:Rinke :bornIn :Heerenveen 0.1
:Stefan :bornIn :Berlin 0.5
QUERY RESULTS → TRIPLESSELECT ?person ?country WHERE { ?person :bornIn ?city . ?city :capitalOf ?country .}
1234
SELECT DISTINCT * WHERE { ?person :bornIn ?city . ?city :capitalOf ?country .}
1234
Subject Predicate Object ?person ?city ?country:Laurens :bornIn :Amsterdam :Laurens :Amsterdam :NL:Amsterdam :capitalOf :NL :Stefan :bornIn :Berlin :Stefan :Berlin :Germany:Berlin :capitalOf :Germany
TRIPLES→RECALLWHICH ANSWERS WOULD WE GET WITH A SAMPLE OF 60%?
Dataset
Subject Predicate Object Weight
:Laurens :bornIn :Amsterdam 0.6
:Stefan :bornIn :Berlin 0.5
:Berlin :capitalOf :Germany 0.5
:Amsterdam :capitalOf :NL 0.1
:Rinke :bornIn :Heerenveen 0.1
Triples used in query resultsets
Subject Predicate Object
:Laurens :bornIn :Amsterdam
:Stefan :bornIn :Berlin
:Berlin :capitalOf :Germany
:Amsterdam :capitalOf :NL
EVALUATIONBetter specificity than regular recallScalable: PIG instead of SPARQLSpecial cases, e.g. GROUP BY, LIMIT, DISTINCT, OPTIONAL,UNION
SPECIAL CASE: UNIONSSELECT ?name WHERE { {:laurens rdfs:label ?name} UNION {:laurens foaf:name ?name}}
12345
SELECT DISTINCT * WHERE { {:laurens rdfs:label ?name} UNION {:laurens foaf:name ?name}}
12345
Subject Predicate Object:Laurens rdfs:label "Laurens":Laurens foaf:name "Laurens"
EVALUATION DATASETSDataset #triples #queries coverage
DBpedia 459M 1640 0.003%
Linked Geo Data 289M 891 1.917%
MetaLex 204M 4933 0.016%
Open-BioMed 79M 931 0.011%
BIO2RDF (KEGG) 50M 1297 2.013%
Semantic Web Dog
Food
0.24M 193 62.4%
RESULTS click for more results
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Sample Size0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0Recall BIO2RDF UniqueLiterals Outdegree
DBpedia Path Pagerank
LinkedGeoData UniqueLiterals Outdegree
Metalex ResourceFrequency
OpenBioMed WithoutLiterals Outdegree
SemanticWebDogFood Path Pagerank
OBSERVATIONSThe influence of a single tripleDBpedia
Good: Path + PageRankBad: Path + Out DegreeQueries: 2/3 require literals
Other Observations# properties vs 'Context Literals' rewrite method# query triple patterns
CONCLUSIONScalable pipeline: network analysis algorithms + rewritemethodsAble to eval over 15.000 datasets, and 1.4 trillion triplesNumber of query sets too limited to learn significantcorrelations
Topology of the graphs can be used to determine good samplesMimic semantic relevance through structural properties,without an a-priori notion of relevance
Special thanks to Semantic Web Science Association (SWSA)