sampld, structural properties as proxy for semantic relevance

SAMPLDSTRUCTURAL PROPERTIES AS PROXY FOR

SEMANTIC RELEVANCELAURENS RIETVELD

HTTP://PRESENTATIONS.LAURENSRIETVELD.NL/NOW

http://presentations.laurensrietveld.nl/now

QUANTITY OVER QUALITYDatasets become too large to run on commodity hardwareWe use only a small portionCan't we extract the part we are interested in?

Dataset #triples #queries coverage

DBpedia 459M 1640 0.003%

Linked Geo Data 289M 891 1.917%

MetaLex 204M 4933 0.016%

Open-BioMed 79M 931 0.011%

BIO2RDF (KEGG) 50M 1297 2.013%

Semantic Web Dog

Food

0.24M 193 62.4%

RELEVANCE BASED SAMPLINGFind the smallest possible RDF subgraph, that covers themaximum number of potential queries

How can we determine which triples are relevant, and whichare not?Can we implement a scalable sampling pipeline?Can we evaluate the results in a scalable fashion?

HOW TO DETERMINE RELEVANCE OF TRIPLESINFORMED SAMPLINGWe know exactlywhich queries will beaskedExtract those triplesneeded to answer thequeriesProblem: only a limitednumber of queriesknown

UNINFORMED SAMPLINGWe do not know which queries willbe askedUse information contained in thegraph to determine relevanceRank triples by relevance, and selectthe k best triples (0 < k < size ofgraph)

APPROACHUse the topology of the graph to determine relevance(network analysis)Evaluate the relevance of our samples against the queries thatwe do knowIs network structure a good predictor for query answerability?

NETWORK ANALYSISExample: Explain real-world phenomenonsFind central parts of the graphBetweenness CentralityGoogle PageRank

We applyIn DegreeOut DegreePageRank

RANKED LIST OF TRIPLESRewrite

Apply network analysisNodes Weights → Triples Weights

Weighted list of triples → Sample

W(triple) = max(W(sub), W(obj))

EVALUATION

Sample sizes: 1% - 99%Baselines:

Random Sample (10x)Resource Frequency

NAIVE EVALUATION DOES NOT SCALE

( ) = ⋅ method ⋅ method ⋅te td ∑i=1

99i

100 ss sb td

Over 15.000 datasets, and over 1.4 trillion triplesRequirements

Fast loading of samplesPowerful hardware

Not Scalable: load all triples, execute queries, and calculaterecall

SCALABLE APPROACHRetrieve which triples are used by a queryUse a hadoop cluster to find the weights of these triplesAnalyze whether these triples would have been included in thesample

Scalable. Only execute each query once

EXAMPLESELECT ?person ?country WHERE { ?person :bornIn ?city . ?city :capitalOf ?country .}

1234

?person ?country:Laurens :NL

:Stefan :Germany

QUERY

DATASETSubject Predicate Object Weight:Amsterdam :capitalOf :NL 0.1

:Berlin :capitalOf :Germany 0.5

:Laurens :bornIn :Amsterdam 0.6

:Rinke :bornIn :Heerenveen 0.1

:Stefan :bornIn :Berlin 0.5

QUERY RESULTS → TRIPLESSELECT ?person ?country WHERE { ?person :bornIn ?city . ?city :capitalOf ?country .}

1234

SELECT DISTINCT * WHERE { ?person :bornIn ?city . ?city :capitalOf ?country .}

1234

Subject Predicate Object ?person ?city ?country:Laurens :bornIn :Amsterdam :Laurens :Amsterdam :NL:Amsterdam :capitalOf :NL :Stefan :bornIn :Berlin :Stefan :Berlin :Germany:Berlin :capitalOf :Germany

TRIPLES→RECALLWHICH ANSWERS WOULD WE GET WITH A SAMPLE OF 60%?

Dataset

Subject Predicate Object Weight

:Laurens :bornIn :Amsterdam 0.6

:Stefan :bornIn :Berlin 0.5

:Berlin :capitalOf :Germany 0.5

:Amsterdam :capitalOf :NL 0.1

:Rinke :bornIn :Heerenveen 0.1

Triples used in query resultsets

Subject Predicate Object

:Laurens :bornIn :Amsterdam

:Stefan :bornIn :Berlin

:Berlin :capitalOf :Germany

:Amsterdam :capitalOf :NL

EVALUATIONBetter specificity than regular recallScalable: PIG instead of SPARQLSpecial cases, e.g. GROUP BY, LIMIT, DISTINCT, OPTIONAL,UNION

SPECIAL CASE: UNIONSSELECT ?name WHERE { {:laurens rdfs:label ?name} UNION {:laurens foaf:name ?name}}

12345

SELECT DISTINCT * WHERE { {:laurens rdfs:label ?name} UNION {:laurens foaf:name ?name}}

12345

Subject Predicate Object:Laurens rdfs:label "Laurens":Laurens foaf:name "Laurens"

EVALUATION DATASETSDataset #triples #queries coverage

DBpedia 459M 1640 0.003%

Linked Geo Data 289M 891 1.917%

MetaLex 204M 4933 0.016%

Open-BioMed 79M 931 0.011%

BIO2RDF (KEGG) 50M 1297 2.013%

Semantic Web Dog

Food

0.24M 193 62.4%

RESULTS click for more results

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

Sample Size0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0Recall BIO2RDF UniqueLiterals Outdegree

DBpedia Path Pagerank

LinkedGeoData UniqueLiterals Outdegree

Metalex ResourceFrequency

OpenBioMed WithoutLiterals Outdegree

SemanticWebDogFood Path Pagerank

OBSERVATIONSThe influence of a single tripleDBpedia

Good: Path + PageRankBad: Path + Out DegreeQueries: 2/3 require literals

Other Observations# properties vs 'Context Literals' rewrite method# query triple patterns

CONCLUSIONScalable pipeline: network analysis algorithms + rewritemethodsAble to eval over 15.000 datasets, and 1.4 trillion triplesNumber of query sets too limited to learn significantcorrelations

Topology of the graphs can be used to determine good samplesMimic semantic relevance through structural properties,without an a-priori notion of relevance

Special thanks to Semantic Web Science Association (SWSA)