SampLD, Structural Properties as Proxy for Semantic Relevance

SAMPLD STRUCTURAL PROPERTIES AS PROXY FOR SEMANTIC RELEVANCE LAURENS RIETVELD HTTP://PRESENTATIONS.LAURENSRIETVELD.NL/NOW

Upload: laurensrietveld

Post on 04-Jul-2015


DESCRIPTION

Presentation at the International Semantic Web Conference (ISWC) 2014. See http://presentations.laurensrietveld.nl/now for an interactive and better-looking version. Abstract: The Linked Data cloud has grown to become the largest knowledge base ever constructed. Its size is now turning into a major bottleneck for many applications. In order to facilitate access to this structured information, this paper proposes an automatic sampling method targeted at maximizing answer coverage for applications using SPARQL querying. The approach presented in this paper is novel: no similar RDF sampling approaches exist. Additionally, the concept of creating a sample aimed at maximizing SPARQL answer coverage is unique. We empirically show that the relevance of triples for sampling (a semantic notion) is influenced by the topology of the graph (purely structural), and can be determined without prior knowledge of the queries. Experiments show a significantly higher recall of topology-based sampling methods over random and naive baseline approaches (e.g. up to 90% for Open-BioMed at a sample size of 6%).

TRANSCRIPT

Page 1: SampLD, Structural Properties as Proxy for Semantic Relevance

SAMPLD
STRUCTURAL PROPERTIES AS PROXY FOR SEMANTIC RELEVANCE

LAURENS RIETVELD

HTTP://PRESENTATIONS.LAURENSRIETVELD.NL/NOW

Page 2: SampLD, Structural Properties as Proxy for Semantic Relevance

QUANTITY OVER QUALITY

Datasets become too large to run on commodity hardware
We use only a small portion
Can't we extract the part we are interested in?

Dataset                #triples  #queries  coverage
DBpedia                459M      1640      0.003%
Linked Geo Data        289M      891       1.917%
MetaLex                204M      4933      0.016%
Open-BioMed            79M       931       0.011%
BIO2RDF (KEGG)         50M       1297      2.013%
Semantic Web Dog Food  0.24M     193       62.4%

Page 3: SampLD, Structural Properties as Proxy for Semantic Relevance

RELEVANCE BASED SAMPLING

Find the smallest possible RDF subgraph that covers the maximum number of potential queries

How can we determine which triples are relevant, and which are not?
Can we implement a scalable sampling pipeline?
Can we evaluate the results in a scalable fashion?

Page 4: SampLD, Structural Properties as Proxy for Semantic Relevance

HOW TO DETERMINE RELEVANCE OF TRIPLES

INFORMED SAMPLING
We know exactly which queries will be asked
Extract those triples needed to answer the queries
Problem: only a limited number of queries known

UNINFORMED SAMPLING
We do not know which queries will be asked
Use information contained in the graph to determine relevance
Rank triples by relevance, and select the k best triples (0 < k < size of graph)

Page 5: SampLD, Structural Properties as Proxy for Semantic Relevance

APPROACH

Use the topology of the graph to determine relevance (network analysis)
Evaluate the relevance of our samples against the queries that we do know
Is network structure a good predictor for query answerability?

Page 6: SampLD, Structural Properties as Proxy for Semantic Relevance

NETWORK ANALYSIS

Example: Explain real-world phenomena
Find central parts of the graph
Betweenness Centrality
Google PageRank

We apply
In Degree
Out Degree
PageRank
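
As a rough illustration of the three measures named on this slide, the sketch below computes in degree, out degree, and a basic PageRank on a tiny directed graph. Everything in it is an assumption made for illustration: the edge list `EDGES`, the function names, the damping factor of 0.85, and the fixed iteration count are not taken from the talk, and the way RDF triples are turned into edges (the rewrite step) is simplified here to "subject points to object".

```python
# Hypothetical toy graph: each edge is (source node, target node).
# Assumption: a triple (s, p, o) becomes the edge s -> o; predicates are ignored.
EDGES = [
    (":Laurens", ":Amsterdam"),
    (":Stefan", ":Berlin"),
    (":Rinke", ":Heerenveen"),
    (":Amsterdam", ":NL"),
    (":Berlin", ":Germany"),
]


def degrees(edges):
    """Return (in_degree, out_degree) dictionaries covering every node."""
    nodes = {node for edge in edges for node in edge}
    in_degree = {node: 0 for node in nodes}
    out_degree = {node: 0 for node in nodes}
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    return in_degree, out_degree


def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; dangling nodes spread their rank evenly."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {node: base for node in nodes}
        for node in nodes:
            for target in out_links[node]:
                new_rank[target] += damping * rank[node] / len(out_links[node])
        rank = new_rank
    return rank


if __name__ == "__main__":
    in_degree, out_degree = degrees(EDGES)
    ranks = pagerank(EDGES)
    for node in sorted(ranks, key=ranks.get, reverse=True):
        print(node, in_degree[node], out_degree[node], round(ranks[node], 3))
```

In this toy graph the nodes at the end of a two-step path (`:NL`, `:Germany`) end up with the highest PageRank, while out degree favours the nodes that only appear as subjects.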

Page 7: SampLD, Structural Properties as Proxy for Semantic Relevance

RANKED LIST OF TRIPLES

Rewrite
Apply network analysis
Node weights → triple weights
Weighted list of triples → sample

W(triple) = max(W(sub), W(obj))
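
The rule W(triple) = max(W(sub), W(obj)) and the final selection step can be sketched as follows. This is a minimal sketch, not the pipeline used in the talk: the node weights in `NODE_WEIGHTS` are hypothetical values, the function names are invented for illustration, and nodes without a weight are assumed to count as 0.

```python
# Hypothetical node weights, e.g. the output of a network analysis step.
NODE_WEIGHTS = {
    ":Laurens": 0.6, ":Amsterdam": 0.1, ":NL": 0.05,
    ":Stefan": 0.5, ":Berlin": 0.5, ":Germany": 0.2,
    ":Rinke": 0.1, ":Heerenveen": 0.05,
}

TRIPLES = [
    (":Amsterdam", ":capitalOf", ":NL"),
    (":Berlin", ":capitalOf", ":Germany"),
    (":Laurens", ":bornIn", ":Amsterdam"),
    (":Rinke", ":bornIn", ":Heerenveen"),
    (":Stefan", ":bornIn", ":Berlin"),
]


def triple_weights(triples, node_weights):
    """W(triple) = max(W(subject), W(object)); unknown nodes count as 0 (assumption)."""
    return [
        (triple, max(node_weights.get(triple[0], 0.0), node_weights.get(triple[2], 0.0)))
        for triple in triples
    ]


def top_fraction(weighted_triples, fraction):
    """Rank triples by weight (highest first) and keep the top `fraction` of them."""
    ranked = sorted(weighted_triples, key=lambda row: row[1], reverse=True)
    k = round(len(ranked) * fraction)
    return [triple for triple, _ in ranked[:k]]


if __name__ == "__main__":
    weighted = triple_weights(TRIPLES, NODE_WEIGHTS)
    for triple in top_fraction(weighted, 0.6):
        print(triple)
```

Taking the maximum means a triple is ranked highly as soon as either end of it is considered important, so a well-connected subject can pull an otherwise obscure object into the sample.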

Page 8: SampLD, Structural Properties as Proxy for Semantic Relevance

EVALUATION

Sample sizes: 1% - 99%

Baselines:
Random Sample (10x)
Resource Frequency

Page 9: SampLD, Structural Properties as Proxy for Semantic Relevance

NAIVE EVALUATION DOES NOT SCALE

$t_e(t_d) = \sum_{i=1}^{99} \frac{i}{100} \cdot \mathit{methods}_s \cdot \mathit{methods}_b \cdot t_d$

Over 15,000 datasets, and over 1.4 trillion triples

Requirements
Fast loading of samples
Powerful hardware

Not scalable: load all triples, execute queries, and calculate recall
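
To get a feel for why the naive evaluation is expensive, the arithmetic can be sketched as below. The only fact taken from the slides is that samples range from 1% to 99%; the function names and the `method_combinations` parameter are hypothetical, and the cost model (one full load of every sample, for every method combination) is an assumption.

```python
def naive_load_factor(min_percent=1, max_percent=99):
    """Sum of all sample sizes, expressed as a multiple of the full dataset size."""
    return sum(range(min_percent, max_percent + 1)) / 100


def triples_to_load(dataset_triples, method_combinations):
    """Triples loaded if every sample of every method combination is loaded once.

    `method_combinations` is a hypothetical count of sampling configurations.
    """
    return naive_load_factor() * dataset_triples * method_combinations


if __name__ == "__main__":
    # Loading the 1%, 2%, ..., 99% samples of one method costs 49.5 full datasets.
    print(naive_load_factor())
    # Example with the DBpedia size from the slides (459M triples), one method.
    print(triples_to_load(459_000_000, 1))
```

Even for a single method, evaluating all 99 sample sizes means loading the equivalent of 49.5 copies of the dataset, before any query has been executed.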

Page 10: SampLD, Structural Properties as Proxy for Semantic Relevance

SCALABLE APPROACH

Retrieve which triples are used by a query
Use a Hadoop cluster to find the weights of these triples
Analyze whether these triples would have been included in the sample

Scalable: only execute each query once

Page 11: SampLD, Structural Properties as Proxy for Semantic Relevance

EXAMPLE

QUERY

    SELECT ?person ?country WHERE {
      ?person :bornIn ?city .
      ?city :capitalOf ?country .
    }

?person   ?country
:Laurens  :NL
:Stefan   :Germany

DATASET

Subject     Predicate   Object       Weight
:Amsterdam  :capitalOf  :NL          0.1
:Berlin     :capitalOf  :Germany     0.5
:Laurens    :bornIn     :Amsterdam   0.6
:Rinke      :bornIn     :Heerenveen  0.1
:Stefan     :bornIn     :Berlin      0.5

Page 12: SampLD, Structural Properties as Proxy for Semantic Relevance

QUERY RESULTS → TRIPLES

    SELECT ?person ?country WHERE {
      ?person :bornIn ?city .
      ?city :capitalOf ?country .
    }

    SELECT DISTINCT * WHERE {
      ?person :bornIn ?city .
      ?city :capitalOf ?country .
    }

?person   ?city       ?country
:Laurens  :Amsterdam  :NL
:Stefan   :Berlin     :Germany

Subject     Predicate   Object
:Laurens    :bornIn     :Amsterdam
:Amsterdam  :capitalOf  :NL
:Stefan     :bornIn     :Berlin
:Berlin     :capitalOf  :Germany
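
The step from query results to triples can be mimicked in a few lines of Python. This is only a sketch of the idea: the two triple patterns of the example query are hard-coded as a nested loop rather than evaluated by a SPARQL engine, and the function name `answers_with_triples` is invented for illustration.

```python
DATASET = [
    (":Amsterdam", ":capitalOf", ":NL"),
    (":Berlin", ":capitalOf", ":Germany"),
    (":Laurens", ":bornIn", ":Amsterdam"),
    (":Rinke", ":bornIn", ":Heerenveen"),
    (":Stefan", ":bornIn", ":Berlin"),
]


def answers_with_triples(dataset):
    """Join `?person :bornIn ?city` with `?city :capitalOf ?country`.

    Returns a list of (bindings, triples used for that answer) pairs, which is
    the information the SELECT DISTINCT * rewrite makes available.
    """
    results = []
    for born in dataset:
        if born[1] != ":bornIn":
            continue
        for capital in dataset:
            # The shared variable ?city links the object of the first pattern
            # to the subject of the second pattern.
            if capital[1] == ":capitalOf" and capital[0] == born[2]:
                bindings = {
                    "?person": born[0],
                    "?city": born[2],
                    "?country": capital[2],
                }
                results.append((bindings, [born, capital]))
    return results


if __name__ == "__main__":
    for bindings, used in answers_with_triples(DATASET):
        print(bindings, used)
```

Because every variable is projected, each answer row identifies exactly which triples it depends on; `:Rinke :bornIn :Heerenveen` never shows up since it joins with nothing.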

Page 13: SampLD, Structural Properties as Proxy for Semantic Relevance

TRIPLES → RECALL

WHICH ANSWERS WOULD WE GET WITH A SAMPLE OF 60%?

Dataset

Subject     Predicate   Object       Weight
:Laurens    :bornIn     :Amsterdam   0.6
:Stefan     :bornIn     :Berlin      0.5
:Berlin     :capitalOf  :Germany     0.5
:Amsterdam  :capitalOf  :NL          0.1
:Rinke      :bornIn     :Heerenveen  0.1

Triples used in query resultsets

Subject     Predicate   Object
:Laurens    :bornIn     :Amsterdam
:Stefan     :bornIn     :Berlin
:Berlin     :capitalOf  :Germany
:Amsterdam  :capitalOf  :NL
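
The question on this slide can be worked through with the numbers given. The sketch below uses the weights and triples from the example; the function names are invented for illustration, and it assumes an answer counts as retrieved only when every triple it depends on is in the sample.

```python
# Weighted triples from the example dataset.
WEIGHTED = [
    ((":Laurens", ":bornIn", ":Amsterdam"), 0.6),
    ((":Stefan", ":bornIn", ":Berlin"), 0.5),
    ((":Berlin", ":capitalOf", ":Germany"), 0.5),
    ((":Amsterdam", ":capitalOf", ":NL"), 0.1),
    ((":Rinke", ":bornIn", ":Heerenveen"), 0.1),
]

# Each answer (?person, ?country) with the triples needed to produce it.
ANSWERS = {
    (":Laurens", ":NL"): [
        (":Laurens", ":bornIn", ":Amsterdam"),
        (":Amsterdam", ":capitalOf", ":NL"),
    ],
    (":Stefan", ":Germany"): [
        (":Stefan", ":bornIn", ":Berlin"),
        (":Berlin", ":capitalOf", ":Germany"),
    ],
}


def top_fraction(weighted, fraction):
    """Keep the highest-weighted `fraction` of the triples as the sample."""
    ranked = sorted(weighted, key=lambda row: row[1], reverse=True)
    k = round(len(ranked) * fraction)
    return {triple for triple, _ in ranked[:k]}


def recall(answers, sample):
    """Return the answers still obtainable from the sample, and the recall."""
    kept = [
        answer
        for answer, needed in answers.items()
        if all(triple in sample for triple in needed)
    ]
    return kept, len(kept) / len(answers)


if __name__ == "__main__":
    sample = top_fraction(WEIGHTED, 0.6)  # 3 of the 5 triples
    kept, score = recall(ANSWERS, sample)
    print(kept, score)  # [(':Stefan', ':Germany')] 0.5
```

With a 60% sample the three highest-weighted triples are kept, so the Stefan/Germany answer survives while Laurens/NL is lost because `:Amsterdam :capitalOf :NL` (weight 0.1) falls below the cut. No query has to be re-executed against the sample to find this out.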

Page 14: SampLD, Structural Properties as Proxy for Semantic Relevance

EVALUATION

Better specificity than regular recall
Scalable: Pig instead of SPARQL
Special cases, e.g. GROUP BY, LIMIT, DISTINCT, OPTIONAL, UNION

Page 15: SampLD, Structural Properties as Proxy for Semantic Relevance

SPECIAL CASE: UNIONS

    SELECT ?name WHERE {
      {:laurens rdfs:label ?name}
      UNION
      {:laurens foaf:name ?name}
    }

    SELECT DISTINCT * WHERE {
      {:laurens rdfs:label ?name}
      UNION
      {:laurens foaf:name ?name}
    }

Subject   Predicate   Object
:Laurens  rdfs:label  "Laurens"
:Laurens  foaf:name   "Laurens"
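
One plausible reading of why UNION needs special handling is that the same answer ("Laurens") can be produced by either branch, so it should count as retrieved when the triples of at least one branch are in the sample. The sketch below encodes that reading; it is an assumption made for illustration rather than a rule stated on the slide, and the function name is hypothetical.

```python
# Each alternative is the list of triples that one UNION branch needs
# in order to produce the answer "Laurens".
ALTERNATIVES = [
    [(":Laurens", "rdfs:label", '"Laurens"')],
    [(":Laurens", "foaf:name", '"Laurens"')],
]


def answer_survives(alternatives, sample):
    """True if every triple of at least one alternative is in the sample."""
    return any(
        all(triple in sample for triple in alternative)
        for alternative in alternatives
    )


if __name__ == "__main__":
    only_label = {(":Laurens", "rdfs:label", '"Laurens"')}
    print(answer_survives(ALTERNATIVES, only_label))  # True
    print(answer_survives(ALTERNATIVES, set()))       # False
```

Treating both triples as jointly required would undercount recall here, since a sample holding only the `rdfs:label` triple still returns the name.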

Page 16: SampLD, Structural Properties as Proxy for Semantic Relevance
Page 17: SampLD, Structural Properties as Proxy for Semantic Relevance

EVALUATION DATASETS

Dataset                #triples  #queries  coverage
DBpedia                459M      1640      0.003%
Linked Geo Data        289M      891       1.917%
MetaLex                204M      4933      0.016%
Open-BioMed            79M       931       0.011%
BIO2RDF (KEGG)         50M       1297      2.013%
Semantic Web Dog Food  0.24M     193       62.4%

Page 18: SampLD, Structural Properties as Proxy for Semantic Relevance
Page 19: SampLD, Structural Properties as Proxy for Semantic Relevance

RESULTS

[Figure: recall (y-axis, 0.0 to 1.0) plotted against sample size (x-axis, 0.0 to 1.0). Legend:]
BIO2RDF - UniqueLiterals - Outdegree
DBpedia - Path - Pagerank
LinkedGeoData - UniqueLiterals - Outdegree
Metalex - ResourceFrequency
OpenBioMed - WithoutLiterals - Outdegree
SemanticWebDogFood - Path - Pagerank

Page 20: SampLD, Structural Properties as Proxy for Semantic Relevance

OBSERVATIONS

The influence of a single triple

DBpedia
Good: Path + PageRank
Bad: Path + Out Degree
Queries: 2/3 require literals

Other Observations
# properties vs 'Context Literals' rewrite method
# query triple patterns

Page 21: SampLD, Structural Properties as Proxy for Semantic Relevance

CONCLUSION

Scalable pipeline: network analysis algorithms + rewrite methods
Able to evaluate over 15,000 datasets, and 1.4 trillion triples
Number of query sets too limited to learn significant correlations

Topology of the graphs can be used to determine good samples
Mimic semantic relevance through structural properties, without an a-priori notion of relevance

Special thanks to Semantic Web Science Association (SWSA)