Of Sampling and Smoothing: Approximating Distributions over Linked Open Data
DESCRIPTION
Talk at the PROFILES 2014 workshop (co-located with ESWC) on sampling RDF graphs and smoothing techniques for estimating data distributions.
TRANSCRIPT
Institute for Web Science & Technologies – WeST
Of Sampling and Smoothing: Approximating Distributions over
Linked Open Data
Thomas Gottron
May 26th, 2014
PROFILES Workshop, Crete
Thomas Gottron, PROFILES, 26.5.2014: Approximating Distributions over LOD
Distributions over Linked Data
Probability to observe a certain pattern k
Pattern types, with the examples from the slide figure:
• Predicates: foaf:knows
• RDF class types: rdf:type foaf:Person
• Property sets: ?x with sioc:follows, foaf:knows, rdfs:label
• Type sets: ?y with rdf:type foaf:Person, dbpedia:Actor
• ECS: ?z with rdf:type dbpedia:Actor, foaf:knows, foaf:name
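The set-valued patterns can be illustrated with a minimal sketch. It assumes triples are plain (subject, predicate, object) tuples with prefixed names, and that a property set contains every predicate of a subject, including rdf:type. The function name `extract_patterns` and the example data are hypothetical and not taken from the talk.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

def extract_patterns(triples):
    """Collect, per subject, its property set and its type set."""
    property_sets = defaultdict(set)
    type_sets = defaultdict(set)
    for s, p, o in triples:
        # Assumption: every predicate of a subject counts towards its property set.
        property_sets[s].add(p)
        if p == RDF_TYPE:
            type_sets[s].add(o)
    # Frozen sets are hashable, so they can serve as pattern instances (keys).
    return (
        {s: frozenset(ps) for s, ps in property_sets.items()},
        {s: frozenset(ts) for s, ts in type_sets.items()},
    )

# Hypothetical toy data
triples = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "rdf:type", "dbpedia:Actor"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", '"Bob"'),
]
property_sets, type_sets = extract_patterns(triples)
```

Each distinct frozen set is then one pattern instance k whose probability is to be estimated.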
Distributions over Linked Data
Effectively: Estimate a distribution over the pattern instances k1, k2, k3, ..., kn
Applications:
• Query federation
• Data mining
• Schema inferencing
Distributions over Linked Data
Using the entire LOD cloud becomes less and less feasible
Solution: Operate on a sample
Challenges:
• How to sample?
• How to deal with unobserved instances of a pattern?
The resulting distribution over k1, k2, k3, ..., kn is only an approximation!
Sampling Linked Open Data
Data Format
Linked Data as N-Quads: (s p o) c
• Triple (s p o): what is the information?
• Context URI c: where does it come from?
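A minimal sketch of reading one such statement into its four parts. It is a naive splitter, not a full N-Quads parser: it assumes every line carries a context, and that subject, predicate and context contain no whitespace (only the object may, e.g. a literal). The names `Quad` and `parse_nquad` are hypothetical.

```python
from typing import NamedTuple

class Quad(NamedTuple):
    s: str  # subject
    p: str  # predicate
    o: str  # object
    c: str  # context URI

def parse_nquad(line: str) -> Quad:
    """Split one N-Quads line into (s, p, o, c) under the assumptions above."""
    body = line.strip()
    if not body.endswith("."):
        raise ValueError("statement does not end with '.'")
    body = body[:-1].rstrip()
    # Subject and predicate are the first two tokens ...
    s, p, rest = body.split(None, 2)
    # ... the context is the last token, the object is everything in between.
    o, c = rest.rsplit(None, 1)
    return Quad(s, p, o, c)

quad = parse_nquad(
    '<http://example.org/a> <http://xmlns.com/foaf/0.1/name> '
    '"Alice Smith" <http://example.org/doc1> .'
)
```

For real data, a dedicated RDF library is the safer choice.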
Sampling Strategies
• Triple (edge) based sampling: sample statements (s p o)
• Unique subject URI (node) based sampling (USU): sample subjects s
• Context based sampling: sample contexts c
For all sampling approaches: unbiased sampling based on a uniform distribution
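The three strategies can be sketched as follows. This is an illustration under assumptions, not the implementation used in the talk: quads are (s, p, o, c) tuples held in memory, the sample size is a fixed share `rate` of the sampled units, and for USU and context sampling all statements belonging to a chosen subject or context are kept. All function names are hypothetical.

```python
import random

def sample_triples(quads, rate, rng):
    """Triple (edge) based: uniform sample of individual statements."""
    return rng.sample(quads, round(rate * len(quads)))

def sample_by_key(quads, rate, rng, key):
    """Uniformly pick a share of the distinct keys, keep all their statements."""
    keys = sorted({key(q) for q in quads})  # sorted for reproducibility
    chosen = set(rng.sample(keys, round(rate * len(keys))))
    return [q for q in quads if key(q) in chosen]

def sample_subjects(quads, rate, rng):
    """Unique subject URI (node) based: the sampled unit is the subject s."""
    return sample_by_key(quads, rate, rng, key=lambda q: q[0])

def sample_contexts(quads, rate, rng):
    """Context based: the sampled unit is the context URI c."""
    return sample_by_key(quads, rate, rng, key=lambda q: q[3])

# Hypothetical toy data: 4 subjects with 2 statements each, spread over 2 contexts
quads = [
    (f"ex:s{i}", "ex:p", f"ex:o{j}", f"ex:doc{i % 2}")
    for i in range(4)
    for j in range(2)
]
```

Passing an explicit `random.Random` instance as `rng` keeps repeated runs (such as the iterations of an experiment) reproducible.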
Smoothing Distributions
Obtaining a Distribution from an Index
Index: pattern instance → data entries
k1 → d1,1 d1,2 d1,3 ...
k2 → d2,1 d2,2
k3 → d3,1 d3,2 d3,3 ...
...
kn → dn,1 dn,2 dn,3 ...
https://github.com/gottron/lod-index-models
Obtaining a Distribution from an Index
Number of entries per pattern instance:
k1: 4
k2: 2
k3: 10
...
kn: 8
→ Relative frequencies
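As a minimal sketch, relative frequencies follow directly from the counts. The example treats the four counts shown on the slide as if they were the complete index, which is an assumption made only for illustration; `relative_frequencies` is a hypothetical name.

```python
from collections import Counter

def relative_frequencies(counts):
    """Maximum-likelihood estimate: P(k) = count(k) / total count."""
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}

# Counts from the slide, assumed complete for this illustration
counts = Counter({"k1": 4, "k2": 2, "k3": 10, "kn": 8})
probs = relative_frequencies(counts)
```

A pattern instance that never occurs in the index has no entry in `probs`, so its estimated probability is zero.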
Unobserved patterns!
Unobserved pattern instance (e.g. predicate, type sets)
k1: 4 + λ
k2: 2 + λ
k3: 10 + λ
...
kn: 8 + λ
<new>: 0 + λ
Relative frequencies: P(<new>) = 0
Adjusted relative frequencies: P(<new>) > 0
Unobserved patterns!
Unobserved pattern instance (e.g. predicate, type sets)
Lidstone smoothing with parameter λ
Laplace smoothing (add-one) for λ = 1
k1: 4 + λ
k2: 2 + λ
k3: 10 + λ
...
kn: 8 + λ
<new>: 0 + λ
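A minimal sketch of Lidstone smoothing as depicted on the slide: λ is added to every count, including a single `<new>` bucket for unobserved pattern instances, and the result is renormalised, i.e. P(k) = (count(k) + λ) / (N + λ · |K|), with N the total count and |K| the number of buckets. Modelling all unobserved instances as exactly one bucket is an assumption of this sketch, and the four counts are again treated as complete only for illustration.

```python
NEW = "<new>"  # pseudo pattern standing in for unobserved instances (assumption)

def lidstone(counts, lam):
    """Add lam to every count (including the <new> bucket), then normalise."""
    padded = dict(counts)
    padded.setdefault(NEW, 0)
    total = sum(padded.values()) + lam * len(padded)
    return {k: (n + lam) / total for k, n in padded.items()}

counts = {"k1": 4, "k2": 2, "k3": 10, "kn": 8}
laplace = lidstone(counts, 1.0)    # Laplace (add-one) smoothing
lidstone_01 = lidstone(counts, 0.1)
```

With λ = 1 the denominator is 24 + 5 = 29, so the `<new>` bucket receives probability 1/29 instead of 0.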
Evaluation
Experimental Evaluation
Obtain different distributions based on:
Sampling:
• Strategy (triple, USU, context)
• Rate (5% - 90%)
Smoothing:
• Laplace
• Lidstone with λ = 0.5, λ = 0.1 and λ = 0.01
Compare to full data set
10 iterations
Data: Dynamic Linked Data Observatory
• Weekly snapshots, 16M triples (only first snapshot used here)
Comparing Distributions
Information theoretic measure for comparing distributions:
[Formula not preserved in the transcript]
• Cross-entropy of P and Q
• Kullback-Leibler divergence
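Since the slide's formula is lost, the sketch below uses the standard textbook definitions rather than the talk's exact notation: cross-entropy H(P, Q) = -Σ P(k) log2 Q(k), and Kullback-Leibler divergence D(P || Q) = H(P, Q) - H(P). Treating P as the reference distribution and Q as the approximation is an assumption; distributions are plain dicts and all function names are hypothetical.

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum_k P(k) * log2 Q(k); infinite if Q(k) = 0 where P(k) > 0."""
    total = 0.0
    for k, pk in p.items():
        if pk == 0:
            continue  # terms with P(k) = 0 contribute nothing
        qk = q.get(k, 0.0)
        if qk == 0:
            return math.inf
        total -= pk * math.log2(qk)
    return total

def entropy(p):
    """H(P) is the cross-entropy of P with itself."""
    return cross_entropy(p, p)

def kl_divergence(p, q):
    """D(P || Q) = H(P, Q) - H(P), in bits."""
    return cross_entropy(p, q) - entropy(p)

# Hypothetical toy distributions
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
divergence = kl_divergence(p, q)
```

The infinite case shows why smoothing matters here: if Q assigns zero probability to a pattern that occurs under P, the divergence is unbounded.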
Experimental Setup
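Index construction / estimation of distributions on samples of 5%, 10%, 20%, 30%, ..., 90% and on the full data (100%)
"Deviation" of each sample-based distribution (5%, 10%, 20%, 30%, ..., 90%) from the full-data distribution (100%)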
Impact of Sampling Strategy
[Plots: Predicates; RDF class types; Property sets; Type sets]
ECS similar
Impact of Smoothing
[Plots: Predicates, context sampling; Predicates, triple sampling; ECS, context sampling; ECS, USU sampling]
Conclusion
Summary
• Baseline for sampling and smoothing techniques
• Little difference between classical smoothing techniques
• Quality of context-based sampling as realistic scenario
• Other samplings suitable for generating VoID descriptions
Future Work
• Smarter smoothing techniques
• Inspired by language modelling
• Specific for LOD
Thanks!
Contact: Thomas Gottron
WeST – Institute for Web Science and Technologies
Universität Koblenz-Landau