volt - eswc 2016

volt: A Provenance-Producing, Transparent SPARQL Proxy for the On-Demand Computation of Linked Data & its Application to Spatiotemporally Dependent Data Blake Regalia, Krzysztof Janowicz, and Song Gao June 2, 2016 - ESWC 2016 STKO Lab University of California, Santa Barbara, CA, USA 0

Upload: blake-regalia

Post on 19-Feb-2017


TRANSCRIPT

Page 1: VOLT - ESWC 2016

volt: A Provenance-Producing, Transparent SPARQL Proxy for the On-Demand Computation of Linked Data & its Application to Spatiotemporally Dependent Data

Blake Regalia, Krzysztof Janowicz, and Song Gao

June 2, 2016 - ESWC 2016

STKO Lab, University of California, Santa Barbara, CA, USA

Page 2: VOLT - ESWC 2016

motivation

Page 3: VOLT - ESWC 2016

linked data

Linked Data has successfully provided methods and tools that ease the publication, retrieval, sharing, reuse and integration of rich data across heterogeneous sources on the web.

For these reasons, we have seen a rapid increase in data sources in Linked Open Data as well as an uptake of the involved technologies by organizations in academia, government and industry.

Page 4: VOLT - ESWC 2016

problem

However, several hurdles still keep data consumers from using and applying Linked Data to its full potential.

We believe that these key issues need to be addressed:

▶ data quality, coverage and longevity
▶ background knowledge needed to query distant data
▶ reproducibility of query results and their derived findings
▶ lack of accessible computational capabilities

Page 5: VOLT - ESWC 2016

solution

To address these issues, we propose a computational framework, VOLT (VOLT Ontology and Linked-data Technology), and its proxy.

In this presentation, we:

1. Illustrate the need for computation in Linked Data
2. Introduce the VOLT framework
3. Explain how the VOLT proxy works
4. Examine a case study
5. Demonstrate how the framework generalizes

Page 6: VOLT - ESWC 2016

need for computation

Page 7: VOLT - ESWC 2016

dependent data

How can we use the population density of a place?

\[ \text{Area} = \frac{\text{Population}}{\text{Density}} = \frac{8491079\ \text{ppl}}{10755.995322\ \text{ppl}/\text{km}^2} \approx 789.4275\ \text{km}^2 \]

Page 8: VOLT - ESWC 2016

dependent data

So, population density reflects dbo:areaLand?

\[ \text{Area} = \frac{\text{Population}}{\text{Density}} = \frac{446007\ \text{ppl}}{1100\ \text{ppl}/\text{km}^2} \approx 405.461\ \text{km}^2 \]
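The arithmetic on the two "dependent data" slides can be checked with a few lines of Python. This is only a sketch: the function name `implied_area_km2` is hypothetical, and the input values are the population and density figures shown on the slides.

```python
def implied_area_km2(population: float, density_per_km2: float) -> float:
    """Area (km^2) implied by a population count and a density in people per km^2."""
    if density_per_km2 == 0:
        raise ValueError("density must be non-zero")
    return population / density_per_km2


# Values copied from the two slides.
first_place = implied_area_km2(8491079, 10755.995322)
second_place = implied_area_km2(446007, 1100)
print(first_place, second_place)
```

Comparing each result against the published dbo:areaLand and dbo:areaTotal values shows which area the stated density was derived from.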

Page 9: VOLT - ESWC 2016

filter out inconsistencies

select ?place ?density ?landAreaErrorKm ?totalAreaErrorKm ?errorMargin ?closerAreaProperty {
    # triple patterns
    ?place dbo:populationDensity ?density ;
        dbo:populationTotal ?population ;
        dbo:areaTotal ?totalArea ;
        dbo:areaLand ?landArea .

    # avoid division by zero, ignore bad values
    filter(?density != 0 && ?landArea != 0 && ?totalArea != 0 && ?population != 0)
    # no duplications
    filter not exists { ?place dbo:populationDensity ?wd . filter(?density != ?wd) }
    filter not exists { ?place dbo:populationTotal ?wp . filter(?population != ?wp) }
    filter not exists { ?place dbo:areaLand ?wla . filter(?landArea != ?wla) }
    # calculate expected area
    bind(?population / ?density as ?expectedAreaKm)
    # convert given area values to km units
    bind(?landArea / 1000000 as ?landAreaKm)
    bind(?totalArea / 1000000 as ?totalAreaKm)
    # compute amount of error in each area property
    bind(abs(?landAreaKm - ?expectedAreaKm) as ?landAreaErrorKm)
    bind(abs(?totalAreaKm - ?expectedAreaKm) as ?totalAreaErrorKm)
    # only show places that have less area towards wrong property
    filter(?totalAreaErrorKm > ?landAreaErrorKm)
    # bind closer area property by which has smaller error
    bind(if(?landAreaErrorKm < ?totalAreaErrorKm, dbo:areaLand, dbo:areaTotal) as ?closerAreaProperty)
    # compute difference among errors
    bind(?landAreaErrorKm - ?totalAreaErrorKm as ?errorMargin)
    # set closer area property value
    ?place ?closerAreaProperty ?closerAreaValue .
    # only show those where the error is less than a fraction of closer area value
    filter(?errorMargin < ?closerAreaValue / 10)
} order by desc(?errorMargin)

Page 10: VOLT - ESWC 2016

... or, just compute it

select ?place (?population / ?area as ?density) {
    ?place dbo:areaLand ?area ;
        dbo:populationTotal ?population .
    filter(?area != 0)
}

It's more reliable to derive your own population density value on the fly

A Linked Data consumer expects the property dbo:populationDensity to be reliable ... but it is inconsistent

?density := ?population / ?area

Population density is by nature derived from other data, so why not just compute it anyway?
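As a minimal sketch of the same idea outside SPARQL, the density can be derived from population and land area each time it is needed. The function name `population_density` is hypothetical, and the square-metre input unit is an assumption based on the division by 1000000 in the longer query.

```python
from typing import Optional


def population_density(population: float, area_m2: float) -> Optional[float]:
    """People per km^2, or None when the area is zero (mirrors filter(?area != 0))."""
    if area_m2 == 0:
        return None
    # Assumption: the area is given in square metres, so convert to km^2 first.
    area_km2 = area_m2 / 1_000_000
    return population / area_km2
```

Because nothing is stored, the value cannot drift out of sync with the population and area it depends on.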

Page 11: VOLT - ESWC 2016

framework

Page 12: VOLT - ESWC 2016

framework don’ts

Some things to avoid:

1. requiring data providers to adopt new software
2. hiding the source code of rules from the end user
3. deviating from W3C standards or “reinventing the wheel”

How can we aid the computation of dependent data without violating the philosophies of Linked Open Data?

Page 13: VOLT - ESWC 2016

framework ideals

To encourage adoption of our framework, we want to:

▶ operate ad hoc, without requiring data providers to change their setup
▶ be fully transparent about what is being done to data by keeping everything openly available for inspection
▶ conform to existing W3C standards and maintain interoperability

How do we seamlessly integrate an extensible computational engine into the Semantic Web Layer Cake?

Page 14: VOLT - ESWC 2016

transparent proxy

Page 15: VOLT - ESWC 2016

man in the middle... support

We propose a framework that functions as a transparent proxy to any existing SPARQL 1.1 endpoint.

The layers of VOLT

Page 16: VOLT - ESWC 2016

sparql as an api

Take advantage of existing SPARQL grammar to create an API

By using this format,

▶ the end user writes a normal SPARQL query
▶ these syntactic patterns match their materialized form
▶ the same query can be reused elsewhere

Page 17: VOLT - ESWC 2016

the volt ontology

The VOLT Ontology serializes program logic in RDF

...
a volt:IfThenElse ;
volt:if [
    a volt:Operation ;
    volt:operator "<"^^volt:Operator ;
    volt:lhs "?lower"^^volt:Variable ;
    volt:rhs 0 ;
] ;
volt:then (
    [
        a volt:Assignment ;
        volt:assign [
            volt:variable "?lower"^^volt:Variable ;
            volt:operator "+="^^volt:Operator ;
            volt:expression 6.283185307179586 ;
        ] ;
    ]
    [
        a volt:Yield ;
        volt:expression [ ... ]
    ]
) ;
...

Page 18: VOLT - ESWC 2016

transparency of procedures

describe ?procedure {
    graph volt:graphs { ?modelGraph a volt:ModelGraph }
    graph ?modelGraph {
        ?procedure rdf:type/rdfs:subClassOf volt:Procedure .
        ?procedure (!</>)+ geo:geometry .
    }
}

Source of procedures remains open and readily accessible

Client may use that capability to:

▶ search for procedures that match some criteria
▶ inspect a procedure to understand its assumptions
▶ copy/modify/redistribute procedures from data providers

Page 19: VOLT - ESWC 2016

reproducibility

Procedures are only invoked if the triple in question does not already exist

▶ Caching spares computation
▶ Provenance ensures reproducibility and invalidation of stale cache
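The compute-only-if-missing behavior, together with provenance-based invalidation, can be sketched as follows. This is an illustration under stated assumptions, not the proxy's actual implementation: the `OnDemandStore` class, its fields, and the rule that a cached value is stale when its recorded inputs differ from the current ones are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

Key = Tuple[str, str]  # (subject, predicate)


@dataclass
class OnDemandStore:
    values: Dict[Key, float] = field(default_factory=dict)
    provenance: Dict[Key, dict] = field(default_factory=dict)
    calls: int = 0  # how many times a procedure actually ran

    def get(self, subject: str, predicate: str,
            procedure: Callable[..., float], inputs: Dict[str, float]) -> float:
        key = (subject, predicate)
        exists = key in self.values
        # Only values we computed ourselves carry provenance and can go stale.
        stale = (exists and key in self.provenance
                 and self.provenance[key]["inputs"] != inputs)
        if not exists or stale:
            self.calls += 1
            self.values[key] = procedure(**inputs)
            self.provenance[key] = {
                "procedure": procedure.__name__,
                "inputs": dict(inputs),
            }
        return self.values[key]


def density(population: float, area_km2: float) -> float:
    return population / area_km2
```

Recording the procedure name and its inputs next to each computed value is what lets a later request either trust the cached value or recompute it.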

Page 20: VOLT - ESWC 2016

Cardinal Directions


Page 21: VOLT - ESWC 2016

diversity


Page 22: VOLT - ESWC 2016

statistics

• 1.15 million places [1] on DBpedia [2]

• ~3.2% of them (36.7k) take part in cardinal direction relations

• 138.8k cardinal direction triples on DBpedia in total

[1] Individuals that are dbo:Place or have geo:geometry
[2] As of DBpedia 2015-10

Page 23: VOLT - ESWC 2016

accuracy

136,964 combinations of geometries [3] among places with cardinal direction relations

Using 8 equal divisions (π/4 each) of the compass: nearly 1/3 of all relations are inaccurate

[3] Formatted in Well-Known Text: geographic coordinates
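One way to picture the 8 equal divisions of the compass is the sketch below. It rests on assumptions the slides do not state: a planar approximation of the offset between two points, bearings measured clockwise from north, and sectors centered on each named direction. The names `bearing` and `cardinal_direction` are hypothetical.

```python
import math

SECTORS = ["north", "northeast", "east", "southeast",
           "south", "southwest", "west", "northwest"]


def bearing(dx: float, dy: float) -> float:
    """Bearing in radians, clockwise from north, normalized to [0, 2*pi)."""
    if dx == 0 and dy == 0:
        raise ValueError("the two points coincide; no direction is defined")
    angle = math.atan2(dx, dy)
    if angle < 0:
        angle += 2 * math.pi  # same normalization idea: add 2*pi to negative angles
    return angle


def cardinal_direction(dx: float, dy: float) -> str:
    """Name of the pi/4-wide sector that the offset (dx east, dy north) falls into."""
    width = math.pi / 4
    # Shift by half a sector so that each sector is centered on its direction.
    shifted = (bearing(dx, dy) + width / 2) % (2 * math.pi)
    return SECTORS[int(shifted // width) % 8]
```

A stated relation such as "A is north of B" can then be compared against the sector computed from the two geometries.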

Page 24: VOLT - ESWC 2016

strategy

Enumerating all possible combinations of cardinal direction relations between places with geometries...

\[ \binom{951.2\text{k}}{2} > 452\ \text{billion triples} \]

Currently only 1.1 billion triples on English DBpedia, or 8.8 billion triples overall (i.e., globally)
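The size of this enumeration follows from counting unordered pairs of places. A quick check, assuming the rounded figure 951.2k stands for 951,200 places (the exact count is not given on the slide) and using the hypothetical helper name `unordered_pairs`:

```python
import math


def unordered_pairs(n: int) -> int:
    """Number of ways to choose 2 distinct items out of n."""
    return math.comb(n, 2)


# Assumption: "951.2k" rounded to 951,200 places with geometries.
pairs = unordered_pairs(951_200)
print(pairs > 452_000_000_000)
```

One triple per pair already exceeds the quoted size of DBpedia by orders of magnitude, which is the case for computing these relations on demand.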

Page 25: VOLT - ESWC 2016

on-demand computation

We tackle these relations using VOLT, only computing triples on-demand.


Page 26: VOLT - ESWC 2016

generalizing

Page 27: VOLT - ESWC 2016

extending volt

The proxy natively handles flow control, scoped variables, operational expressions and SPARQL queries.

For more advanced operations, it can also reach external systems: spawning child processes to run algorithms from libraries, making HTTP requests, reading from and writing to the file system, etc.

Page 28: VOLT - ESWC 2016

postgis

For instance, we developed a VOLT plugin that enables users and developers to call the spatial functions found in PostGIS on their data

@prefix postgis: <http://postgis.net/functions/> .

postgis:area
postgis:azimuth
postgis:centroid
postgis:closestPoint
postgis:clusterWithin
postgis:contains
postgis:covers
postgis:coveredBy
postgis:crosses
postgis:disjoint
...

With this set of EVT functions combined with the native capabilities of the proxy, we were able to write complex spatial queries that are much shorter and simpler than their GeoSPARQL equivalents

Page 29: VOLT - ESWC 2016

using geosparql

Suppose we want to compute the sum of populations for counties along the coast using dbo:populationTotal, or dbp:populationTotal if the former is not valid/available.

With GeoSPARQL we can do this with the following query:

# count the population of coastal counties in California
select (sum(?countyPopulation) as ?coastalPopulation) where {
    {
        select ?county (sample(?population) as ?countyPopulation) {
            # bind the coast geometry inside the subquery so the filter can see it
            data:PacificCoast geo:hasGeometry/geo:asWKT ?pacificCoastWkt .
            ?county a yago:CaliforniaCounties .
            ?county geo:hasGeometry/geo:asWKT ?countyWkt .
            filter(regex(str(?countyWkt), '^(<[^>]*>)?(MULTI)?POLYGON', 'i'))
            filter(geof:sfTouches(?countyWkt, ?pacificCoastWkt))
            {
                ?county dbo:populationTotal ?population .
                filter(isNumeric(?population))
            } union {
                ?county dbp:populationTotal ?population .
                filter(isNumeric(?population))
                filter not exists {
                    ?county dbo:populationTotal ?best_population .
                    filter(isNumeric(?best_population))
                }
            }
        } group by ?county
    }
}

Page 30: VOLT - ESWC 2016

using volt with postgis

Or, we can call specific VOLT procedures to do the heavy lifting for us

# count the population of coastal counties in California
select ?population ?area where {
    {
        select (volt:cluster(?county) as ?setOfCounties) {
            ?county a yago:CaliforniaCounties .
            ?county stko:along data:PacificCoast .
        }
    }
    [] stko:sumOfPlaces [
        input:places ?setOfCounties ;
        input:propertyList (dbo:populationTotal dbp:populationTotal) ;
        output:sum ?population ;
        output:coveredArea ?area ;
    ]
}

volt:cluster acts as an aggregate function that compiles an RDF collection to be used by the enclosing outer select

stko:along tests for adjacency by using PostGIS functions to deal with sliver polygons

stko:sumOfPlaces computes the sum of a given property (or its fallbacks) among a collection of places with geometries
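The fallback behavior described for stko:sumOfPlaces (use the first property in the list that holds a usable value) can be sketched in Python. This is not the procedure's source: representing places as plain dictionaries and the helper names `first_numeric` and `sum_of_places` are assumptions made for illustration.

```python
from numbers import Real
from typing import Dict, Iterable, Optional, Sequence


def first_numeric(place: Dict[str, object], properties: Sequence[str]) -> Optional[float]:
    """Value of the first listed property that is numeric, or None if there is none."""
    for prop in properties:
        value = place.get(prop)
        if isinstance(value, Real) and not isinstance(value, bool):
            return float(value)
    return None


def sum_of_places(places: Iterable[Dict[str, object]], properties: Sequence[str]) -> float:
    """Sum the preferred property over all places, falling back along the list."""
    total = 0.0
    for place in places:
        value = first_numeric(place, properties)
        if value is not None:  # places without any usable value are skipped
            total += value
    return total
```

Passing `("dbo:populationTotal", "dbp:populationTotal")` as the property list mirrors the input:propertyList in the query above.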

Page 31: VOLT - ESWC 2016

conclusions

Page 32: VOLT - ESWC 2016

recap

▶ Data that are dependent should be computed to improve quality, coverage, and longevity

▶ Some data are better suited for on-demand computation rather than being pre-computed

▶ Here, we explored its use with spatiotemporal data - but VOLT is generic, and is equally prepared for any domain

▶ The VOLT proxy integrates seamlessly into existing technology and acts fully transparently

Our aim is to empower end-users’ computational abilities, provide the means to inspect how computations are made, and track the provenance of computed data.

Page 33: VOLT - ESWC 2016

thank you!

https://github.com/blake-regalia/volt

blake regalia @ gmail
