volt - eswc 2016

volt: A Provenance-Producing, Transparent SPARQL Proxy for the On-Demand Computation of Linked Data & its Application to Spatiotemporally Dependent Data

Blake Regalia, Krzysztof Janowicz, and Song Gao

June 2, 2016 - ESWC 2016

STKO Lab, University of California, Santa Barbara, CA, USA

motivation

linked data

Linked Data has successfully provided methods and tools that ease the publication, retrieval, sharing, reuse and integration of rich data across heterogeneous sources on the web.

For these reasons, we have seen a rapid increase of data sources in Linked Open Data as well as an uptake of the involved technologies by organizations in academia, government and industry.

problem

However, there are still several hurdles that keep data consumers from using and applying Linked Data to its full potential.

We believe that these key issues need to be addressed:

▶ data quality, coverage and longevity
▶ background knowledge needed to query distant data
▶ reproducibility of query results and their derived findings
▶ lack of accessible computational capabilities

solution

To address these issues, we propose a computational framework, VOLT (VOLT Ontology and Linked-data Technology), and its proxy.

In this presentation, we:

1. Illustrate the need for computation in Linked Data
2. Introduce the VOLT framework
3. Explain how the VOLT proxy works
4. Examine a case study
5. Demonstrate how the framework generalizes

need for computation

dependent data

How can we use the population density of a place?

$$\text{Area} = \frac{\text{Population}}{\text{Density}} = \frac{8491079\ \text{ppl}}{10755.995322\ \text{ppl}/\text{km}^2} \approx 789.4275\ \text{km}^2$$

dependent data

So, population density reflects dbo:areaLand?

$$\text{Area} = \frac{\text{Population}}{\text{Density}} = \frac{446007\ \text{ppl}}{1100\ \text{ppl}/\text{km}^2} \approx 405.461\ \text{km}^2$$
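The two area calculations above can be checked with a few lines of Python. This is only a sketch: the function name `implied_area_km2` is made up for illustration, and the inputs are simply the population and density figures shown on the slides.

```python
def implied_area_km2(population: float, density_per_km2: float) -> float:
    """Return the area (km^2) implied by a population and a population density."""
    if density_per_km2 == 0:
        raise ValueError("density must be non-zero")
    return population / density_per_km2


# Figures copied from the two slides above.
first = implied_area_km2(8491079, 10755.995322)
second = implied_area_km2(446007, 1100)
print(round(first, 4), round(second, 3))
```

Comparing each implied area against a place's dbo:areaLand and dbo:areaTotal values is what shows which area the published density was actually based on.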

filter out inconsistencies

select ?place ?density ?landAreaErrorKm ?totalAreaErrorKm ?errorMargin ?closerAreaProperty {
    # triple patterns
    ?place dbo:populationDensity ?density ;
        dbo:populationTotal ?population ;
        dbo:areaTotal ?totalArea ;
        dbo:areaLand ?landArea .

    # avoid division by zero, ignore bad values
    filter(?density != 0 && ?landArea != 0 && ?totalArea != 0 && ?population != 0)
    # no duplications
    filter not exists { ?place dbo:populationDensity ?wd . filter(?density != ?wd) }
    filter not exists { ?place dbo:populationTotal ?wp . filter(?population != ?wp) }
    filter not exists { ?place dbo:areaLand ?wla . filter(?landArea != ?wla) }
    # calculate expected area
    bind(?population / ?density as ?expectedAreaKm)
    # convert given area values to km units
    bind(?landArea / 1000000 as ?landAreaKm)
    bind(?totalArea / 1000000 as ?totalAreaKm)
    # compute amount of error in each area property
    bind(abs(?landAreaKm - ?expectedAreaKm) as ?landAreaErrorKm)
    bind(abs(?totalAreaKm - ?expectedAreaKm) as ?totalAreaErrorKm)
    # only show places that have less area towards wrong property
    filter(?totalAreaErrorKm > ?landAreaErrorKm)
    # bind closer area property by which has smaller error
    bind(if(?landAreaErrorKm < ?totalAreaErrorKm, dbo:areaTotal, dbo:areaLand) as ?closerAreaProperty)
    # compute difference among errors
    bind(?landAreaErrorKm - ?totalAreaErrorKm as ?errorMargin)
    # set closer area property value
    ?place ?closerAreaProperty ?closerAreaValue .
    # only show those where the error is less than a fraction of closer area value
    filter(?errorMargin < ?closerAreaProperty / 10)
} order by desc(?errorMargin)

... or, just compute it

select ?place (?population / ?area as ?density) {
    ?place dbo:areaLand ?area ;
        dbo:populationTotal ?population .
    filter(?area != 0)
}

It's more reliable to derive your own population density value on the fly.

A Linked Data consumer expects this property to be reliable: dbo:populationDensity ... but it is inconsistent.

?density := ?population / ?area

By nature, the value of the population density property is derived from other data, so why not just compute it?

framework

framework don’ts

Some things to avoid:

1. requiring data providers to adopt new software
2. not revealing source code of rules to end-user
3. deviating from W3C standards or “reinventing the wheel”

How can we aid the computation of dependent data without violating the philosophies of Linked Open Data?

framework ideals

To encourage adoption of our framework, we want to:

▶ operate ad hoc, without requiring data providers to change anything
▶ be fully transparent about what is being done to data by keeping everything openly available for inspection
▶ conform to existing W3C standards and maintain interoperability

How do we seamlessly integrate an extendable computational engine into the Semantic Web Layer Cake?

transparent proxy

man in the middle... support

We propose a framework that functions as a transparent proxy to any existing SPARQL 1.1 endpoint.

The layers of VOLT

sparql as an api

Take advantage of existing SPARQL grammar to create an API

By using this format,

▶ end-user writes normal SPARQL query
▶ these syntactic patterns match their materialized form
▶ the same query can be reused elsewhere

the volt ontology

The VOLT Ontology serializes program logic in RDF

...
a volt:IfThenElse ;
volt:if [
    a volt:Operation ;
    volt:operator "<"^^volt:Operator ;
    volt:lhs "?lower"^^volt:Variable ;
    volt:rhs 0 ;
] ;
volt:then (
    [
        a volt:Assignment ;
        volt:assign [
            volt:variable "?lower"^^volt:Variable ;
            volt:operator "+="^^volt:Operator ;
            volt:expression 6.283185307179586 ;
        ] ;
    ]
    [
        a volt:Yield ;
        volt:expression [ ... ]
    ]
) ;
...
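Read as ordinary code, the RDF fragment above appears to say: if ?lower is negative, add 6.283185307179586 (that is, 2π) to it. A rough Python equivalent is sketched below, assuming that reading is right. The function name is hypothetical, and the volt:Yield step is not modelled because its expression is elided in the source.

```python
import math

TWO_PI = 6.283185307179586  # literal used in the RDF fragment (equal to 2 * pi)


def normalize_lower(lower: float) -> float:
    """Shift a negative angle by 2*pi, mirroring the volt:IfThenElse fragment."""
    if lower < 0:        # volt:if   -> volt:Operation with operator "<" and rhs 0
        lower += TWO_PI  # volt:then -> volt:Assignment with operator "+="
    # In the fragment a volt:Yield follows the assignment inside volt:then;
    # its expression is elided there, so this sketch just returns the variable.
    return lower


print(normalize_lower(-math.pi / 2))
```

Serializing the same logic as RDF is what lets a client retrieve and inspect it with a plain SPARQL query.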

transparency of procedures

describe ?procedure {
    graph volt:graphs { ?modelGraph a volt:ModelGraph }
    graph ?modelGraph {
        ?procedure rdf:type/rdfs:subClassOf volt:Procedure .
        ?procedure (!</>)+ geo:geometry .
    }
}

Source of procedures remains open and readily accessible

Client may use that capability to:

▶ search for procedures that match some criteria
▶ inspect a procedure to understand its assumptions
▶ copy/modify/redistribute procedures from data providers

reproducibility

Procedures are only invoked if the triple in question does not already exist

▶ Caching spares computation
▶ Provenance ensures reproducibility and invalidation of stale cache
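One way to picture this behaviour is a cache that stores each computed value together with the inputs it was derived from. The Python sketch below is an illustration under that assumption, not VOLT's actual implementation; the class, method and key names are all hypothetical.

```python
from typing import Callable, Dict, Tuple

Inputs = Tuple[float, ...]


class ProvenanceCache:
    """Cache computed values together with the inputs they were derived from."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], Tuple[float, Inputs]] = {}
        self.invocations = 0  # how many times a procedure actually ran

    def get(self, subject: str, predicate: str, inputs: Inputs,
            procedure: Callable[..., float]) -> float:
        key = (subject, predicate)
        cached = self._store.get(key)
        # Reuse the stored value only if it was derived from the same inputs.
        if cached is not None and cached[1] == inputs:
            return cached[0]
        # Missing or stale: invoke the procedure and record its provenance.
        self.invocations += 1
        value = procedure(*inputs)
        self._store[key] = (value, inputs)
        return value


cache = ProvenanceCache()
density = lambda population, area: population / area
a = cache.get("ex:place", "ex:density", (100.0, 4.0), density)  # computed
b = cache.get("ex:place", "ex:density", (100.0, 4.0), density)  # served from cache
c = cache.get("ex:place", "ex:density", (200.0, 4.0), density)  # inputs changed, recomputed
```

Recording the inputs is what makes stale entries detectable: when a source value changes, the stored provenance no longer matches and the value is recomputed.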

Cardinal Directions

diversity

statistics

• 1.15 million places [1] on DBpedia [2]

• ~3.2% of them (36.7k) take part in cardinal direction relations

• 138.8k cardinal direction triples on DBpedia in total

[1] Individuals that are dbo:Place or have geo:geometry
[2] As of DBpedia 2015-10

accuracy

136,964 combinations of geometries [3] among places with cardinal direction relations

Using 8 equal divisions ($\pi/4$ each) of the compass: nearly $1/3$ of all relations are inaccurate

[3] Formatted in Well-Known Text: Geographic coordinates
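To make "8 equal divisions of the compass" concrete, here is a small Python sketch that maps an azimuth to one of eight sectors of width π/4. It is an illustration only: centring each sector on its direction, measuring the azimuth clockwise from north, and the function name are assumptions, not details given in the slides.

```python
import math

DIRECTIONS = ["north", "northeast", "east", "southeast",
              "south", "southwest", "west", "northwest"]


def cardinal_direction(azimuth: float) -> str:
    """Map an azimuth in radians (clockwise from north) to one of 8 sectors."""
    sector_width = math.pi / 4
    # Shift by half a sector so each sector is centred on its direction,
    # then wrap the angle into one full turn.
    shifted = (azimuth + sector_width / 2) % (2 * math.pi)
    # The extra modulo guards against float rounding at the wrap-around point.
    return DIRECTIONS[int(shifted // sector_width) % len(DIRECTIONS)]


print(cardinal_direction(math.pi / 2))
```

A stated relation can then be checked by computing the azimuth between two geometries and comparing the resulting sector with the direction asserted in the data.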

strategy

Enumerating all possible combinations of cardinal direction relations between places with geometries...

$$\binom{951.2\text{k}}{2} > 452 \text{ billion triples}$$

Currently only 1.1 billion triples on English DBpedia, or 8.8 billion triples overall (i.e., globally)
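The size of that enumeration is easy to reproduce. The sketch below assumes that "951.2k" stands for roughly 951,200 places with geometries (a rounded figure) and counts one relation per unordered pair of places; the variable names are illustrative.

```python
import math

places_with_geometry = 951_200  # "951.2k" on the slide, so a rounded figure
unordered_pairs = math.comb(places_with_geometry, 2)  # n choose 2

english_dbpedia_triples = 1_100_000_000  # "1.1 billion" from the slide

print(f"{unordered_pairs:,} pairs of places")
print(f"{unordered_pairs / english_dbpedia_triples:.0f}x the size of English DBpedia")
```

Materializing every pair would dwarf the dataset it describes, which is the argument for computing these relations only when a query asks for them.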

on-demand computation

We tackle these relations using VOLT, only computing triples on-demand.

generalizing

extending volt

The proxy natively handles flow control, scoped variables, operational expressions and SPARQL queries.

For more advanced operations, it also supports interacting with external systems, such as spawning child processes to employ algorithms in libraries, making HTTP requests, reading from and writing to the file system, etc.

postgis

For instance, we developed a VOLT plugin that enables users and developers to call the spatial functions found in PostGIS on their data.

@prefix postgis: <http://postgis.net/functions/> .

postgis:area
postgis:azimuth
postgis:centroid
postgis:closestPoint
postgis:clusterWithin
postgis:contains
postgis:covers
postgis:coveredBy
postgis:crosses
postgis:disjoint
...

With this set of EVT functions combined with the native capabilities of the proxy, we were able to write complex spatial queries that are much shorter and simpler than their GeoSPARQL equivalents.

using geosparql

Suppose we want to compute the sum of populations for counties along the coast using dbo:populationTotal, or dbp:populationTotal if the former is not valid/available.

With GeoSPARQL we can do this with the following query:

# count the population of coastal counties in California
select (sum(?countyPopulation) as ?coastalPopulation) where {
    data:PacificCoast geo:hasGeometry/geo:asWKT ?pacificCoastWkt .
    {
        select ?county (sample(?population) as ?countyPopulation) {
            ?county a yago:CaliforniaCounties .
            ?county geo:hasGeometry/geo:asWKT ?countyWkt .
            filter(regex(?countyWkt, '^(<[^>]*>)?(MULTI)?POLYGON', 'i'))
            filter(geof:sfTouches(?countyWkt, ?pacificCoastWkt))
            {
                ?county dbo:populationTotal ?population .
                filter(isNumeric(?population))
            } union {
                ?county dbp:populationTotal ?population .
                filter(isNumeric(?population))
                filter not exists {
                    ?county dbo:populationTotal ?best_population .
                    filter(isNumeric(?best_population))
                }
            }
        } group by ?county
    }
}

using volt with postgis

Or, we can call specific VOLT procedures to do the heavy lifting for us

# count the population of coastal counties in California
select ?population ?area where {
    {
        select (volt:cluster(?county) as ?setOfCounties) {
            ?county a yago:CaliforniaCounties .
            ?county stko:along data:PacificCoast .
        }
    }

    [] stko:sumOfPlaces [
        input:places ?setOfCounties ;
        input:propertyList (dbo:populationTotal dbp:populationTotal) ;
        output:sum ?population ;
        output:coveredArea ?area ;
    ]
}

volt:cluster acts as an aggregate function that compiles an RDF collection to be used by the surrounding outer-select

stko:along tests for adjacency by using PostGIS functions to deal with sliver polygons

stko:sumOfPlaces computes the sum of a given property (or its fallbacks) among a collection of places with geometries
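The fallback behaviour described for stko:sumOfPlaces can be sketched in plain Python. This is not the procedure's actual implementation: places are modelled as dictionaries, the helper names are hypothetical, and the sample values are made up purely to exercise the code.

```python
from typing import Dict, Iterable, List, Optional

Place = Dict[str, object]  # property name -> value, a stand-in for RDF data


def first_numeric(place: Place, properties: Iterable[str]) -> Optional[float]:
    """Return the first numeric value found among the properties, in order."""
    for prop in properties:
        value = place.get(prop)
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return float(value)
    return None


def sum_of_places(places: List[Place], properties: List[str]) -> float:
    """Sum one value per place, falling back through the property list."""
    total = 0.0
    for place in places:
        value = first_numeric(place, properties)
        if value is not None:  # places without any usable value are skipped
            total += value
    return total


# Made-up sample data: the second place only has a usable fallback value.
sample = [
    {"dbo:populationTotal": 100},
    {"dbo:populationTotal": "n/a", "dbp:populationTotal": 50},
    {},
]
total = sum_of_places(sample, ["dbo:populationTotal", "dbp:populationTotal"])
```

Passing the properties as an ordered list mirrors the input:propertyList argument in the query, where dbo:populationTotal is tried before dbp:populationTotal.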

conclusions

recap

▶ Data that are dependent should be computed to improve quality, coverage, and longevity

▶ Some data are better suited for on-demand computation rather than being pre-computed

▶ Here, we explored its use with spatiotemporal data, but VOLT is generic, and is equally prepared for any domain

▶ The VOLT proxy integrates seamlessly into existing technology and acts fully transparently

Our aim is to empower end-users’ computational abilities, provide the means to inspect how computations are made, and track the provenance of computed data.

thank you!

https://github.com/blake-regalia/volt

blake regalia @ gmail
