Distributed Collaboration on RDF Datasets Using Git: Towards the Quit Store

Natanael Arndt, Norman Radtke and Michael Martin

SEMANTiCS 2016, Leipzig

September 14, 2016



Problem & Motivation

[Figure: the Public LOD Cloud (Linked Datasets as of August 2014) is cloned into an Enterprise Workspace and enriched there]

Remark (Co-Evolution): The process of datasets evolving simultaneously and separately from each other while influencing each other's evolution

Usage of public LOD as background knowledge
Mobile use cases
In distributed collaboration on RDF datasets

⇒ Support for multiple versions of the same dataset at the same time

Approach

The same problem exists for source code repositories
For around 10 years, distributed version control systems have been solving this problem
Multiple working copies exist at the same time and can be synchronized

[Figure: network of five Server/Client nodes]

Git is successful in software development
We have decided to see if this also works for RDF
So we have put RDF into the repositories

Methodology

[Figure: the Quit Store, consisting of an RDF Quad Store and a SPARQL 1.1 Interface (Query & Update)]

Read/write interface
Translating read/write operations to versioning
Synchronizes the store with the current working copy

Methodology: Serialization of RDF data

Multiple RDF serialization formats are available
For the versioning with Git we need:
Same RDF graph = same representation
Minimal difference between versions
Meaningful difference between versions

⇒ We have chosen a canonicalized N-Quads serialization
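The three requirements above can be illustrated with a minimal sketch. It assumes that canonicalization amounts to writing one statement per line, dropping duplicates and sorting the lines; the function name `canonicalize_nquads` is hypothetical and not taken from the Quit Store code, and whitespace inside a statement is not normalized here.

```python
def canonicalize_nquads(text: str) -> str:
    """Return one statement per line, without duplicates, in sorted order.

    Assumption: the input is already valid N-Quads, so every non-empty
    line holds exactly one statement.
    """
    statements = {line.strip() for line in text.splitlines() if line.strip()}
    return "".join(statement + "\n" for statement in sorted(statements))


# Two serializations of the same graph: different order, one duplicate line.
graph_a = (
    '<urn:ex:Tilia> <urn:ex:label> "Tilia"@en <urn:ex:graph> .\n'
    '<urn:ex:Tilia> <urn:ex:label> "Linde"@de <urn:ex:graph> .\n'
)
graph_b = (
    '<urn:ex:Tilia> <urn:ex:label> "Linde"@de <urn:ex:graph> .\n'
    "\n"
    '<urn:ex:Tilia> <urn:ex:label> "Tilia"@en <urn:ex:graph> .\n'
    '<urn:ex:Tilia> <urn:ex:label> "Linde"@de <urn:ex:graph> .\n'
)

# Same graph, same representation: the canonical files are byte-identical.
same_representation = canonicalize_nquads(graph_a) == canonicalize_nquads(graph_b)
print(same_representation)
```

Because every statement occupies its own line at a stable position, a line-based diff between two such files touches only the statements that actually changed.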

Methodology: Blank Nodes in Versioning

With RDF as exchange format, blank nodes are still a problem
Blank node identifiers only have a local scope
… are not persistent or portable identifiers
… are purely an artifact of the serialization
We follow the recommendation of RDF 1.1 to replace blank nodes with IRIs

([Cyganiak et al., 2014] sections 3.4 and 3.5)
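Replacing blank nodes with IRIs could look roughly like the following sketch. It assumes that statements are available as tuples of already serialized terms and that blank nodes are written as `_:label`; the function `skolemize` and the authority `https://example.org` are hypothetical, while the `/.well-known/genid/` path is the one RDF 1.1 suggests for such replacement IRIs.

```python
import uuid

# Hypothetical authority; only the "/.well-known/genid/" path comes from RDF 1.1.
GENID_PREFIX = "https://example.org/.well-known/genid/"


def skolemize(statements, prefix=GENID_PREFIX):
    """Replace every blank node label ("_:x") with a fresh, globally unique IRI."""
    mapping = {}  # blank node label -> IRI, so that repeated labels stay linked
    result = []
    for statement in statements:
        new_terms = []
        for term in statement:
            if term.startswith("_:"):
                if term not in mapping:
                    mapping[term] = f"<{prefix}{uuid.uuid4().hex}>"
                term = mapping[term]
            new_terms.append(term)
        result.append(tuple(new_terms))
    return result


statements = [
    ("<urn:ex:Tilia>", "<urn:ex:location>", "_:b0"),
    ("_:b0", "<urn:ex:label>", '"Leipzig"@en'),
]
skolemized = skolemize(statements)
for statement in skolemized:
    print(" ".join(statement), ".")
```

The generated IRIs are stable once written to the repository, so later versions can refer to the same node across commits and clones.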

Methodology: Read/Write Interface

SPARQL 1.1 Query and Update
Query proxy providing a SPARQL endpoint
Executes queries on the store
Triggers read or write operations on the versioning layer
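How a proxy might decide between a read and a write operation can be sketched as follows. This is an illustration only, not the Quit Store's query analyzer: the function `is_update` is hypothetical, it inspects just the first keyword after the `BASE`/`PREFIX` prologue, and it does not handle SPARQL comments.

```python
import re

# Keywords that can start a SPARQL 1.1 Update operation.
UPDATE_KEYWORDS = {
    "INSERT", "DELETE", "LOAD", "CLEAR", "CREATE",
    "DROP", "COPY", "MOVE", "ADD", "WITH",
}

# A single BASE or PREFIX declaration that may precede the operation.
PROLOGUE = re.compile(
    r"\s*(?:BASE\s*<[^>]*>|PREFIX\s*[^\s:]*:\s*<[^>]*>)", re.IGNORECASE
)


def is_update(sparql: str) -> bool:
    """Return True if the request starts with an update keyword."""
    rest = sparql
    # Skip all prologue declarations.
    while True:
        match = PROLOGUE.match(rest)
        if match is None:
            break
        rest = rest[match.end():]
    first = re.match(r"\s*([A-Za-z]+)", rest)
    return first is not None and first.group(1).upper() in UPDATE_KEYWORDS


requests = [
    "SELECT * WHERE { ?s ?p ?o }",
    "PREFIX ex: <urn:ex:> INSERT DATA { ex:Tilia a ex:Tree }",
]
for request in requests:
    operation = "write" if is_update(request) else "read"
    print(operation)
```

A write result would then be the point at which the versioning layer is asked to record a new commit, while a read is answered from the store alone.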

Methodology: Translating Read/Write Operations

SPARQL read/write operations are transformed to commit, merge, revert, push and pull

Methodology: Commit

Is triggered by UPDATE queries
The changed graphs are added and committed in a new Git commit
A commit contains the lines (i.e. statements) that were added or removed

[Figure: commit graph with commits A and B]
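The added and removed statements of a commit can be pictured as two sets. The sketch below assumes one statement per line, as in the canonicalized serialization; Git itself computes a line-based diff, so the set difference used here is a simplification, and `diff_statements` is a hypothetical helper name.

```python
def diff_statements(old: str, new: str):
    """Return (added, removed) statements between two serialized versions."""
    old_set = set(old.splitlines())
    new_set = set(new.splitlines())
    return new_set - old_set, old_set - new_set


# Statements in the abbreviated notation used on the slides.
version_a = (
    "<urn:ex:Tilia> a <urn:ex:Tree> .\n"
    '<urn:ex:Tilia> <urn:ex:label> "Linda"@de .\n'
)
version_b = (
    "<urn:ex:Tilia> a <urn:ex:Tree> .\n"
    '<urn:ex:Tilia> <urn:ex:label> "Linde"@de .\n'
)

added, removed = diff_statements(version_a, version_b)
print("added:", added)
print("removed:", removed)
```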

A commit always refers to its predecessor, not vice versa
We can also create two commits with the same predecessor
Branching/Forking

[Figure: commit graph with commits A, B, C and a branch commit D]

Methodology: Merge

If the commits have diverged, we need to synchronize the versions
Create a commit with two predecessors
Still, we need to actually consolidate the graphs

[Figure: commit graph with commits A, B, C, the branch commit D and the merge commit E]
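The commit graph described here can be modeled with a small data structure in which every commit knows only its predecessors. The sketch below is illustrative: the `Commit` class and the `ancestors` helper are hypothetical, and it assumes that D branches off from B, which the transcript does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    name: str
    parents: tuple = ()  # predecessors; a merge commit has two of them


def ancestors(commit):
    """Return the names of the commit and of everything reachable via parents."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current.name not in seen:
            seen.add(current.name)
            stack.extend(current.parents)
    return seen


a = Commit("A")
b = Commit("B", (a,))
c = Commit("C", (b,))
d = Commit("D", (b,))    # assumption: C and D share the predecessor B
e = Commit("E", (c, d))  # merge commit with two predecessors

# The history shared by both branches is what a three-way merge starts from.
common = ancestors(c) & ancestors(d)
print(sorted(common))
```

Since references only point backwards, creating D or E never requires changing an existing commit.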

Using the default three-way merge from Git
On the syntactical level, Git produces conflicts
But actually there is no conflict
Conflicts have to be looked for on other levels

Base:

<urn:ex:Tilia> a <urn:ex:Tree> .
<urn:ex:Tilia> <urn:ex:age> "1000"^^xsd:integer .
<urn:ex:Tilia> <urn:ex:label> "Linda"@de .
<urn:ex:Tilia> <urn:ex:label> "Tilia"@en .

Branch A:

<urn:ex:Tilia> a <urn:ex:Tree> .
<urn:ex:Tilia> <urn:ex:age> "1000"^^xsd:integer .
+ <urn:ex:Tilia> <urn:ex:height> "40"^^xsd:integer .
<urn:ex:Tilia> <urn:ex:label> "Linda"@de .
<urn:ex:Tilia> <urn:ex:label> "Tilia"@en .

Branch B:

<urn:ex:Tilia> a <urn:ex:Tree> .
<urn:ex:Tilia> <urn:ex:age> "1000"^^xsd:integer .
+ <urn:ex:Tilia> <urn:ex:label> "Linde"@de .
- <urn:ex:Tilia> <urn:ex:label> "Linda"@de .
<urn:ex:Tilia> <urn:ex:label> "Tilia"@en .

Git Merge:

<urn:ex:Tilia> a <urn:ex:Tree> .
<urn:ex:Tilia> <urn:ex:age> "1000"^^xsd:integer .
<<<<<<< HEAD
<urn:ex:Tilia> <urn:ex:height> "40"^^xsd:integer .
=======
<urn:ex:Tilia> <urn:ex:label> "Linda"@de .
<urn:ex:Tilia> <urn:ex:label> "Linde"@de .
>>>>>>> typo
<urn:ex:Tilia> <urn:ex:label> "Tilia"@en .

Merged:

<urn:ex:Tilia> a <urn:ex:Tree> .
<urn:ex:Tilia> <urn:ex:age> "1000"^^xsd:integer .
<urn:ex:Tilia> <urn:ex:height> "40"^^xsd:integer .
<urn:ex:Tilia> <urn:ex:label> "Linde"@de .
<urn:ex:Tilia> <urn:ex:label> "Tilia"@en .
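Why the example contains no real conflict becomes visible when the versions are treated as sets of statements instead of sequences of lines. The following is a sketch of one possible set-based three-way merge, not necessarily the strategy Quit Merge implements; `three_way_merge` is a hypothetical name and the statements are copied from the slides.

```python
def three_way_merge(base, ours, theirs):
    """Merge two versions of a statement set relative to their common base."""
    added = (ours - base) | (theirs - base)
    removed = (base - ours) | (base - theirs)
    return (base - removed) | added


TREE = "<urn:ex:Tilia> a <urn:ex:Tree> ."
AGE = '<urn:ex:Tilia> <urn:ex:age> "1000"^^xsd:integer .'
HEIGHT = '<urn:ex:Tilia> <urn:ex:height> "40"^^xsd:integer .'
LINDA = '<urn:ex:Tilia> <urn:ex:label> "Linda"@de .'
LINDE = '<urn:ex:Tilia> <urn:ex:label> "Linde"@de .'
TILIA = '<urn:ex:Tilia> <urn:ex:label> "Tilia"@en .'

base = {TREE, AGE, LINDA, TILIA}
branch_a = base | {HEIGHT}            # adds the height statement
branch_b = (base - {LINDA}) | {LINDE}  # fixes the typo in the German label

merged = three_way_merge(base, branch_a, branch_b)
for statement in sorted(merged):
    print(statement)
```

The adjacent lines that confuse the line-based merge are independent statements here, so both changes apply cleanly. Detecting conflicts that are meaningful for RDF would need additional checks on top of this.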

Methodology: Revert

Reverting a commit undoes an earlier change
This is done by exchanging the add- and delete-set of statements

[Figure: commit graph with commits A, B and B⁻¹]
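Exchanging the add- and delete-set can be sketched in a few lines. The helpers `apply_change` and `invert` are hypothetical names, and a change is assumed to be a pair of statement sets.

```python
def apply_change(statements, added, removed):
    """Apply a change (add-set and delete-set) to a set of statements."""
    return (statements - removed) | added


def invert(added, removed):
    """Build the reverting change B⁻¹ by swapping add-set and delete-set."""
    return removed, added


state_a = {
    "<urn:ex:Tilia> a <urn:ex:Tree> .",
    '<urn:ex:Tilia> <urn:ex:label> "Linda"@de .',
}
# Change B: replace one label statement with another.
added_b = {'<urn:ex:Tilia> <urn:ex:label> "Linde"@de .'}
removed_b = {'<urn:ex:Tilia> <urn:ex:label> "Linda"@de .'}

state_b = apply_change(state_a, added_b, removed_b)
state_reverted = apply_change(state_b, *invert(added_b, removed_b))
print(state_reverted == state_a)
```

The revert is recorded as a new commit after B, so the history keeps both the change and its undoing.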

Implementation

[Figure: architecture diagram. Components: SPARQL 1.1 Interface, Query-Analyzer, Quad-Store, File References, Local Git Repository, Public Git Repository. Flows: SPARQL Query, Select, Update, Dump to files, Parse files, Response]

Written in Python, using Flask API as HTTP interface and RDFLib for SPARQL and RDF

Integration

Quit Store has the role of managing the repository
Provide the read/write interface
Synchronize the repository and the store

Quit Diff can calculate differences between commits
Trace provenance of statements
Transmit patches to collaborators

Integration & Future Work

Quit Notify can actively inform other clones of updates
This enables distributed setups for collaboration and synchronization

Quit Merge will implement various merge strategies for RDF
Detect conflicts in diverged versions

[Figure: the Quit components: Quit Store (RDF Quad Store, SPARQL 1.1 Interface Query & Update), Quit Diff Δ( , ), Quit Merge and Quit Notify]

Poster

Quit Diff
Natanael Arndt, Norman Radtke
LEDS (Linked Enterprise Data Services)
P 34

Conclusion

With Quit we have presented a methodology for
version control and tracking provenance of contributions,
synchronization: clone, push and pull by other participants, and
distributed collaboration on RDF datasets (gitflow)

Hopefully this can help to utilize the big ecosystem of methodologies and tools around Git

Questions?
Natanael Arndt <[email protected]>

References

Cyganiak, R., Wood, D., and Lanthaler, M. (2014). RDF 1.1 Concepts and Abstract Syntax. https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/