Archive integration with RDF

Archive integration at Mattilsynet

Bouvet Tech Meetup 2014-06-11

Lars Marius Garshol, [email protected], http://twitter.com/larsga

Archive integrations

A few systems integrated with the archive
– every integration is expensive and painful

Need many more integrations
– to reduce amount of manual work
– hesitation because of cost

Consequences of integrations
– if the archive is upgraded, all systems must be retested
– the archive slows down integrated systems
– changes to archive structure require rewriting all integrations

[Diagram: the archive (Arkiv) and the surrounding systems: regulations (Regelverk), domain systems #1 and #2 (Fagsystem), web sites (Nettsider), recruitment (Rekruttering), and the quality system (Kvalitetssystemet)]

WebCruiter integration

Very simple project
– integrate WebCruiter with ePhorte

Doing it with RDF because
– it’s much easier and cheaper
– want to extend to more integrations later
– first step toward new architecture

Good example project
– because it’s so simple

SESAM principles

Base everything on RDF and SDShare feeds
– dynamic flows of structured data

Extracts from data sources do not map to a common model
– instead, extract data as they are in the source
– later translate to representation needed by consumers
– this way, changes in source or target do not spill over to the other

No hard bindings from code to data model
– code should have no knowledge of the data model
– all data model-specific logic should be configuration
– makes data changes much easier to handle

RDF?

W3C standard
– for interchange of structured data
– has query language, schema languages, formats, ...

Essentially a graph database
– known as a triple store
– like Neo4j or similar
– but standardized
– and with many extra features

Note that databases are schemaless
– so this is NoSQL
– powerful query language with SPARQL
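To make the "graph of triples" idea concrete, here is a minimal sketch in plain Python. It is not a real triple store and does not use SPARQL; the `ex:` identifiers, the sample data, and the helper names `match` and `author_names` are all made up for illustration.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, property, value)

# Hypothetical example data; none of these identifiers come from a real system.
TRIPLES = {
    ("ex:doc1", "rdf:type", "ex:Application"),
    ("ex:doc1", "ex:title", "Søknad om betalingsutsettelse"),
    ("ex:doc1", "ex:author", "ex:user123"),
    ("ex:user123", "ex:name", "Example Person"),
}


def match(triples: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for triple in triples:
        if all(want is None or want == got
               for want, got in zip((s, p, o), triple)):
            yield triple


def author_names(triples, doc):
    """Follow doc -> ex:author -> ex:name, a two-step walk through the graph."""
    return sorted(
        name
        for _, _, author in match(triples, s=doc, p="ex:author")
        for _, _, name in match(triples, s=author, p="ex:name")
    )
```

Because there is no schema, a new property can be added simply by adding a triple that uses it, for example `TRIPLES | {("ex:doc1", "ex:deadline", "2014-06-30")}`; nothing else needs to change.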

Architecture

[Architecture diagram. Components: WebCruiter WS, XML in files, SDShare (several instances), translation (Oversettelse, twice), RDF store, ePhorte adapter, ePhorte, bus. Connections: HTTP POST, SPARQL Update, external call. Boxes in orange are Sesam components.]

SDShare

A protocol for tracking changes in a data source
– essentially allows clients to keep track of all changes, for replication purposes
– based on Atom and REST

Data source can be anything
– triple store
– relational database
– XML files on disk
– ...

Data flows as RDF
– not an absolute must, but it’s how we do things

A CEN specification
– http://sdshare.org
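Since SDShare is based on Atom, the client side of change tracking can be sketched as "read a feed, keep the entries newer than the last one seen". The sketch uses only plain Atom elements (`entry`, `updated`, `link`) and leaves out the SDShare-specific elements entirely; the sample feed, its URLs, and the function names are assumptions for illustration, not taken from the specification.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical fragment feed; a real SDShare feed carries more metadata.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Fragments (hypothetical example)</title>
  <entry>
    <id>urn:example:fragment:1</id>
    <updated>2014-06-10T09:00:00Z</updated>
    <link href="http://example.org/fragment?id=1"/>
  </entry>
  <entry>
    <id>urn:example:fragment:2</id>
    <updated>2014-06-11T15:30:00Z</updated>
    <link href="http://example.org/fragment?id=2"/>
  </entry>
</feed>"""


def parse_time(text):
    """Parse a UTC timestamp of the form 2014-06-10T09:00:00Z."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def new_fragments(feed_xml, since):
    """Return (updated, href) pairs for entries changed after `since`, oldest first."""
    root = ET.fromstring(feed_xml)
    found = []
    for entry in root.findall(ATOM + "entry"):
        updated = parse_time(entry.findtext(ATOM + "updated"))
        if updated > since:
            href = entry.find(ATOM + "link").get("href")
            found.append((updated, href))
    return sorted(found)
```

A client would remember the timestamp of the newest entry it has processed and pass it as `since` on the next poll, so each change is fetched only once.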

Basic workings

Server publishes fragments representing changes in the data store

Client pulls these in, updates local copy of dataset

[Diagram: fragments flowing from server to client]
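How the client updates its local copy can be sketched as follows, assuming that a fragment carries the complete current set of statements about one changed resource, so the client replaces whatever it knew about that resource. That replacement rule, the function name `apply_fragment`, and the sample identifiers are assumptions made for the sketch.

```python
def apply_fragment(local, resource, fragment):
    """Return a new local dataset with `resource` brought up to date.

    local    -- set of (subject, property, value) triples held by the client
    resource -- identifier of the resource the fragment describes
    fragment -- iterable of triples; empty means the resource was deleted
    """
    # Drop everything previously known about the resource...
    kept = {triple for triple in local if triple[0] != resource}
    # ...and add the statements from the fragment.
    return kept | set(fragment)


# Hypothetical local copy before the update.
local_copy = {
    ("ex:doc1", "ex:title", "Old title"),
    ("ex:doc1", "ex:author", "ex:user123"),
    ("ex:doc2", "ex:title", "Unrelated document"),
}

updated_copy = apply_fragment(
    local_copy,
    "ex:doc1",
    [("ex:doc1", "ex:title", "New title")],
)
```

Replacing rather than merging means removed values (here the author of `ex:doc1`) disappear from the copy too, which is what replication needs.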

From WebCruiter to triple store

[Diagram: XML adapter (SDShare server) → fragments → SDShare client → triple store]

On the server:
• XPath queries to map to RDF

On the client:
• Two URLs
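The server-side idea of mapping XML to RDF with XPath can be sketched with the limited XPath support in Python's standard library. The XML layout, the vocabulary URIs, and the shape of the `MAPPING` dictionary are invented for illustration; the actual adapter's configuration format is not shown on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; the real WebCruiter XML is not shown here.
SAMPLE_XML = """<applications>
  <application id="384192">
    <title>Søknad om betalingsutsettelse</title>
    <author>123</author>
  </application>
</applications>"""

# The mapping is pure configuration: XPath expressions on the left,
# RDF property URIs (made up for this sketch) on the right.
MAPPING = {
    "records": "./application",
    "subject": "http://example.org/application/{id}",
    "properties": {
        "./title": "http://example.org/vocab/title",
        "./author": "http://example.org/vocab/author",
    },
}


def xml_to_triples(xml_text, mapping):
    """Turn each matched record into (subject, property, value) triples."""
    root = ET.fromstring(xml_text)
    triples = []
    for record in root.findall(mapping["records"]):
        subject = mapping["subject"].format(id=record.get("id"))
        for xpath, prop in mapping["properties"].items():
            for node in record.findall(xpath):
                if node.text:
                    triples.append((subject, prop, node.text.strip()))
    return triples
```

Supporting a new field then means adding one line to `MAPPING["properties"]`, with no change to the function.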

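Translation of metadata

Application (source):
Title: Søknad om betalingsutsettelse ("Application for payment deferral")
Process: 384192
Author: 123
Customer: 789

Translator (Oversetter)

Archive (target):
Tittel (title): Søknad om betalingsutsettelse
Sak (case): 485283
Ansvarlig (responsible): 456
Kontakt (contact): 987
Doktype (document type): I
Arkivdel (archive part): 17

[Diagram: identifier values 123, xyz, 456, 789, and 987, labelled Application, Archive, and Active Directory]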

How the mapping works

Standard RDF vocabulary
– mapping between properties
– traversing properties to add values
– uses owl:sameAs to map values

Java implementation
– called metadata-translator (~500 LOC)
– uses very simple SDShare push protocol
– writes translated data to Virtuoso

Supports multiple mappings
– configured using graphs so we know which properties and values to translate to
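The two core steps, mapping properties and mapping values through owl:sameAs, can be sketched like this. It is a Python illustration, not the Java metadata-translator; the `app:` and `arkiv:` prefixes, the dictionary-based configuration, and the rule that unmapped resources are dropped are assumptions. The numeric identifiers are the ones from the translation example; constant fields such as Doktype and Arkivdel are left out.

```python
# Property-to-property mapping (hypothetical vocabulary names).
PROPERTY_MAP = {
    "app:title": "arkiv:tittel",
    "app:process": "arkiv:sak",
    "app:author": "arkiv:ansvarlig",
    "app:customer": "arkiv:kontakt",
}

# Value mapping, standing in for owl:sameAs statements in the triple store.
SAME_AS = {
    "app:process/384192": "arkiv:sak/485283",
    "app:user/123": "arkiv:person/456",
    "app:customer/789": "arkiv:kontakt/987",
}


def translate(triples, property_map, same_as, source_prefix="app:"):
    """Translate source triples into the target vocabulary."""
    out = set()
    for s, p, o in triples:
        target_p = property_map.get(p)
        if target_p is None:
            continue  # no mapping for this property
        if o in same_as:
            target_o = same_as[o]  # resource with a known equivalent
        elif o.startswith(source_prefix):
            continue  # resource without an equivalent: information is lost
        else:
            target_o = o  # plain literal, copied unchanged
        out.add((same_as.get(s, s), target_p, target_o))
    return out


SOURCE = {
    ("app:doc/1", "app:title", "Søknad om betalingsutsettelse"),
    ("app:doc/1", "app:process", "app:process/384192"),
    ("app:doc/1", "app:author", "app:user/123"),
    ("app:doc/1", "app:customer", "app:customer/789"),
}
```

Because both tables are data, supporting another target system would be a matter of supplying a different pair of mappings to the same function.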

What’s to be mapped?

Department cannot be mapped
– structure in WebCruiter added manually

Users cannot be mapped, either
– no common key
– solved using Duke

Department can be defaulted
– in the cases where we know the user

[Diagram: WebCruiter and ePhorte]
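When two systems share no common key, users have to be matched on their attributes instead. The following is a deliberately naive sketch of that idea using name-token overlap; it is not Duke's API or algorithm (Duke is considerably more sophisticated), and the names, identifiers, and threshold are invented.

```python
def name_tokens(name):
    """Lower-case a name and split it into a set of words, ignoring commas."""
    return set(name.lower().replace(",", " ").split())


def similarity(a, b):
    """Jaccard overlap of the name tokens: 1.0 means the same set of words."""
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def link_users(source, target, threshold=0.8):
    """Emit owl:sameAs links for source users whose best match is good enough."""
    links = []
    for source_id, source_name in source.items():
        best = max(target.items(),
                   key=lambda item: similarity(source_name, item[1]),
                   default=None)
        if best is not None and similarity(source_name, best[1]) >= threshold:
            links.append((source_id, "owl:sameAs", best[0]))
    return links


# Hypothetical users in the two systems.
WEBCRUITER_USERS = {"webcruiter:123": "Kari Nordmann", "webcruiter:124": "Per Hansen"}
EPHORTE_USERS = {"ephorte:456": "Nordmann, Kari", "ephorte:457": "Ola Nordmann"}
```

The resulting links are exactly the kind of owl:sameAs statements the translation step needs in order to map user identifiers between systems.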

Data transfer to translation

Simply write SPARQL queries to
– produce fragment feed (based on timestamps)
– produce a fragment (trivial)
– produce a snapshot (trivial)

Then configure SDShare client
– just requires two URLs
– translation receives an HTTP POST with the fragment, then does its job
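The three queries might look roughly like the strings built here. The choice of `dcterms:modified` as the timestamp property, the function names, and the decision to build the queries from Python are assumptions for the sketch; the queries actually used in the project are not shown on the slide.

```python
from datetime import datetime


def _check_iri(iri):
    """Reject characters that are not allowed inside <...> in SPARQL."""
    if any(ch in iri for ch in '<>" {}|\\^`'):
        raise ValueError("not a usable IRI: %r" % iri)
    return iri


def fragment_feed_query(since_iso,
                        timestamp_property="http://purl.org/dc/terms/modified"):
    """List resources changed after `since_iso` (e.g. 2014-06-10T12:00:00Z)."""
    datetime.strptime(since_iso, "%Y-%m-%dT%H:%M:%SZ")  # validates the format
    return (
        "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n"
        "SELECT ?resource ?modified WHERE {\n"
        f"  ?resource <{_check_iri(timestamp_property)}> ?modified .\n"
        f'  FILTER (?modified > "{since_iso}"^^xsd:dateTime)\n'
        "}\n"
        "ORDER BY ?modified"
    )


def fragment_query(resource):
    """Fetch every statement about one resource: the 'trivial' fragment."""
    iri = _check_iri(resource)
    return f"CONSTRUCT {{ <{iri}> ?p ?o }} WHERE {{ <{iri}> ?p ?o }}"


def snapshot_query():
    """Fetch the whole dataset: the 'trivial' snapshot."""
    return "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }"
```

Only the feed query depends on timestamps, which fits the slide's remark that the fragment and snapshot queries are trivial.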

ePhorte adapter

Receives RDF
– introspects the RDF and translates to Java API
– Java API is stubs calling SOAP services

Given <foo> rdf:type <.../MyClass>
– it looks up the Java class “MyClass” then instantiates

Then, given <foo> <.../prop> “value”
– it looks up method “setProp” on MyClass
– calls object.setProp(“value”)

That’s it
– requires translation to produce RDF exactly aligned with Java API
– means there’s no code

https://github.com/Mattilsynet/arkivgrensesnitt
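The introspection step can be illustrated with a Python analogue of what the slide describes for Java. This is not the code in the repository: the class registry, the `MyClass` stub, the example URIs, and the helper names are assumptions, and a real stub would call a SOAP service rather than store the value.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class MyClass:
    """Stand-in for a generated API stub (hypothetical)."""

    def __init__(self):
        self.prop = None

    def setProp(self, value):
        self.prop = value


# Classes the adapter is allowed to instantiate, keyed by local name.
REGISTRY = {"MyClass": MyClass}


def local_name(uri):
    """Return the last path or fragment segment of a URI."""
    return uri.rstrip("/").rsplit("/", 1)[-1].rsplit("#", 1)[-1]


def build_objects(triples):
    """Instantiate one object per typed subject, then call its setters."""
    objects = {}
    # First pass: <foo> rdf:type <.../MyClass> -> look up the class, instantiate.
    for s, p, o in triples:
        if p == RDF_TYPE:
            objects[s] = REGISTRY[local_name(o)]()
    # Second pass: <foo> <.../prop> "value" -> find setProp and call it.
    for s, p, o in triples:
        if p != RDF_TYPE and s in objects:
            name = local_name(p)
            setter = getattr(objects[s], "set" + name[0].upper() + name[1:])
            setter(o)
    return objects


EXAMPLE = [
    ("http://example.org/foo", RDF_TYPE, "http://example.org/api/MyClass"),
    ("http://example.org/foo", "http://example.org/api/prop", "value"),
]
```

Nothing in `build_objects` mentions a particular class or property, which is why the RDF has to line up exactly with the API's class and method names.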

Configuration

[Diagram: the architecture diagram again (WebCruiter WS, XML in files, SDShare, translation, RDF store, ePhorte adapter, ePhorte, bus), annotated with the configuration used along the way: XPath mapping, RDF mapping, SQL queries, SPARQL queries]

Look, ma, no code! Or rather: not much code!

Properties

Adding more object types or properties is simple
– we just extend the mapping (and maybe queries)

Data quality improves with more data
– if we don’t have the data to translate employees, that information gets lost
– if the necessary mapping is added later, translation improves automagically

Adding more systems is very easy
– requires more SDShare feeds plus mappings

The public journal problem

[Diagram: network zones Internet, DMZ, and Secure zone, with the components Oracle, ePhorte, and Journalapp (the public journal application)]

The public journal solution

[Diagram: the same zones and components, now with a second Oracle database and an RDF store, connected by SDShare feeds, one of them filtered]

Conclusion

Relatively small project, not that many hours
– includes writing reusable ephorte-adapter
– parts of writing the metadata translator, too
– also the XML adapter
– system documentation
– automated deploy system based on Jenkins

Flexible, simple solution
– most of it reusable
– actually captures, as a side-effect, information not available in any other system

Questions?
