datagraft: data-as-a-service for open data

DataGraft Data-as-a-Service for Open Data Dumitru Roman [email protected] https://datagraft.net

Upload: dapaasproject

Post on 15-Apr-2017


Page 1: DataGraft: Data-as-a-Service for Open Data

DataGraft

Data-as-a-Service for Open Data

Dumitru Roman [email protected]

https://datagraft.net

Page 2: DataGraft: Data-as-a-Service for Open Data

About me

• Education

– Eng (2003), Technical University of Cluj-Napoca, Romania

– PhD (2008), University of Innsbruck, Austria

• Current positions

– Senior Research Scientist, SINTEF, Norway

– Associate Professor, University of Oslo, Norway

• Expertise and responsibilities

– Initiating, leading, and carrying out (research-intensive) projects on data management and service-oriented topics

– Involved with over 20 large-scale R&D projects at the European level during the past 12 years

2

Page 3: DataGraft: Data-as-a-Service for Open Data

“Technology for a better society”

• Public and private companies

• Data owners

• Data publishers

• Data integrators and aggregators

• Developers

• Improved data access

• Data-driven decision making

• Cost reduction when working with data

• Reduced dependency on generic infrastructure providers (e.g. generic cloud)

• Increase in the speed of making data available

• Increase in the reuse of data

• Data cleaning

• Data transformation

• Data publication

• Data-as-a-Service

• Open data

• Linked data (RDF, SPARQL)

DataGraft

Page 4: DataGraft: Data-as-a-Service for Open Data

4

Page 5: DataGraft: Data-as-a-Service for Open Data

Outline

Session #1: Open Data

• Open Data

• (Open) Data Quality Issues

• Linked (Open) Data

– RDF, RDFS, SPARQL

Session #2: DataGraft

• Data-as-a-Service: DataGraft

• Examples and Demo

• Big Data and DataGraft

• Open Data in Malaysian context (by Dennis Gan)

• (Optional: Hands on)

What is Open Data?

What is Linked Data?

Challenges in (Linked Open) Data?

How to publish Linked Open Data?

Linked Open Data Use Cases?

(Linked) Open Data and Big Data?

Page 6: DataGraft: Data-as-a-Service for Open Data

Open Data

Page 7: DataGraft: Data-as-a-Service for Open Data

What can open data do for you? (Source: The ODI, https://vimeo.com/110800848)

7

Page 8: DataGraft: Data-as-a-Service for Open Data

Open Data

…is changing the nature of business

...reflects a cultural shift to a more open society

8

Page 9: DataGraft: Data-as-a-Service for Open Data

Example: Personalized and Localized Urban Quality Index (PLUQI)

The index includes data from various domains:

Daily life satisfaction: weather, transportation, community, …

Healthcare level: number of doctors, hospitals, suicide statistics, …

Safety and security: number of police stations, fire stations, crimes per capita, …

Financial satisfaction: prices, incomes, housing, savings, debt, insurance, pension, …

Level of opportunity: jobs, unemployment, education, re-education, …

Environmental needs and efficiency: green space, air quality, …

9

Page 10: DataGraft: Data-as-a-Service for Open Data

PLUQI – potential usage

• Place recommendation for travel agencies or travelers

• Policy analysis and optimization for (local) government

• Understanding citizens' voices and demands regarding environmental conservation

• Commercial impact analysis for retailers and franchises

• Location recommendation and understanding local issues for real estate

• Risk analysis and management for insurance and financial companies

• Local marketing and sales force optimization for marketers

10

Page 11: DataGraft: Data-as-a-Service for Open Data

Open Data

• Businesses can develop new ideas, services and applications; improve decision making, cost savings

• Can increase government transparency and accountability, quality of public services

• Citizens get better and timely access to public services

Source: McKinsey http://www.mckinsey.com/insights/business_technology/open_data_unlocking_innovation_and_performance_with_liquid_information

Gartner:

By 2016, the use of "open data" will continue to increase – but slowly, and predominantly limited to Type A enterprises.

By 2017, over 60% of government open data programs that do not effectively use open data internally, will be scaled back or discontinued.

By 2020, enterprises and governments will fail to protect 75% of sensitive data and will declassify and grant broad/public access to it.

Source: Gartner http://training.gsn.gov.tw/uploads/news/6.Gartner+ExP+Briefing_Open+Data_JUN+2014_v2.pdf

Page 12: DataGraft: Data-as-a-Service for Open Data

Lots of open datasets on the Web…

• A large number of datasets have been published as open data in recent years

• Many kinds of data: cultural, science, finance, statistics, transport, environment, …

• Popular formats: tabular (e.g. CSV, XLS), HTML, XML, JSON, …

12

Page 13: DataGraft: Data-as-a-Service for Open Data

…but few actually used

• Few applications utilizing open and distributed datasets at present

• Challenges for data consumers

– Data quality issues

– Difficult or unreliable data access

– Licensing issues

• Challenges for data publishers

– Lack of expertise & resources: not easy to publish & maintain high quality data

– Unclear monetization & sustainability

Open Data Portal    Datasets     Applications
data.gov            ~ 200 000    ~ 80
publicdata.eu       ~ 48 000     ~ 85
data.gov.uk         ~ 31 000     ~ 390
data.norge.no       ~ 620        ~ 60
data.gov.my         ~ 1065       ~ 10

Page 14: DataGraft: Data-as-a-Service for Open Data

Lots of datasets are in tabular format

– Records organized in silos of collections

– Very few links within and/or across collections

– Difficult to understand the nature of the data

– Difficult to integrate / query

14

europeandataportal.eu

Page 15: DataGraft: Data-as-a-Service for Open Data

Tim Berners-Lee's 5-star open data rating system

1 star: Openly available on the web as a document

2 stars: Available in a structured format (XLS)

3 stars: Available in a non-proprietary format (CSV)

4 stars: Uses URIs to denote things

5 stars: Linked to other data to provide context

15

Page 16: DataGraft: Data-as-a-Service for Open Data

1-Star Benefits

Consumers:

Ability to look at, print, store, modify and share data

Ability to use data as input to a system

Publishers:

Easily publish data

Ensure transparency

5-Star Benefits

Consumers:

Discover more (related) data while consuming the data

Directly learn about the data schema

Cost: have to deal with broken data links

Cost: trust issues

Publishers:

Make data discoverable

Increase the value of data

Gain the same benefits from the links as the consumers

Cost: need to invest resources to link data

Cost: may need to clean data

16

Page 17: DataGraft: Data-as-a-Service for Open Data

Tabular Data

• Lots of open datasets are in tabular format

• CSV, Excel, TSV, etc.

• Records organized in silos of collections

• Very few links within and/or across collections

• Difficult to understand the nature of the data

• Difficult to integrate / query

Graph Data

Based on Linked Data

• Method for publishing data on the Web

• Self-describing data and relations

• Interlinking

• Accessed using semantic queries

• Open standards by W3C

− Data format: RDF

− Knowledge representation: RDFS/OWL

− Query language: SPARQL

http://www.w3.org/standards/semanticweb/data

europeandataportal.eu

17

Page 18: DataGraft: Data-as-a-Service for Open Data

Tabular Data

Graph Data

18

Page 19: DataGraft: Data-as-a-Service for Open Data

(Open) Data Quality Issues

Page 20: DataGraft: Data-as-a-Service for Open Data

Tabular data

Tabular data is data that is structured into rows and columns

Correspondence with reality:

1) Each row represents an entity

2) Each column header represents an attribute of the entity

3) Each column value represents a value of that attribute

4) Each table represents a collection of entities
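The correspondence above can be sketched in a few lines of Python. This is only an illustration: the CSV content and its column names (make, year, number) are made up for the example, and the standard `csv` module stands in for whatever tool is actually used.

```python
import csv
import io

# Hypothetical CSV text; the header row names the attributes.
text = "make,year,number\nVolvo,2015,120\nTesla,2016,45\n"

# The table is a collection of entities: one dict per row.
table = list(csv.DictReader(io.StringIO(text)))

# Column headers are the attributes shared by all entities.
attributes = list(table[0].keys())

# Each cell is the value of one attribute for one entity.
first_make = table[0]["make"]
```

Note that every value is read as a string; interpreting "2015" as a year or "120" as a count is already a data-cleaning decision.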

20

Page 21: DataGraft: Data-as-a-Service for Open Data

Tabular data files

Tabular data can be stored in different formats:

Tabular text formats (pure tabular data)

Delimiter-separated values:

- CSV – comma-separated values

- Less common: TSV (tab-separated values), colon-separated values, etc.

Spreadsheet formats (metadata about the document, tabular data, formulas)

- XLS (Excel spreadsheet)

- XLSX (Excel 2007 format)

21

Page 22: DataGraft: Data-as-a-Service for Open Data

Tabular data quality issues

When a dataset does not satisfy specified data quality criteria, it means that it contains data quality issues.

In order to provide higher data quality, these quality issues should be detected and removed.

22

Page 23: DataGraft: Data-as-a-Service for Open Data

Types of quality issues

23

Page 24: DataGraft: Data-as-a-Service for Open Data

Types of quality issues

24

Page 25: DataGraft: Data-as-a-Service for Open Data

Types of quality issues

25

Page 26: DataGraft: Data-as-a-Service for Open Data

What types of data quality issues can occur?

26

Page 27: DataGraft: Data-as-a-Service for Open Data

Types of quality issues

27

Page 28: DataGraft: Data-as-a-Service for Open Data

Types of quality issues

Actual information model:

order

street

house

28

Page 29: DataGraft: Data-as-a-Service for Open Data

Types of quality issues

Actual information model:

order → has address → address

Page 30: DataGraft: Data-as-a-Service for Open Data

Types of quality issues

30

Page 31: DataGraft: Data-as-a-Service for Open Data

Types of quality issues

Data model:

observation → has make → make

Page 32: DataGraft: Data-as-a-Service for Open Data

Types of quality issues

Data model:

observation

make

year

number

Page 33: DataGraft: Data-as-a-Service for Open Data

Summary of data quality issues

33

Page 34: DataGraft: Data-as-a-Service for Open Data

How to resolve data quality issues?

Workflow:

1) Identify data quality issues

2) Define transformation functions to resolve them

3) Execute transformation and verify the result
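The three workflow steps can be sketched as follows. This is a minimal illustration, not DataGraft's implementation: the sample rows, the function names and the rule "a count must be a plain integer" are all assumptions chosen for the example.

```python
# Hypothetical sample: vehicle observations with an illegal and a missing value.
rows = [
    {"make": "Volvo", "year": "2015", "number": "120"},
    {"make": "Volvo", "year": "2016", "number": "n/a"},
    {"make": "Tesla", "year": "2016", "number": ""},
]

def find_issues(rows):
    """Step 1: list (row index, column) pairs whose value is not a plain integer."""
    return [(i, "number") for i, r in enumerate(rows) if not r["number"].isdigit()]

def clean_number(value):
    """Step 2: transformation function; non-numeric values become None."""
    return int(value) if value.isdigit() else None

def transform(rows):
    """Step 3a: execute the transformation on every row (input is not mutated)."""
    return [{**r, "number": clean_number(r["number"])} for r in rows]

issues = find_issues(rows)
cleaned = transform(rows)

# Step 3b: verify that every value is now an int or an explicit None.
verified = all(r["number"] is None or isinstance(r["number"], int) for r in cleaned)
```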

34

Page 35: DataGraft: Data-as-a-Service for Open Data

Transformation function types

By scope:

Functions on rows

Functions on columns

Functions transforming entire

dataset

By caused effect:

Data reordering functions

Data extraction functions

Data manipulation functions

Data enrichment functions

35

Page 36: DataGraft: Data-as-a-Service for Open Data

Transformation functions

Scope | Name | Description | Effect
Rows | Add Row | Create a new record in a dataset | Data enrichment
Rows | Take/Drop Rows | Extract only relevant rows by index | Data extraction. Resolves issues: “Rows describing entities not belonging to a collection”
Rows | Shift Row | Change a row's position inside a dataset | Data reordering; simplifies detection of quality issues
Rows | Filter Rows | Extract only relevant rows by condition | Data extraction. Resolves issues: “Rows describing entities not belonging to a collection”
Entire dataset | Remove Duplicates | Remove similar rows | Data extraction. Resolves issues: “Duplicate rows”
Entire dataset | Sort Dataset | Sort the dataset by given column names in a given order | Data reordering; simplifies detection of quality issues
Entire dataset | Reshape Dataset (Melt) | Move columns to rows | Data manipulation. Resolves issues: “Column headers containing attribute values”
Entire dataset | Reshape Dataset (Cast) | Move rows to columns by categorizing and aggregating | Data enrichment; simplifies detection of quality issues
Entire dataset | Group and Aggregate | Group values by one or more columns and perform aggregation | Data enrichment; simplifies detection of quality issues
Columns | Add Column | Add a column with a manually specified value | Data enrichment
Columns | Derive Column | Add a column with values computed from other columns | Data enrichment
Columns | Take/Drop Columns | Take or drop selected column(s) | Data extraction. Resolves issues: “Columns not related to model”
Columns | Shift Column | Arbitrarily change a column's position | Data reordering; simplifies detection of quality issues
Columns | Merge Columns | Merge columns using a custom separator | Data manipulation. Resolves issues: “Single value split across multiple columns”
Columns | Split Column | Split a column using a custom separator | Data manipulation. Resolves issues: “Multiple values stored in one column”
Columns | Rename Columns | Change column headers | Data manipulation. Resolves issues: “Incorrect column headers”
Columns | Map Columns | Apply a function to all values in a column | Data manipulation. Resolves issues: “Illegal values”, “Missing values”, “Inconsistent values”
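As an illustration of one entry in the table, the sketch below shows what a "Reshape Dataset (Melt)" step does when column headers contain attribute values (here, years). The function name `melt`, its parameters and the sample row are assumptions made for this example; they are not taken from DataGraft or Grafter.

```python
def melt(rows, id_columns, variable_name, value_name):
    """Move every non-id column into its own row as a (variable, value) pair."""
    long_rows = []
    for row in rows:
        ids = {c: row[c] for c in id_columns}
        for col, val in row.items():
            if col in id_columns:
                continue
            long_rows.append({**ids, variable_name: col, value_name: val})
    return long_rows

# Hypothetical wide table: the headers "2015" and "2016" are really values of "year".
rows = [{"make": "Volvo", "2015": 120, "2016": 135}]

long_rows = melt(rows, ["make"], "year", "number")
```

After melting, each row is one observation with the attributes make, year and number, which is easier to map to a graph model.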

Page 37: DataGraft: Data-as-a-Service for Open Data

Tabular data cleaning tools

CLI tools (e.g. Unix awk, csvkit, CSVfix) – lack of convenient user interface

Programming languages and libraries for data analysis (R, agate for Python) – users need knowledge in programming

Spreadsheet software (Microsoft Excel, LibreOffice Calc, Google Spreadsheets) - were not initially created for data cleaning, hard to debug, code is mixed up with data

Frameworks/tools designed to be used for interactive data cleaning and transformation in ETL process

37

Page 38: DataGraft: Data-as-a-Service for Open Data

Example: vehicle registration data

https://www.ssb.no/statistikkbanken/selectvarval/Define.asp?subjectcode=&ProductId=&MainTable=RegKjoretoy&nvl=&PLanguage=1&nyTmpVar=true&CMSSubjectArea=transport-og-reiseliv&KortNavnWeb=bilreg&StatVariant=&checked=true

38

Page 39: DataGraft: Data-as-a-Service for Open Data

Example: vehicle registration data (continued)

* Data obtained from StatBank Norway https://www.ssb.no/en/statistikkbanken 39

Page 40: DataGraft: Data-as-a-Service for Open Data

Map columns – applying a function to all values in a column

Effect: data manipulation

Resolves anomalies: Illegal values, Missing values, Inconsistent values

Required parameters:

For all columns that should be mapped

1) Name of column to manipulate

2) Name of function to apply
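A minimal sketch of the Map columns operation, assuming rows are held as Python dicts. The helper `map_columns`, the two cleaning functions and the sample values are hypothetical; the operation takes the two parameters listed above (column name and function) for each mapped column.

```python
def map_columns(rows, mapping):
    """Apply mapping[column] to every value in that column; leave other columns as-is."""
    return [
        {col: mapping[col](val) if col in mapping else val for col, val in row.items()}
        for row in rows
    ]

def normalise_name(value):
    """Fix inconsistent values: stray whitespace and mixed casing."""
    return value.strip().title()

def parse_count(value):
    """Fix illegal/missing values: drop thousands spaces, map empty to None."""
    digits = value.replace(" ", "")
    return int(digits) if digits else None

# Hypothetical sample rows.
rows = [
    {"region": " oslo ", "registered": "1 200"},
    {"region": "BERGEN", "registered": ""},
]

mapped = map_columns(rows, {"region": normalise_name, "registered": parse_count})
```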

40

Page 41: DataGraft: Data-as-a-Service for Open Data

Before:

Map columns – apply function to all values in a column

41

Page 42: DataGraft: Data-as-a-Service for Open Data

After:

Map columns – apply function to all values in a column

42

Page 43: DataGraft: Data-as-a-Service for Open Data

Derive column – add a column with values computed from others

Effect: data enrichment

Adds new information to data

Required parameters:

1) Name of derived column

2) Column(s) to derive from

3) Function to derive with
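The three parameters above map directly onto a small function. The sketch below is an assumption-laden illustration (the name `derive_column`, the sample columns petrol and diesel, and the summing function are invented for the example).

```python
def derive_column(rows, new_name, source_columns, func):
    """Add column new_name, computed by func from the values of source_columns."""
    return [
        {**row, new_name: func(*(row[c] for c in source_columns))}
        for row in rows
    ]

# Hypothetical sample rows.
rows = [
    {"region": "Oslo", "petrol": 300, "diesel": 200},
    {"region": "Bergen", "petrol": 120, "diesel": 80},
]

with_total = derive_column(rows, "total", ["petrol", "diesel"], lambda p, d: p + d)
```

The source columns stay in place; the derived column only adds information.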

43

Page 44: DataGraft: Data-as-a-Service for Open Data

Before:

Derive column – add a column with values computed from others

44

Page 45: DataGraft: Data-as-a-Service for Open Data

After:

Derive column – add a column with values computed from others

45

Page 46: DataGraft: Data-as-a-Service for Open Data

Cast dataset – move rows to columns by categorizing and aggregating

Effect: data enrichment

Adds new information to data, simplifies anomaly detection

Required parameters:

1) Column name for variable (what to categorize and put to headers)

2) Column name for value (on what to perform aggregations)
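A rough sketch of a cast operation follows. Besides the two parameters listed above (variable column and value column), it assumes a third, `key_column`, that identifies which rows belong together; that parameter, the function name and the sample data are assumptions for illustration, and `sum` is used as the aggregation.

```python
from collections import defaultdict

def cast(rows, key_column, variable_column, value_column, aggregate=sum):
    """Turn distinct values of variable_column into headers, aggregating value_column."""
    groups = defaultdict(lambda: defaultdict(list))
    for row in rows:
        groups[row[key_column]][row[variable_column]].append(row[value_column])
    return [
        {key_column: key, **{var: aggregate(vals) for var, vals in by_var.items()}}
        for key, by_var in groups.items()
    ]

# Hypothetical long-format rows: one row per (region, fuel) observation.
rows = [
    {"region": "Oslo", "fuel": "petrol", "count": 300},
    {"region": "Oslo", "fuel": "diesel", "count": 200},
    {"region": "Oslo", "fuel": "petrol", "count": 50},
    {"region": "Bergen", "fuel": "petrol", "count": 120},
]

wide = cast(rows, "region", "fuel", "count")
```

Bergen has no diesel row, so the resulting record simply lacks that column; a real tool would have to decide how to fill such gaps.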

46

Page 47: DataGraft: Data-as-a-Service for Open Data

Before:

Cast dataset – move rows to columns by categorizing and aggregating

47

Page 48: DataGraft: Data-as-a-Service for Open Data

After:

Cast dataset – move rows to columns by categorizing and aggregating

48

Page 49: DataGraft: Data-as-a-Service for Open Data

RDF mapping

Reusing existing vocabularies is encouraged; it helps to interlink data.

49

Page 50: DataGraft: Data-as-a-Service for Open Data

50

Page 51: DataGraft: Data-as-a-Service for Open Data

RDF mapping

http://vocabs.datagraft.net/vehicles

51

Page 52: DataGraft: Data-as-a-Service for Open Data

Linked (Open) Data: RDF, RDFS, SPARQL

Page 53: DataGraft: Data-as-a-Service for Open Data

Linked Data

• Method for publishing data on the Web

• Self-describing data and relations

• Interlinking

• Accessed using semantic queries

http://www.w3.org/standards/semanticweb/data

53

Page 54: DataGraft: Data-as-a-Service for Open Data

Linked open data cloud

By Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak - http://lod-cloud.net/, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=36956792

54

Page 55: DataGraft: Data-as-a-Service for Open Data

Linked Data principles

• Every thing is represented by a URI

• URIs of things can be dereferenced

• Things are linked to other things by relating their URIs

55

Page 56: DataGraft: Data-as-a-Service for Open Data

Linked Data technology

• Data format: RDF

• Knowledge representation: RDFS/OWL

• Query language: SPARQL

• Linking medium: HTTP

56

Page 57: DataGraft: Data-as-a-Service for Open Data

Graph data structure

Alice

Jim

Peter

57

Page 58: DataGraft: Data-as-a-Service for Open Data

RDF in reality: using URLs to identify things

58

Page 59: DataGraft: Data-as-a-Service for Open Data

Resource Description Framework (RDF) Basics

• RDF makes statements about resources (entities)

o Triple data model: subject -> predicate -> object (Alice's age is 34)

• Subjects and objects:

o Resources (URIs of entities) – can have properties related to them (http://my-domain.com/Alice)

o Literals – constant values ("female", "3.14159"); cannot be subjects

o Blank nodes – used to specify composite properties (e.g., an address, which is composed of a country, city, street name, house number, zip code, etc.)

• Relationships (a.k.a. predicates) – relate one subject to one object
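The triple model can be sketched with a few Python types. This is a simplified illustration, not an RDF library: blank nodes are left out, the class and function names are invented, and the predicate URI http://my-domain.com/age is a hypothetical one built on the slide's example domain.

```python
from typing import NamedTuple

class URI(str):
    """A resource identifier; may appear as subject, predicate or object."""

class Literal(NamedTuple):
    """A constant value; may appear only as an object."""
    value: str

class Triple(NamedTuple):
    subject: URI
    predicate: URI
    obj: object  # a URI or a Literal

def make_triple(subject, predicate, obj):
    """Build a triple, rejecting literals in the subject position."""
    if not isinstance(subject, URI):
        raise TypeError("a literal cannot be the subject of a triple")
    return Triple(subject, predicate, obj)

alice = URI("http://my-domain.com/Alice")
age = URI("http://my-domain.com/age")  # hypothetical predicate URI

# "Alice's age is 34" as subject -> predicate -> object.
statement = make_triple(alice, age, Literal("34"))
```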

59

Page 60: DataGraft: Data-as-a-Service for Open Data

RDF serialisation formats

• Turtle family of RDF languages (N-Triples, Turtle, TriG and N-Quads)

<http://example.org/bob#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .

<http://example.org/bob#me> <http://xmlns.com/foaf/0.1/knows> <http://example.org/alice#me> .

<http://example.org/bob#me> <http://schema.org/birthDate> "1990-07-04"^^<http://www.w3.org/2001/XMLSchema#date> .

<http://example.org/bob#me> <http://xmlns.com/foaf/0.1/topic_interest> <http://www.wikidata.org/entity/Q12418> .

<http://www.wikidata.org/entity/Q12418> <http://purl.org/dc/terms/title> "Mona Lisa" .

<http://www.wikidata.org/entity/Q12418> <http://purl.org/dc/terms/creator> <http://dbpedia.org/resource/Leonardo_da_Vinci> .

<http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619> <http://purl.org/dc/terms/subject> <http://www.wikidata.org/entity/Q12418> .

• JSON-LD (JSON-based RDF syntax)

{
  "@context": "example-context.json",
  "@id": "http://example.org/bob#me",
  "@type": "Person",
  "birthdate": "1990-07-04",
  "knows": "http://example.org/alice#me",
  "interest": {
    "@id": "http://www.wikidata.org/entity/Q12418",
    "title": "Mona Lisa",
    "subject_of": "http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619",
    "creator": "http://dbpedia.org/resource/Leonardo_da_Vinci"
  }
}

Page 61: DataGraft: Data-as-a-Service for Open Data

RDF serialisation formats (continued)

• RDFa (for HTML and XML embedding)


<body prefix="foaf: http://xmlns.com/foaf/0.1/ schema: http://schema.org/ dcterms: http://purl.org/dc/terms/">

<div resource="http://example.org/bob#me" typeof="foaf:Person">

<p>Bob knows <a property="foaf:knows" href="http://example.org/alice#me">Alice</a>

and was born on the <time property="schema:birthDate" datatype="xsd:date">1990-07-04</time>.</p>

<p>Bob is interested in <span property="foaf:topic_interest"

resource="http://www.wikidata.org/entity/Q12418">the Mona Lisa</span>.</p>

</div>

<div resource="http://www.wikidata.org/entity/Q12418">

<p>The <span property="dcterms:title">Mona Lisa</span> was painted by

<a property="dcterms:creator" href="http://dbpedia.org/resource/Leonardo_da_Vinci">Leonardo da Vinci</a>

and is the subject of the video

<a href="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">'La Joconde à Washington'</a>. </p>

</div>

<div resource="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">

<link property="dcterms:subject" href="http://www.wikidata.org/entity/Q12418"/>

</div>

</body>

Page 62: DataGraft: Data-as-a-Service for Open Data

RDF serialisation formats (continued)

• RDF/XML (XML syntax for RDF)


<?xml version="1.0" encoding="utf-8"?>

<rdf:RDF xmlns:dcterms="http://purl.org/dc/terms/"

xmlns:foaf="http://xmlns.com/foaf/0.1/"

xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

xmlns:schema="http://schema.org/">

<rdf:Description rdf:about="http://example.org/bob#me">

<rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>

<schema:birthDate rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1990-07-04</schema:birthDate>

<foaf:knows rdf:resource="http://example.org/alice#me"/>

<foaf:topic_interest rdf:resource="http://www.wikidata.org/entity/Q12418"/>

</rdf:Description>

<rdf:Description rdf:about="http://www.wikidata.org/entity/Q12418">

<dcterms:title>Mona Lisa</dcterms:title>

<dcterms:creator rdf:resource="http://dbpedia.org/resource/Leonardo_da_Vinci"/>

</rdf:Description>

<rdf:Description rdf:about="http://data.europeana.eu/item/04802/243FA8618938F4117025F17A8B813C5F9AA4D619">

<dcterms:subject rdf:resource="http://www.wikidata.org/entity/Q12418"/>

</rdf:Description>

</rdf:RDF>

Page 63: DataGraft: Data-as-a-Service for Open Data

RDF Schema (RDFS)

• basic capabilities for describing RDF vocabularies

• includes concepts to describe:

o classes, class hierarchies (sub-classes) and instances (typing)

o non-standard literal data types

o property hierarchies (sub-properties)

o predicate domain and range

o utility properties (labels, comments, additional information about things, definitions of resources)

o …

63

Page 64: DataGraft: Data-as-a-Service for Open Data

Linked data vocabulary sources

64

Page 65: DataGraft: Data-as-a-Service for Open Data

Querying RDF: SPARQL

• RDF query language

– Based on graph matching

• Uses SQL-like syntax

• Query types:

– SELECT – table of raw values

– CONSTRUCT, DESCRIBE – RDF graph

– ASK – boolean

65

Page 66: DataGraft: Data-as-a-Service for Open Data

SPARQL querying – example graph

[Figure: example RDF graph with the nodes a:Alice, b:Peter and c:Jim, connected by foaf:knows edges, typed as foaf:Person via rdf:type, and carrying the foaf:nick values "Lissy", "Pety" and "Jimbo".]

66

Page 67: DataGraft: Data-as-a-Service for Open Data

SPARQL querying – query

Question: What are the nicknames of people that Alice knows?

Query:

PREFIX a: <http://alice.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?nickname WHERE {
  a:Alice foaf:knows ?someone .
  ?someone foaf:nick ?nickname
}

Graph pattern: a:Alice → foaf:knows → ?someone → foaf:nick → ?nickname
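To make the graph-matching idea concrete, the sketch below evaluates the same two triple patterns over a small in-memory list of triples. It is not a SPARQL engine: the `match` helper is invented for the example, prefixed names are kept as plain strings, and the triples are an assumed reading of the example graph (only the edges needed for this query are included).

```python
# Assumed subset of the example graph, as (subject, predicate, object) tuples.
graph = [
    ("a:Alice", "foaf:knows", "b:Peter"),
    ("a:Alice", "foaf:knows", "c:Jim"),
    ("a:Alice", "foaf:nick", "Lissy"),
    ("b:Peter", "foaf:nick", "Pety"),
    ("c:Jim", "foaf:nick", "Jimbo"),
]

def match(graph, subject=None, predicate=None, obj=None):
    """Yield triples matching the pattern; None plays the role of a variable."""
    for s, p, o in graph:
        if subject in (None, s) and predicate in (None, p) and obj in (None, o):
            yield (s, p, o)

# Pattern 1 binds ?someone; pattern 2 reuses that binding to find ?nickname.
nicknames = [
    nickname
    for _, _, someone in match(graph, subject="a:Alice", predicate="foaf:knows")
    for _, _, nickname in match(graph, subject=someone, predicate="foaf:nick")
]
```

Alice's own nickname is not returned because no pattern asks for the nickname of a:Alice herself.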

67

Page 68: DataGraft: Data-as-a-Service for Open Data

SPARQL querying – matching to the graph

[Figure: the example RDF graph (a:Alice, b:Peter, c:Jim with foaf:knows, rdf:type foaf:Person and foaf:nick "Lissy", "Pety", "Jimbo"), with the part matched by the query highlighted.]

68

Page 69: DataGraft: Data-as-a-Service for Open Data

SPARQL querying – result

Query:

PREFIX a: <http://alice.org/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?nickname WHERE {
  a:Alice foaf:knows ?someone .
  ?someone foaf:nick ?nickname
}

Result:

nickname

"Pety"

"Jimbo"

69

Page 70: DataGraft: Data-as-a-Service for Open Data

Data integration using Linked Data: using URIs

Example: Relational DB or spreadsheet – dataset about scientific publications:

ID   Name    Home page
1    Alice   http://alice.org/
2    Tim     https://www.w3.org/People/Berners-Lee/

ID author   ISBN                 Publication topic
1           978-3-16-148410-0    "On the frictional coefficient of bananas"
1           534-1-22-663975-1    "Do woodpeckers get headaches?"
2           1-933019-33-6        "The Semantic Web"

70

Page 71: DataGraft: Data-as-a-Service for Open Data

Data integration using Linked Data: using URIs (continued)

Graph representation of new dataset:

[Figure: a:Alice and t:Tim are linked to their publication URIs (edge label foaf:publications), and each publication has a foaf:topic literal.]

http://.../978-3-16-148410-0 foaf:topic "On the frictional coefficient of bananas"

http://.../534-1-22-663975-1 foaf:topic "Do woodpeckers get headaches?"

http://.../1-933019-33-6 foaf:topic "The Semantic Web"

71

Page 72: DataGraft: Data-as-a-Service for Open Data

Data integration using Linked Data: Using URIs (continued)

Same URI!
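Because both datasets use the same URI for Alice, integrating them amounts to taking the union of their triples. The sketch below illustrates this with Python sets; the prefixed names are plain strings, the `isbn:` prefix is a hypothetical stand-in for the elided publication URIs, and using foaf:publications for Alice's edge is an assumption based on the label shown for Tim.

```python
# Dataset 1: who knows whom (assumed subset of the earlier example graph).
people = {
    ("a:Alice", "foaf:knows", "b:Peter"),
    ("a:Alice", "foaf:nick", "Lissy"),
}

# Dataset 2: publications, converted from the table to triples.
publications = {
    ("a:Alice", "foaf:publications", "isbn:978-3-16-148410-0"),
    ("isbn:978-3-16-148410-0", "foaf:topic", "On the frictional coefficient of bananas"),
}

# Integration is a set union: a:Alice is the same node in both graphs.
merged = people | publications

# All properties now known about Alice, regardless of the source dataset.
about_alice = sorted(p for s, p, o in merged if s == "a:Alice")
```

No join key or schema alignment is needed here; had the two datasets used different identifiers for Alice, an explicit link between them would be required first.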

72

Page 73: DataGraft: Data-as-a-Service for Open Data

Data integration using Linked Data: Using URIs (continued)

Resulting graph:

[Figure: the foaf:knows graph of a:Alice, b:Peter and c:Jim (with rdf:type foaf:Person and foaf:nick values "Lissy", "Pety", "Jimbo") merged with the publications graph, so that a:Alice is also connected to …978-3-16-148410-0 (foaf:topic "On the frictional coefficient of bananas") and …534-1-22-663975-1 (foaf:topic "Do woodpeckers get headaches?").]

73

Page 74: DataGraft: Data-as-a-Service for Open Data

Query federation using SPARQL

74

Page 75: DataGraft: Data-as-a-Service for Open Data

Linked Data is great for Open Data

• Linked Data is a great means to represent data

– Semantics are part of the data

– Naturally linked to other data

– Querying language

• How Linked Data can improve Open Data:

– Easier integration, free data from silos

– Seamless interlinking of data

– Understand the data

– New ways to query and interact with data

75

Page 76: DataGraft: Data-as-a-Service for Open Data

… but has been ignored by the mainstream

• Difficult to make it accessible to people

– Publishers

– Developers

– Data workers

• Challenges with using Linked Data

– Lack of tooling and expertise to publish high quality Linked Data

– Lack of resources to host LOD endpoints / unreliable data access

• DataGraft: packaging Linked Data to make it more approachable to the open data community

76

Page 77: DataGraft: Data-as-a-Service for Open Data

Data-as-a-Service: DataGraft

Page 78: DataGraft: Data-as-a-Service for Open Data

“Data is the new oil”

…but many of us just need gasoline

Data-as-a-Service

…is the new filling station

Page 79: DataGraft: Data-as-a-Service for Open Data

Data-as-a-Service

• Outsourcing of various data operations to the cloud

• Eliminates

– upfront costs on data infrastructure

– ongoing investment of time and resources in managing the data infrastructure

• Complete package for

– transformation of raw data into meaningful data assets

– reliable delivery of data assets

79

Page 80: DataGraft: Data-as-a-Service for Open Data

DataGraft was developed to allow data workers to manage their data in a simple, effective, and efficient way

Powerful data transformation and reliable data access capabilities

Page 81: DataGraft: Data-as-a-Service for Open Data

Data Transformation and RDF Publication Process

• Interactive design of transformations?

• Repeatable transformations?

• Reuse/share transformations (user-based access)?

• Cloud-based deployment of transformations?

• Self-serviced process?

• Data and Transformation as-a-Service?

[Figure: publication process pipeline with the stages Raw Data, Prepared Data, RDF Graph and RDF Triple Store; the steps are labelled Transform, Map (ontology mapping against Ontology X) and Generate RDF.]

Page 82: DataGraft: Data-as-a-Service for Open Data

Tabular Data

Graph Data

DataGraft: Data-as-a-Service for the Data Transformation and RDF Publication Process

82

Page 83: DataGraft: Data-as-a-Service for Open Data

83

https://www.ssb.no/statistikkbanken

Example: Using statistical data


Page 102: DataGraft: Data-as-a-Service for Open Data

Data records (rows)

Add row

Take row(s)

Drop row(s)

Shift row

Filter rows (grep)

Remove duplicate rows

Entire dataset

Sort

Reshape dataset

Group (categorize) and aggregate

Columns

Add column(s)

Take column(s)

Drop column(s)

Move column

Merge columns

Split column

Rename column(s)

Apply function to all values in a column


Page 108: DataGraft: Data-as-a-Service for Open Data

Data pages and federated querying

108

What is the population of locations and total number of persons employed in Human health and social work activities?

Page 109: DataGraft: Data-as-a-Service for Open Data

Configuring data visualizations

109


Page 113: DataGraft: Data-as-a-Service for Open Data

113

APIs

Page 114: DataGraft: Data-as-a-Service for Open Data

DataGraft key feature: Flexible management and sharing of data and transformations

Fork, reuse and extend transformations built by other professionals from DataGraft’s transformations catalogue

Interactively build, modify and share data transformations

Share transformations privately or publicly

Reuse transformations to repeatably clean and transform spreadsheet data

Programmatically access transformations and the transformation catalogue

114

Page 115: DataGraft: Data-as-a-Service for Open Data

Reuse of transformations in environmental data publishing

TRAGSA Pilot

• Number of transformations: 42

– Created via reuse: 25

• Number of triples:

– ~ 7.7M

ARPA Pilot

• Number of transformations: 5

– Created via reuse: 2

• Number of triples:

– ~ 14K

115

Forking/reusing transformations helped us spend less time on creating new transformations

Page 116: DataGraft: Data-as-a-Service for Open Data

DataGraft key feature: Reliable data hosting and querying services

Host data on DataGraft’s reliable, cloud-based semantic graph database

Share data privately or publicly

Query data through your own SPARQL endpoint

Programmatically access the data catalogue

Operations & maintenance performed on behalf of users

Page 117: DataGraft: Data-as-a-Service for Open Data

DataGraft Enablers

Grafter

Grafterizer

Semantic Graph DBaaS

Data Portal

DataGraft

Page 118: DataGraft: Data-as-a-Service for Open Data

DataGraft – 1 package, 2 audiences

DataGraft

Data Publisher: helping to integrate and publish data

Application Developer: giving better, easier tools

118

Page 119: DataGraft: Data-as-a-Service for Open Data

Examples and Demo

Page 120: DataGraft: Data-as-a-Service for Open Data

The context: Statsbygg

120

• A public sector administration company

• Norwegian government's key advisor in construction and property affairs

• Building commissioner

• Property manager

• Property developer

• Interest: Exploit/Share property data in novel ways

• For efficiency and sustainability of the property included in the government's civil estate

Example: Reporting state-owned real estate properties in Norway

Page 121: DataGraft: Data-as-a-Service for Open Data

Example: Reporting state-owned real estate properties in Norway (cont’)

Report

• A hard copy of 314 pages and as a PDF file

• 6 Person-Months

• Data collection with spreadsheets

• Quality assurance through e-mails and phone correspondence

Pains

• Time consuming

• Poor data quality

• Static report without live updating

Reporting Service

• Live service

• Efficient sharing of data

• Simplified integration with external datasets

• Live updating

• Reliable access

• …

3rd party services

• Risk and vulnerability analysis, e.g. buildings affected by flooding

• Analysis of leasing prices

121

Page 122: DataGraft: Data-as-a-Service for Open Data

Sample data

Cleaning, Transformation, Publishing, Integration, Querying, Visualization, Service Access

Page 123: DataGraft: Data-as-a-Service for Open Data

Demo Scenario

• Interactively create tabular data transformations

• Reuse/extend data transformations (incl. data annotations)

• RDF data publication and querying

• Integrating and visualising data from different sources

• (Using 3rd party tools with DataGraft)
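
The first and third steps of the scenario (cleaning tabular data, then publishing it as RDF) can be sketched in a few lines. This is an illustration only and not how Grafterizer or Grafter are implemented: the sample rows, column names, and the `http://example.org/` namespace are all hypothetical, and the literal escaping covers only backslashes and double quotes.

```python
import csv
import io

# Hypothetical sample rows and namespace, for illustration only.
SAMPLE_CSV = """id,name,municipality
1,Building A,Oslo
2, Building B ,Bergen
"""
BASE = "http://example.org/"


def escape_literal(value: str) -> str:
    """Escape backslashes and double quotes for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_ntriples(csv_text: str) -> list[str]:
    """Clean each row and emit one triple per non-identifier column."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}building/{row['id'].strip()}>"
        for column in ("name", "municipality"):
            value = row[column].strip()  # cleaning step: trim whitespace
            if value:  # skip empty cells
                triples.append(
                    f'{subject} <{BASE}prop/{column}> "{escape_literal(value)}" .'
                )
    return triples


for line in rows_to_ntriples(SAMPLE_CSV):
    print(line)
```

Each output line is one N-Triples statement, a format that RDF stores can typically load directly.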

Page 124: DataGraft: Data-as-a-Service for Open Data

Demo sample data

Cleaning, Transformation, Publishing, Integration, Querying, Visualization, Service Access

Page 125: DataGraft: Data-as-a-Service for Open Data

Demo sample data

Cleaning, Transformation, Publishing, Integration, Querying, Visualization, Service Access

Page 126: DataGraft: Data-as-a-Service for Open Data

Benefits of DataGraft in use cases

• Simplified data publishing process

• Integration with external data sources using established web standards

• Data that was not previously publicly available is now published (e.g. air quality data in Oslo)

• Time-efficient publishing

• Repeatable data transformation process
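
What makes a transformation repeatable and reusable can be illustrated by treating it as an ordered list of small, side-effect-free steps. The sketch below is an assumption-laden illustration in Python, not the Grafter API: the step functions, the `station` and `no2` column names, and the sample row are all hypothetical.

```python
from typing import Callable, Iterable

Row = dict[str, str]
Step = Callable[[Row], Row]


def trim_values(row: Row) -> Row:
    """Strip surrounding whitespace from every cell."""
    return {key: value.strip() for key, value in row.items()}


def uppercase_station(row: Row) -> Row:
    """Normalise the (hypothetical) 'station' column to upper case."""
    return {**row, "station": row["station"].upper()}


def run_pipeline(steps: Iterable[Step], rows: Iterable[Row]) -> list[Row]:
    """Apply each step, in order, to every row."""
    steps = list(steps)
    result = []
    for row in rows:
        for step in steps:
            row = step(row)
        result.append(row)
    return result


# An existing transformation, and a "fork" that reuses it and adds a step.
base_pipeline = [trim_values]
forked_pipeline = base_pipeline + [uppercase_station]

rows = [{"station": " alnabru ", "no2": " 41.5 "}]
print(run_pipeline(forked_pipeline, rows))
```

Because each step returns a new row instead of modifying its input, running the same pipeline on the same data always gives the same result, and a fork leaves the original pipeline unchanged.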

Page 127: DataGraft: Data-as-a-Service for Open Data

DataGraft and Big Data

• Desired features:

– real-time interactivity

– batch transformation of large datasets

We are developing a hybrid solution to work with both batch and real-time processing.

Page 128: DataGraft: Data-as-a-Service for Open Data

DataGraft and Big Data: High-level architecture

Page 129: DataGraft: Data-as-a-Service for Open Data

DataGraft – targeted impacts

Reduction in costs for organisations which lack sufficient expertise and resources to make their data available

Reduction in the dependency of data owners on generic Cloud platforms to build, deploy and maintain their linked data from scratch

Increase in the speed of publishing new datasets and updating existing datasets

Reduction in the cost and complexity of developing applications that use data

Increase in the reuse of data by providing reliable access to numerous datasets hosted on DataGraft.net

Page 130: DataGraft: Data-as-a-Service for Open Data

Example: The benefit of DataGraft in PLUQI

1. 23% reduction in development cost

• Reducing cost for implementing transformations

• Integrating the process is simpler

2. Able to focus on service quality

• Gathering enough good datasets

• Designing/implementing

[Figure: development stages compared Before and After (with DataGraft): Datasets gathering, Data transformation, Data provisioning/access, Implementing App]

Page 131: DataGraft: Data-as-a-Service for Open Data

DataGraft in numbers (as of end of Jan 2016)

• 238 registered users

• 607 (208 public) registered data transformations

• 1828 uploaded files

• 192 public data pages

Page 132: DataGraft: Data-as-a-Service for Open Data

DataGraft in the wild

• Investigating crime data in small geographies

• Used DataGraft to transform data and publish RDF

http://benproctor.co.uk/investigating-crime-data-at-small-geographies/

Page 133: DataGraft: Data-as-a-Service for Open Data

Data Science and DataGraft

Greater Data Science:

1. Data Exploration and Preparation

2. Data Representation and Transformation

3. Computing with Data

4. Data Visualization and Presentation

5. Data Modeling

6. Science about Data Science

“50 years of Data Science” by David Donoho, http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf

DataGraft

Page 134: DataGraft: Data-as-a-Service for Open Data

https://whatsthebigdata.com/2016/05/01/data-scientists-spend-most-of-their-time-cleaning-data/

Page 135: DataGraft: Data-as-a-Service for Open Data

Page 136: DataGraft: Data-as-a-Service for Open Data

Summary

• DataGraft – emerging Data-as-a-Service solution for making (linked) data more accessible

– Platform, portal, methodology, APIs

– Online service, functional and documented

– Validated through several use cases

• Key features:

– Support for Sharable/Repeatable/Reusable Data Transformations

– Reliable RDF Database-as-a-Service

Page 137: DataGraft: Data-as-a-Service for Open Data

https://datagraft.net

Thank you!

Contact: [email protected]

Page 138: DataGraft: Data-as-a-Service for Open Data
