provenance for reproducible data science

39
Provenance for Reproducible Data Science Andreas Schreiber German Aerospace Center (DLR) Cologne/Berlin, Germany PyData Seattle 2017 > PyData Seattle 2017 > A. Schreiber Provenance for Reproducible Data Science > 06.07.2017 DLR.de Chart 1

Upload: andreas-schreiber

Post on 21-Jan-2018

130 views

Category:

Data & Analytics


0 download

TRANSCRIPT

Page 1: Provenance for Reproducible Data Science

Provenance for Reproducible Data Science

Andreas Schreiber

German Aerospace Center (DLR)

Cologne/Berlin, Germany

PyData Seattle 2017

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 1

Page 2: Provenance for Reproducible Data Science

Topics

• Introduction to provenance and PROV

• Modelling provenance for data processing

• Python APIs for provenance recording

• Provenance recording for Jupyter notebooks

• Storing provenance in graph databases

• Analysis of provenance information

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 2

Page 3: Provenance for Reproducible Data Science

Introduction

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 3

Study Industrial Mathematics

(dt. Technomathematik)

Soldier (SIGINT)

Scientist (Grid Computing)

Page 4: Provenance for Reproducible Data Science

Introduction

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 4

Co-Founder

Data Scientist

Patient

Simulation and Software Technology, Cologne/Berlin

Head of Intelligent and Distributed Systems department

Institute of Data Science, Jena

Head of Secure Software Engineering group

Page 5: Provenance for Reproducible Data Science

Reproducibility

Reproducibility in (data) science is based on

• Open Source Software

• Code Reviews

• Code Repositories

• Publications with code

• Container (Docker etc.)

• Workflows

• (Electronic) laboratory notebooks

• Open data formats

• Data management

• Metadata and Provenance

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 5

Page 6: Provenance for Reproducible Data Science

Metadata and Provenance

Each data file in data processing has two parts

• Data: Actual measured, simulated, or generated data results

• Metadata: Information that describes or relates to the data

Metadata can include a large variety of information

• Geographic information that can be used to limit a spatial search

• Quality information (“Data is bad for some reason,” “Granule is cloud obscured”)

• Instrument configuration information (“Instrument in spectral zoom mode,” “Spacecraft

maneuver in progress”)

• Extra information about the data files themselves: file size, checksum for data integrity

verification

• Provenance information (Where did I get this file? How did it come to exist?)

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 6

Page 7: Provenance for Reproducible Data Science

Provenance

Basics

• Provenance refers to the source of information and the process that led to its existence

• Provenance information is critical to users trying to understand where a particular data file

came from

Other and related terms

• Traceability

• Lineage

• Logging

• Monitoring

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 7

Page 8: Provenance for Reproducible Data Science

Provenance Information

Capture, archive, and distribute provenance information, for example

• The source of all externally supplied data files

• The source of the algorithms used to transform the data within the system

• Algorithm Design Documents

• A complete description of the processing environment

• A complete description of the processing framework

• A record of each job’s execution

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 8

Page 9: Provenance for Reproducible Data Science

Big Data Processing

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 9

High-Performance Computing

• Explicit Control

Distributed Computing

• Implicit Control (via Graphs)

MPI Processes Threads Hadoop Spark Dask

Scale Up Scale Out

Page 10: Provenance for Reproducible Data Science

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 10

Data Science Workflows

Page 11: Provenance for Reproducible Data Science

More Formal Definition of Provenance

Provenance is

information about entities, activities, and people

involved in

producing a piece of data or thing,

which can be used to form

assessments about its quality, reliability or trustworthiness.

PROV W3C Working Group

https://www.w3.org/TR/prov-overview

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 11

Page 12: Provenance for Reproducible Data Science

W3C Specification „PROV“

• PROV-O, the PROV ontology, an OWL2 ontology allowing the mapping of the PROV data

model to RDF

• PROV-DM, the PROV data model for provenance

• PROV-N, a notation for provenance aimed at human consumption

• PROV-CONSTRAINTS, a set of constraints applying to the PROV data model

• PROV-XML, an XML schema for the PROV data model

• PROV-AQ, mechanisms for accessing and querying provenance

• PROV-DICTIONARY introduces a specific type of collection, consisting of key-entity pairs

• PROV-DC provides a mapping between PROV-O and Dublin Core Terms

• PROV-SEM, a declarative specification in terms of first-order logic of the PROV data model

• PROV-LINKS introduces a mechanism to link across bundles

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 12

Page 13: Provenance for Reproducible Data Science

PROV Elements

Entities

• Physical, digital, conceptual, or other kinds of things

• For example, documents, web sites, graphics, or data sets

Activities

• Activities generate new entities or

make use of existing entities

• Activities could be actions or processes

Agents

• Agents takes a role in an activity and have

the responsibility for the activity

• For example, persons, pieces of software,

or organizations

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 13

Activity

Entity

Agent

Page 14: Provenance for Reproducible Data Science

PROV Relations

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 14

Activity

Entity

Agent

wasGeneratedBy

used

wasDerivedFrom

wasAttributedTo

wasAssociatedWith

Page 15: Provenance for Reproducible Data Science

Baking a Cake

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 15

100 g butter

bake

2eggs

100 gsugar

100 gflour

cake

used used

used

used

wasGeneratedBywasDerivedFrom

Page 16: Provenance for Reproducible Data Science

Textual Representations Visualizations

PROV Notations and Representations

• Formats: PROV-N, JSON, Turtle, XML, …

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 16

document

prefix userdata http://software.dlr.de/qs/userdata/

. . .

wasDerivedFrom(userdata:weights, userdata:WeightReport.csv,

wasDerivedFrom(qs:graphic/weights, userdata:weights,

wasAssociatedWith(qs:graphic/weights, qs:user/[email protected], -)

used(python_method:read_csv, library:pandas, -)

used(python_method:matplotlib_plot, userdata:weights, -)

used(python_method:matplotlib_plot, library:matplotlib, -)

used(python_method:read_csv, userdata:WeightReport.csv, -)

wasAttributedTo(userdata:WeightReport.csv, qs:user/[email protected])

agent(qs:user/[email protected], [prov:type="prov:Person"])

entity(library:pandas, [library:version="0.17.1"])

entity(userdata:WeightReport.csv)

entity(userdata:weights)

. . .

endDocument

Page 17: Provenance for Reproducible Data Science

Provenance Architecture

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 17

Provenance

Store

Recording of Data Processing

Information

Application

Data (Results)

Page 18: Provenance for Reproducible Data Science

Storing and Retrieving Provenance

Some Storage Technologies

• Relational databases and SQL

• XML and Xpath

• RDF and SPARQL

• Graph databases and Gremlin/Cypher

Services

• REST APIs

• PROVSTORE

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 18

Page 19: Provenance for Reproducible Data Science

ProvStore

University of Southampton

• RESTful web service

• storage and access of

provenance documents

• Public and private

documents

• Conversion to various

text formats

• Simple visualizations

• APIs

• Python

• jQuery

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 19

https://provenance.ecs.soton.ac.uk/store/

Page 20: Provenance for Reproducible Data Science

Graphs

Provenance is a Directed Acyclic Graph (DAG)

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 20

A

B

E

F

GD

C

Page 21: Provenance for Reproducible Data Science

Graph Databases

Naturally, graph databases are a good

technology for storing (Provenance) graphs

Many graph databases are available

• Neo4J

• Titan

• ArangoDB

• ...

Query languages

• Cypher

• Gremlin (TinkerPop)

• GraphQL

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 21

Page 22: Provenance for Reproducible Data Science

Neo4j

• Open-Source

• Implemented in Java

• Stores property graphs

(key-value-based, directed)

http://neo4j.com

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 22

Page 23: Provenance for Reproducible Data Science

Storing Provenance in Graph Database

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 23

Graph database Neo4j

MATCH (e:Entity)-[*]-(u:Agent) RETURN u

Page 24: Provenance for Reproducible Data Science

Trusted Provenance: Storing Provenance in a Blockchain

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 24

PROV2BIGCHAINDB

https://github.com/DLR-SC/prov2bigchaindb

Page 25: Provenance for Reproducible Data Science

Gather or Generate Provenance

Depends on your application (tools, languages, etc.)

• Generation at run-time, compile-time, or retrospectively

Runtime

• Instrumentation of the application

• Cumbersome from software engineering perspective

• Combined with logging or with aspect-oriented approaches

Compile time

• Based on static code analysis (dependency analysis, program slicing, etc.)

Retrospectively

• Reconstructed from files or filesystem metadata

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 25

Page 26: Provenance for Reproducible Data Science

Tools and Libraries for Python

Libraries for Python

• PROVPY

• PROVNEO4J

• PROV-DB-CONNECTOR

Other Tools

• NOWORKFLOW

• GIT2PROV

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 26

Page 27: Provenance for Reproducible Data Science

Python Library ProvPy (PROV)

https://github.com/trungdong/prov

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 27

from prov.model import ProvDocument

# Create a new provenance document

d1 = ProvDocument()

# Entity: now:employment-article-v1.html

e1 = d1.entity('now:employment-article-v1.html')

# Agent: nowpeople:Bob

d1.agent('nowpeople:Bob')

# Attributing the article to the agent

d1.wasAttributedTo(e1, 'nowpeople:Bob')

d1.entity('govftp:oesm11st.zip',

{'prov:label': 'employment-stats-2011',

'prov:type': 'void:Dataset'})

d1.wasDerivedFrom('now:employment-article-v1.html',

'govftp:oesm11st.zip')

# Adding an activity

d1.activity('is:writeArticle')

d1.used('is:writeArticle', 'govftp:oesm11st.zip')

d1.wasGeneratedBy('now:employment-article-v1.html', 'is:writeArticle')

Page 28: Provenance for Reproducible Data Science

Python Library ProvPy (PROV)

https://github.com/trungdong/prov

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 28

Page 29: Provenance for Reproducible Data Science

PROVNEO4J – Storing PROV Documents in Neo4j

https://github.com/DLR-SC/provneo4j

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 29

import provneo4j.api

provneo4j_api = provneo4j.api.Api(

base_url="http://localhost:7474/db/data",

username="neo4j", password="python")

provneo4j_api.document.create(prov_doc, name=”MyProv”)

Page 30: Provenance for Reproducible Data Science

PROVNEO4J – Storing PROV Documents in Neo4j

https://github.com/DLR-SC/provneo4j

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 30

Page 31: Provenance for Reproducible Data Science

PROV-DB-CONNECTOR

https://github.com/DLR-SC/prov-db-connector

Successor of PROVNEO4J

• Connectors for Neo4j

implemented

• ArangoDB planned

• APIs in REST

implemented

• ZeroMQ & MQTT planned

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 31

from prov.model import ProvDocument

from provdbconnector import ProvDb

from provdbconnector.db_adapters.in_memory import SimpleInMemoryAdapter

prov_api = ProvDb(adapter=SimpleInMemoryAdapter, auth_info=None)

# create the prov document

prov_document = ProvDocument()

prov_document.add_namespace("ex", "http://example.com")

prov_document.agent("ex:Bob")

prov_document.activity("ex:Alice")

prov_document.association("ex:Alice", "ex:Bob")

document_id = prov_api.save_document(prov_document)

print(prov_api.get_document_as_provn(document_id))

Page 32: Provenance for Reproducible Data Science

Provenance Instrumentation of TensorFlow

Provenance of TensorFlow workflows

• Tensor PROV Entity

• Operations PROV Activity

Example: MNIST with 400 training iterations

• 64581 database nodes

• 33549 Entities

• 31032 Activities

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 32

Page 33: Provenance for Reproducible Data Science

Example Query

• Shortest paths from all tensors

in 400. iteration to init operation

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 33

MATCH path=allShortestPaths((root)<-[*]-(n))

WHERE root.`tf:type`="tf:Session_init" and n.`tf:name` =~ ".*_400"

RETURN path

Page 34: Provenance for Reproducible Data Science

NOWORKFLOW – Provenance of Scripts

https://github.com/gems-uff/noworkflow

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 34

Project

experiment.py

p12.dat

p13.dat

precipitation.py

p14.dat

out.png

$ now run -e Tracker experiment.py

Page 35: Provenance for Reproducible Data Science

GIT2PROV

http://git2prov.org

• Generate PROV documents from

git repositories

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 35

Page 36: Provenance for Reproducible Data Science

GIT2PROV Example Output

https://provenance.ecs.soton.ac.uk/store/documents/116377/

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 36

Page 37: Provenance for Reproducible Data Science

Provenance Visualization

Visualization of Provenance is an ongoing research topic

• Especially, for non-experts (“Provenance for people”)

• Example: PROV COMICS

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 37

Page 38: Provenance for Reproducible Data Science

Key Messages and Summary

Recording the Provenance of Data Science workflows is important

• to understand where data came from

• to reproduce data processing steps or whole workflows

Use a standard for Provenance

• W3C standard PROV

• Mapping to (graph) databases, allows easy querying

• A standard allow interoperability and comparison

Recording Provenance is not hard

• APIs for Python

• Tools

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 38

Activity

Entity

Agent

wasGeneratedBy

used

wasDerivedFrom

wasAttributedTo

wasAssociatedWith

Page 39: Provenance for Reproducible Data Science

> PyData Seattle 2017 > A. Schreiber • Provenance for Reproducible Data Science > 06.07.2017DLR.de • Chart 39

Thank You!

Questions?

[email protected]

www.DLR.de/sc | @onyame