architecture tutorial provenance: overview professor luc moreau [email protected] university...
TRANSCRIPT
Architecture Tutorial
Provenance: overview
Professor Luc [email protected] of Southampton
www.ecs.soton.ac.uk/~lavm
Architecture Tutorial
Provenance & PASOA Teams
• University of Southampton– Luc Moreau, Paul Groth, Simon Miles, Victor Tan, Miguel Branco, Sofia
Tsasakou, Sheng Jiang, Steve Munroe, Zheng Chen• IBM UK (EU Project Coordinator)
– John Ibbotson, Neil Hardman, Alexis Biller• University of Wales, Cardiff
– Omer Rana, Arnaud Contes, Vikas Deora, Ian Wootten, Shrija Rajbhandari• Universitad Politecnica de Catalunya (UPC)
– Steven Willmott, Javier Vazquez• SZTAKI
– Laszlo Varga, Arpad Andics, Tamas Kifor
• German Aerospace– Andreas Schreiber, Guy Kloss, Frank Danneman
Architecture Tutorial
Contents
• Motivation
• Provenance Concepts
• Provenance Architecture
• Standardisation
• Conclusions
Architecture Tutorial
Motivation
Architecture Tutorial
Scientific Research
Academic Peer Review
Architecture Tutorial
Business Regulations
Audit (Sarbanes-Oxley)
Audit (Basel II)
Accounting
Banking
Architecture Tutorial
Health Care Management
European Recommendation R(97)5: on the protection of medical data
Architecture Tutorial
e-Science datasets
• How to undertake peer-reviewing and validation of e-Scientific results?
Architecture Tutorial
Compliance to Regulations
• The “next-compliance” problem– Can we be certain that
by ensuring compliance to a new regulation, we do not break previous compliance?
Architecture Tutorial
Current Solutions
• Proprietary, Monolithic• Silos, Closed• Do not inter-operate
with other applications• Not adaptable to new
regulations
Architecture Tutorial
Provenance
• Oxford English Dictionary: – the fact of coming from some particular source or
quarter; origin, derivation– the history or pedigree of a work of art, manuscript,
rare book, etc.; – concretely, a record of the passage
of an item through its various
owners.
• Concept vs representation
Architecture Tutorial
Provenance in Computer Systems• Our definition of provenance in the context of
applications for which process matters to end users:
The provenance of a piece of data is the process that led to that piece of data
• Our aim is to conceive a computer-based representation of provenance that allows us to perform useful analysis and reasoning to support our use cases
Architecture Tutorial
Our Approach
• Define core concepts pertaining to provenance
• Specify functionality required to become “provenance-aware”
• Define open data models and protocols that allow systems to inter-operate
• Standardise data models and protocols• Provide a reference implementation• Provide reasoning capability
Architecture Tutorial
Context (1)
Aerospace engineering: maintain a historical record of design processes, up to 99 years.
Organ transplant management: tracking of previous decisions, crucial to maximise the efficiency in matching and recovery rate of patients
Architecture Tutorial
Context (2)
High Energy Physics: tracking, analysing, verifying data sets in the ATLAS Experiment of the Large Hadron Collider (CERN)
Bioinformatics: verification and auditing of “experiments” (e.g.for drug approval)
Architecture Tutorial
Provenance Concepts
Architecture Tutorial
Provenance “Lifecycle”
ApplicationApplication
Data Results
ProvenanceStore
Record Documentation of Execution
Query andReason overProvenance
of Data
AdministerStore and itscontents
Core Interfaces to Provenance Store
Architecture Tutorial
Nature of Documentation
• We represent the provenance of some data by documenting the process that led to the data:– documentation can be complete or partial;– it can be accurate or inaccurate; – it can present conflicting or consensual views
of the actors involved; – it can provide operational details of execution
or it can be abstract.
Architecture Tutorial
p-assertion
• A given element of process documentation will be referred to as a p-assertion
– p-assertion: is an assertion that is made by an actor and pertains to a process.
Architecture Tutorial
Service Oriented Architecture• Broad definition of service as component that takes
some inputs and produces some outputs. • Services are brought together to solve a given problem
typically via a workflow definition that specifies their composition.
• Interactions with services take place with messages that are constructed according to services interface specification.
• The term actor denotes either a client or a service in a SOA.
• A process is defined as execution of a workflow
Architecture Tutorial
M1
M2
M3
M4
Actor 1 Actor 2
I received M1, M4I sent M2, M3
I received M3I sent M4
From these p-assertions, we can derive that M3 was sent by Actor 1and received by Actor 2 (and likewise for M4)
If actors are black boxes, these assertions are not very useful because we do not know dependencies between messages
Process Documentation (1)
Architecture Tutorial
M1
M2
M3
M4
Actor 1 Actor 2
M2 is in reply to M1M3 is caused by M1M2 is caused by M4
M4 is in reply to M3
These assertions help identify order of messages,but not how data was computed
Process Documentation (2)
Architecture Tutorial
f
M1
M2
M3
M4
Actor 1 Actor 2
f1
f2
M3 = f1(M1)M2 = f2(M1,M4) M4 = f(M3)
These assertions help identify how data is computed,but provide no information about non-functional characteristics of the computation(time, resources used, etc)
Process Documentation (3)
Architecture Tutorial
M1
M2
M3
M4
Actor 1 Actor 2
I used 386 clusterRequest sat inqueue for 6min
I used sparc processor
I used algorithm x version x.y.z
Process Documentation (4)
Architecture Tutorial
Types of p-assertions (1)
– Interaction p-assertion: is an assertion of the contents of a message by an actor that has sent or received that message
I received M1, M4I sent M2, M3
Architecture Tutorial
Types of p-assertions (2)– Relationship p-assertion: is an assertion, made
by an actor, that describes how the actor obtained an output message sent in an interaction by applying some function to input messages from other interactions (likewise for data)
M2 is in reply to M1M3 is caused by M1M2 is caused by M4
M3 = f1(M1)M2 = f2(M1,M4)
Architecture Tutorial
Types of p-assertions (3)
– Actor state p-assertion: assertion made by an actor about its internal state in the context of a specific interaction
I used sparc processor
I used algorithm xversion x.y.z
Architecture Tutorial
Data flow
• Interaction p-assertions allow us to specify a flow of data between actors
• Relationship p-assertions allow us to characterise the flow of data “inside” an actor
• Overall data flow (internal + external) constitutes a DAG, which characterises the process that led to a result
Architecture Tutorial
Provenance Architecture
Architecture Tutorial
Interfaces to Provenance Store
ApplicationApplication
Results
ProvenanceStore
Record Documentation of Execution
Query andReason overProvenance
of Data
AdministerStore and itscontents
Architecture Tutorial
Architecture Tutorial
P-Assertion schemas
Architecture Tutorial
The p-structure
• The p-structure is a common logical structure of the provenance store shared by all asserting and querying actors
• Hierarchical• Indexed by interactions (interaction= 1 message
exchange)
Architecture Tutorial
Recording Protocol (Groth04-06)
• Abstract machines• DS Properties
– Termination– Liveness– Safety– Statelessness
• Documentation Properties– Immutability– Attribution– Datatype safety
• Foundation for adding necessary cryptographic techniques
Architecture Tutorial
Querying Functionality (Miles06)
• Process Documentation Query Interface: allows for “navigation” of the documentation of execution– Allows us to view the provenance store (i.e. the p-
structure) as if containing XML data structures– Independent of technology used for running
application and internal store representation– Seamless navigation of application dependent and
application independent process documentation
Architecture Tutorial
Querying Functionality (Miles06)
• Provenance Query Interface: allows us to obtain the provenance of some specific data
• A recognition that there is not “one” provenance for a piece of data, but there may be different, depending on the end-user’s interest
• Hence, provenance is seen as the result of a query:– Identify a piece of data at a specific execution point– Scope of the process of interest:
• Filter in/out p-assertions according to actors, process, types of relationships, etc
Architecture Tutorial
Standardisation
Architecture Tutorial
Standardisation Options
APIsProgrammatic
inter-op
Recording and querying
InterfacesService inter-op
Provenance ModelData inter-op
Architecture Tutorial
Purpose of Standardisation
ApplicationApplication
ProvenanceStores
Record Documentation of Execution
ApplicationApplication
Allow for multiple applications to document their execution.Applications may be running in different institutions.
Architecture Tutorial
Purpose of Standardisation
ApplicationApplication
ProvenanceStore
Record Documentation of Execution
Allow for multiple stores from multiple IT providers
ProvenanceStore
ProvenanceStore
Architecture Tutorial
Purpose of Standardisation
ProvenanceStore Query
Provenanceof Data
Allow for multiple stores from multiple IT providers
ProvenanceStore
Architecture Tutorial
Purpose of Standardisation
Allow for legacy, monolithic applications to expose theircontents (according to standard schema)
Convert in standard dataformat
Architecture Tutorial
Purpose of Standardisation
Allow third parties to host provenance stores, which are trusted by application owners but also auditors
ApplicationApplication
ProvenanceStore
Architecture Tutorial
Compliance Oriented Architectures
• Separate execution documentation from compliance verification
• Allows for multiple compliance verifications
• Allows for validation to take place across multiple applications, possibly run by different institutions (in particular, allows for outsourcing and subcontracting).
• Approach is suitable for e-scientific peer-reviewing and business compliance verification
ApplicationApplication
ProvenanceStore
Record Documentationof Execution
QueryProvenance
Of Data
Complianceverification
ApplicationApplication
ProvenanceStore
Record Documentationof Execution
Record Documentationof Execution
QueryProvenance
Of Data
ComplianceverificationComplianceverification
Architecture Tutorial
Organ Transplant ScenarioHospital
Electronic HealthcareManagement Service
Testing Lab
Architecture Tutorial
Hospital Actors
User Interface
Donor DataCollector
Brain DeathManager
Architecture Tutorial
What’s on the CD
• Documents relating to both PASOA and EU Provenance projects
• All the talks presented today• Handouts• Software
– PReServ (Paul Groth & Simon Miles)– The EU Provenance client side library
Architecture Tutorial
Conclusions
Architecture Tutorial
Standardising thedocumentation of
Business Processes
ProvenanceStore
Reco
rd
To Sum Up
Query
• Compliance check• Rerun/Reproduce• Analyse
• Provenance– Architecture– Methodology
Apply
Healthcare
DistributionFinance Aerospace
Automobile
Pharmaceutical
Slide from John Ibbotson
Architecture Tutorial
Overview of Today’s Talks
• Provenance Data Structures
• Recording and Querying Provenance– Break (30 minutes)
• Distribution and Scalability
• Security
• Methodology
Architecture Tutorial
Questions