enabling privacy in provenance- aware workflow systems susan b. davidson 1 joint work with sanjeev...
TRANSCRIPT
Enabling Privacy in Provenance-Aware Workflow Systems
Susan B. Davidson1
Joint work with
Sanjeev Khanna, Sudeepa Roy, Julia Stoyaonovich, Val Tannen 1
Yi Chen 2
Tova Milo3
1University of Pennsylvania 2Arizona State University 3Tel Aviv University
2
Science has been revolutionized
• High-throughput technologies generate floods of data, which must be analyzed to create knowledge.
• Scientists are turning to scientific workflow systems to help manage the analysis.
Scientific Workflow
Split Entries
Align Sequences
Functional Data Curate Annotations
Format-2
Format-1
Format-3
Construct Trees
3
Graphical representation of a sequence of actions to perform a task (e.g., a biological experiment)
Vertex ≡ Module (program) Edge ≡ Dataflow
Run: An execution of the workflow Actual data appears on the edges
.
TGCCGTGTGGCTAAATG…
CTGTGC
…CTAAATGTCTGTGC…
GGCTAAATGTCTGTGCCGTG
TGGCGTC…
ATCCGTGTGGCTA..
Need for provenance
4
TGCCGTGTGGCTAAATGTCTGTGC
…
CCCTTTCCGTGTGGCTAAATGTCTGTGC
…
TGCCGTGTGGCTAAATGTCTGTGC
GTCTGTGC…
TGCCGTGTGGCTAAATGTCTGTGC
GTCTGTGC…
TGCCGTGTGGCTAAATGTCTGTGC…
ATGGCCGTGTGGTCTGTGCCTAACTAACTAA…
…
Biologist’s workspace
Which sequences have been used to produce this tree?
How has this tree been generated?
?
s
Split Entries
Align Sequences
Functional Data Curate Annotations
Format
Format
Format
Construct Trees
t
Need for privacy
5
Outline
• The need: Workflow provenance repositories
• The challenge of privacy
• Privacy-aware search and query
Current situation: Workflow repositories
• To enable sharing and reuse, repositories of workflow specifications are being created– e.g. myExperiment.org– Keyword search is used to find
specifications of interest
• Several workflow systems are storing provenance information – Module executions, input parameters,
input/output data– “Input-only”
7
The Vision• “Workflow Provenance” repositories will store
specifications as well as executions (i.e. provenance information)– Searchable– Queryable– Privacy protecting
• Searching/querying these repositories can be used to– Find/reuse workflows– Understand meaning of a workflow– Correct/debug erroneous specifications– Find “bad” data and what its downstream effect
might be8
9
“You are better off designing in security and privacy… from the start, rather than trying to add them later.”
(Shapiro, CACM 53:6, June 2010)
Example: Module Privacy
11
P: (X1, X2, X3, X4, X5)
(X1, X2, X3) (X1, X2, X4, X5)
P has cancer?
Split entries
Check for Cancer
Check for Infectious disease
Create Report
report
Patient record: Gender, smoking habits, Familial environment,blood pressure, blood test report, …
P has an infectious disease?
If X1 > 60 OR
(X2 < 800 AND
X5=1) AND ….
Module functionality
should be kept secret
(From patient’s standpoint): output should not be guessed given input data values
(From module owner’s standpoint): no one should be able to simulate the module and use it elsewhere.
Example: Structural Privacy
12
M1 M2
Protein+
Functional annotation
Protein
M2 compares domains of proteins (more precise but
more time consuming)
M1 compares the entire protein
against already annotated genomes
Relationships between certain data/module pairs should be kept secret
13
Privacy concerns at a glance
P: (X1, X2, X3, X4)
(X1, X2, X3) (X1, X2, X4)
P has cancer?
P has infectious disease?
Data Privacy Data items are
private
Module Privacy Module
functionality is private (x, f(x))
Structural Privacy Execution paths
between certain data is private
Split entries
Check for cancer Check for infectious disease
Create Report
report
smoking habits, blood pressure,blood test report, ……
Check for cancer
DB
14
Module Privacy (a hint)• A module f = a function• For every input x to f, f(x) value should not be revealed
– Enough equivalent possible f(x) values w.r.t. visible information
According to required privacy
guarantee
• There is a knife, a fork and a spoon in this figure
arxiv.org/abs/1005.5543
I
O
DetermineGenetic
Susceptibility
M1
M2 EvaluateDisorder Risk
Consult External Databases
M3 Expand SNPSet
M4
GenerateDatabase Queries
QueryOMIM
CombineDisorder Sets
M5
M6
M8
W1 W2 W4
QueryPubMed M7
SNPs, ethnicity
lifestyle, family history, physical symptoms
disorders
prognosis
SNPs
disorders disorders
query query
W3
….
Workflow model, revisited Vertex ≡ Module
Atomic or composite (subworkflow)
Edge ≡ DataflowW1
W3
W2 W4
Workflow hierarchy
17
Access Views
• Prefixes of workflow hierarchy can be used to define the finest level of granularity the user can “see”
W1
W3
W2 W4 W1 W2 W4
W1 W2
W1:
I OM1 M2
I
O
DetermineGenetic
Susceptibility
M1
M2 EvaluateDisorder Risk
Consult External Databases
M3 Expand SNPSet
M4
GenerateDatabase Queries
QueryOMIM
CombineDisorder Sets
M5
M6
M8
W1 W2 W4
QueryPubMed M7
SNPs, ethnicity
lifestyle, family history, physical symptoms
disorders
prognosis
SNPs
disorders disorders
query query
Search Keyword search: “Database, disorder risk” Result should be concise and informative
W1 W2 W4
“WISE: a Workflow Information Search Engine”,(ICDE 2009)
I
O
M2
EvaluateDisorder Risk
M3
Expand SNPSet
GenerateDatabase Queries
QueryOMIM
CombineDisorder Sets
M5
M6
M8
QueryPubMed
M7
SNPs, ethnicity
lifestyle, family history, physical symptoms
disorders
prognosis
SNPs
disorders
disorders
query query
Result of “database, disorder risk”
W1 W2 W4
I
O
DetermineGenetic
Susceptibility
M1
M2 EvaluateDisorder Risk
Consult External Databases
M3 Expand SNPSet
M4
W1 W2
SNPs, ethnicity
lifestyle, family history, physical symptoms
disorders
prognosis
SNPs
Result of “database, disorder risk”
W1 W2
I
O
M2
EvaluateDisorder Risk
M3
Expand SNPSet
SNPs, ethnicity
lifestyle, family history, physical symptoms
prognosis
SNPs
disorders
M4 Consult External Databases
21
Queries
• “Workflows in which Expand SNP is performed before Query PubMed.”
• “Data items that were used to produce d19.”
• …
• Shares with search the need to match certain terms and give a view of the result and preserve privacy.
22
Research Challenges• More ways of enabling scientists to construct
workflows
• Formalizing privacy notions
• Provenance query language?
• Efficiently identifying data that users can access– users may have different privileges, yielding many
different access views.
• …
Acknowledgments• Members of the PennDB
group– Susan Davidson– Sanjeev Khanna– Sudeepa Roy– Julia Stoyanovich– Val Tannen
• Friend of PennDB and Tel Aviv faculty– Tova Milo
• PennDB alum and ASU faculty– Yi Chen
• Bioinformatics collaborators– Sarah Cohen-BoulakiaThis work was supported in part by NSF IIS-0803524, IIS-0629846, IIS-0915438, CCF-0635084, and IIS-0904314; NSF CAREER award IIS-0845647; and CRA 0937060