![Page 1: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/1.jpg)
1
A Formal Approach to Finding Explanations for Database Queries
Sudeepa Roy, Dan Suciu
University of Washington, Seattle
![Page 2: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/2.jpg)
2
We need to understand “Big Data”
ref. Big data whitepaper, Jagadish et al., 2011-12
D1
D2
D3
Data Analysis System
1. Acquire Data 2. Prepare Data (Clean, Extract Feature, Integrate) 3. Store in DB
4. Run Queries 5. Plot Graphs 6. Ask Questions!
Do you have an explanation?
![Page 3: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/3.jpg)
3
Sample Questions
• Why is there a peak for #SIGMOD papers from industry during 2000-06, while #academia papers kept increasing?
• Why is #SIGMOD papers < #PODS papers in UK?
Dataset: Pre-processed DBLP + Affiliation data
Disclaimer: Not all authors have affiliation info
Explanations by our approach at the end
![Page 4: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/4.jpg)
4
Ideal goal: Why = Causality
“What was the cause of the observation?”
• Not simple association or correlation
• e.g. people having a headache drink coffee: does coffee cause headache? Does headache lead to drinking coffee?
![Page 5: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/5.jpg)
5
But, causality is hard…
• Has been studied for many years (Hume, 1748)
• Extensive study in AI over the last decade by Judea Pearl using the notion of intervention: X is a cause of Y if removal of X also removes Y, keeping other conditions unchanged
• Needs controlled experiments
• Not always possible with a database
![Page 6: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/6.jpg)
6
Realistic Database-y goal: Why = Explanation

| Causality | Explanation |
|---|---|
| Controlled experiment | Input database and observed query outputs |
| Causal paths | PK-FK constraints and their generalization |
| Intervention | Remove input tuples; query output should change |
| Top causes | Top explanations change the output in the expected direction to a greater extent |
![Page 7: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/7.jpg)
7
Previous/Related Work
• Causality in databases
– Meliou et al.’10, Meliou et al.’11
– Pearl’s notion of causality and intervention
– Causal structure from input to output by lineage
– Cause = individual input tuples, not predicates
– No inherent causal structure in input data
• Explanations in databases
– Explaining outliers in aggregate queries: Wu-Madden’13
– Specific applications (Map-Reduce, access log, user rating, …): e.g. Khoussainova et al.’12, Fabbri et al.’12, Das et al.’11
– Informally use intervention
– Explanation = predicate
– Mostly single table, no join
• Other related topics
– Provenance, deletion propagation: e.g. Green et al.’07, Buneman et al.’01
– Missing answer/Why-Not: e.g. Herschel et al.’09, Huang et al.’10, Chapman-Jagadish’09
– Finding causal structure/data mining: e.g. Silverstein et al.’00
– OLAP: e.g. Sarawagi-Sathe’01
Upcoming VLDB 2014 Tutorial: “Causality and Explanations in Databases” (Alexandra Meliou, Sudeepa Roy, Dan Suciu)
This work:
• Formal framework of explanations (= predicates) and theoretical analysis
– causal structure within input data, independent of queries or user questions
– allow multiple tables and joins
• Optimizations and Evaluation
– find top explanations using data cube
![Page 8: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/8.jpg)
8
Outline
• Framework
• Causal Paths and Intervention
• Computing Intervention
• Optimization: Ranking Explanations by Data Cube
• Evaluation
• Future Work
![Page 9: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/9.jpg)
9
Input and Output
Run Group-By Queries and Plot
Toy DBLP database Output Plot
User question
• Numerical expression E
• Direction: high/low
e.g. E = (q1/q3) / (q2/q4), Direction = high
i.e. why is q1/q3 > q2/q4?
e.g. q1
select count(distinct z.pubid)
from Author x, Authored y, Publication z
where x.id = y.id and y.pubid = z.pubid
  and z.venue = 'SIGMOD'
  and 2000 <= z.year and z.year <= 2004
  and x.domain = 'com'
The constants in the query (e.g. the year range and the domain) will vary for q2, q3, q4
Input
Explanation(s) ф: predicate on attributes, e.g.
• [name = ‘JG’]
• [name = ‘JG’] ∧ [inst = ‘C.edu’]
• [name = ‘JG’] ∧ [year = 2007]
Note: attributes can come from multiple tables
Output
E should change when the database is “intervened” with ф
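A minimal sketch of how E could be evaluated from the four group-by counts, using Python's built-in sqlite3. The rows below are made up for illustration; only the schema, the shape of q1, and the (domain, year range) settings for q1..q4 follow the slides. The column name `domain` follows the query text, and the helper `run` is a hypothetical name.

```python
import sqlite3

# Hypothetical toy DBLP instance: the rows are invented, the schema follows the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Author(id text, name text, inst text, domain text);
create table Authored(id text, pubid text);
create table Publication(pubid text, year int, venue text);
insert into Author values
  ('a1','JG','C.edu','edu'), ('a2','RR','M.com','com'), ('a3','CM','I.com','com');
insert into Authored values
  ('a1','p1'), ('a2','p1'), ('a2','p2'), ('a3','p3'),
  ('a1','p4'), ('a1','p5'), ('a3','p6');
insert into Publication values
  ('p1',2001,'SIGMOD'), ('p2',2002,'SIGMOD'), ('p3',2007,'SIGMOD'),
  ('p4',2008,'SIGMOD'), ('p5',2003,'SIGMOD'), ('p6',2001,'SIGMOD');
""")

# Same shape as q1 on the slide, with the constants turned into parameters.
Q = """
select count(distinct z.pubid)
from Author x, Authored y, Publication z
where x.id = y.id and y.pubid = z.pubid
  and z.venue = ? and ? <= z.year and z.year <= ? and x.domain = ?
"""

def run(domain, first_year, last_year, venue="SIGMOD"):
    """Return the number of distinct papers for one (domain, year range) setting."""
    return conn.execute(Q, (venue, first_year, last_year, domain)).fetchone()[0]

q1 = run("com", 2000, 2004)
q2 = run("com", 2007, 2011)
q3 = run("edu", 2000, 2004)
q4 = run("edu", 2007, 2011)

# User question: why is E = (q1/q3) / (q2/q4) high?
E = (q1 / q3) / (q2 / q4)
print(q1, q2, q3, q4, E)
```

On this toy instance E is above 1, which is the kind of observation a user would ask about with Direction = high.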
![Page 10: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/10.jpg)
10
Causal Paths by Foreign Key Constraints
• Causal path X → Y: removing X removes Y
• Analogy in DB: foreign key constraints and cascade delete semantics
Author(id, name, inst, dom)
Authored(id, pubid)
Publication(pubid, year, venue)
Standard F.K. (cascade delete)
Back and Forth F.K. (cascade delete + reverse cascade delete)
(Figure labels: Forward, Reverse)
Intuition:
• An author can still exist if one of her papers is deleted
• A paper cannot exist if any of its co-authors is deleted
![Page 11: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/11.jpg)
11
Causal Paths and Intervention
Candidate explanation ф: [name = ‘RR’]
(Figure labels: Forward, Reverse)
Intervention Δф: tuples T0 that satisfy ф + tuples reachable from T0
Given ф, computation of Δф requires a recursive query
Multiple tables require a universal table
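The reachability idea can be sketched as a small fixpoint loop. This is an illustration under stated assumptions, not the authors' program: the predicate is restricted to a single author name, the rows are made up, and the paths are read off the intuition on the foreign-key slide (deleting an author removes her Authored rows, deleting an Authored row removes the paper, deleting a paper removes its other Authored rows, and authors are never removed through their papers). The function name `intervention` is hypothetical.

```python
# Hypothetical toy instance: Author(id -> name), Authored(id, pubid), Publication(pubid).
authors = {"a1": "JG", "a2": "RR", "a3": "CM"}
authored = {("a1", "p1"), ("a2", "p1"), ("a2", "p2"), ("a3", "p3")}
publications = {"p1", "p2", "p3"}

def intervention(name):
    """Return (authors, authored rows, publications) deleted for the predicate [name = ...]."""
    gone_authors = {a for a, n in authors.items() if n == name}  # T0: tuples satisfying the predicate
    gone_authored, gone_pubs = set(), set()
    while True:
        # Cascade delete: Authored rows that reference a deleted author or a deleted paper.
        step_authored = {(a, p) for (a, p) in authored
                         if a in gone_authors or p in gone_pubs}
        # Reverse cascade delete: a paper goes away once any of its Authored rows is gone.
        step_pubs = {p for (_, p) in step_authored}
        if step_authored == gone_authored and step_pubs == gone_pubs:
            return gone_authors, gone_authored, gone_pubs  # fixpoint reached
        gone_authored, gone_pubs = step_authored, step_pubs

print(intervention("RR"))
```

For [name = ‘RR’] the loop removes RR, both of RR's papers, and the Authored row linking JG to the shared paper, while JG herself stays in the database.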
![Page 12: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/12.jpg)
12
Objective: top-k explanations
• Consider the user question: why is E = (q1/q2)/(q3/q4) low?
• Find the top-k explanations ф w.r.t. a score: score(ф) = E(D − Δф), where Δф is the intervention for ф
The obvious approach has two sources of complexity (the loop over predicates and the recursion):
1. For all possible predicates ф
– Compute the intervention Δф for ф (requires recursion)
– Delete the tuples in Δф from D
– Evaluate q1, q2, q3, q4 on D − Δф
– Compute E(D − Δф)
2. Find the top explanations with the highest scores E(D − Δф) (top-k)
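The obvious approach can be sketched as a plain loop. To keep it short, this sketch assumes a single table, so the intervention for a predicate is just the set of rows that satisfy it and no recursion is needed; it also only enumerates predicates of the form [inst = v]. The rows and the names `ROWS`, `count`, `E` and `top_k` are hypothetical.

```python
from fractions import Fraction
import heapq

# Hypothetical single-table instance: one row per paper, (inst, domain, period).
ROWS = [
    ("M.com", "com", "2000-04"), ("I.com", "com", "2000-04"),
    ("M.com", "com", "2007-11"), ("M.com", "com", "2007-11"), ("I.com", "com", "2007-11"),
    ("C.edu", "edu", "2000-04"), ("C.edu", "edu", "2000-04"), ("C.edu", "edu", "2000-04"),
    ("U.edu", "edu", "2000-04"),
    ("C.edu", "edu", "2007-11"), ("U.edu", "edu", "2007-11"),
]

def count(rows, domain, period):
    return sum(1 for (_, d, p) in rows if d == domain and p == period)

def E(rows):
    """E = (q1/q2)/(q3/q4); returns None when a denominator would be zero."""
    q1 = count(rows, "com", "2000-04")
    q2 = count(rows, "com", "2007-11")
    q3 = count(rows, "edu", "2000-04")
    q4 = count(rows, "edu", "2007-11")
    if 0 in (q2, q3, q4):
        return None
    return Fraction(q1, q2) / Fraction(q3, q4)

def top_k(rows, k):
    """Score every predicate [inst = v] by E(D - delta) and keep the k highest."""
    scored = []
    for inst in sorted({r[0] for r in rows}):
        remaining = [r for r in rows if r[0] != inst]  # D minus the intervention
        score = E(remaining)
        if score is not None:
            scored.append((score, inst))
    return heapq.nlargest(k, scored)

print(E(ROWS), top_k(ROWS, 2))
```

Even on this toy input the loop re-evaluates all four queries once per predicate, which is the cost the data cube optimization is meant to avoid.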
![Page 13: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/13.jpg)
13
Computing Δф by a Recursive Program
Properties:
1. The program has a unique least fixpoint, which can be obtained in poly-time (n = |D| steps)
2. The program is not monotone in the database
• i.e., if D ⊆ D’, not necessarily Δф(D) ⊆ Δф(D’)
• Therefore not expressible in datalog
• But expressible in datalog + negation
Annotations on the program (ф is fixed):
• Delete from the universal table the tuples that satisfy ф (tuples |= ф)
• Cascade delete
• Reverse cascade delete
![Page 14: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/14.jpg)
14
Convergence Depends on Schema
• Convergence in ≤ 4 steps
[Figure: schema graph over relations R, S, T]
• Convergence requires n − 1 steps
Can be generalized
![Page 15: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/15.jpg)
15
Finding Top-k Explanations with Data Cube
For all possible predicates ф
– Compute the intervention Δф for ф
– …..
• #Possible predicates is huge
• Running the FOR LOOP is expensive
• Running the RECURSION is expensive
Optimization: OLAP data cube
User question: why is (q1*q4)/(q2*q3) high?
Suppose we want predicates on attributes [name, inst, venue] as explanations
e.g. Query for q1: add "group by name, inst, venue with cube"
e.g. Cube for q1: columns name, inst, venue
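What the cube for q1 contains can be emulated in a few lines of Python. This is only a sketch of the semantics (a DBMS would compute it with the group-by clause shown on the slide): every row contributes to each of the 2^3 groupings of (name, inst, venue), with None playing the role of the "-" (don't care) entries. The rows and the function name `cube` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

ATTRS = ("name", "inst", "venue")

# Hypothetical join result for q1, one row per (author, paper): (name, inst, venue, pubid).
ROWS = [
    ("RR", "M.com", "SIGMOD", "p1"),
    ("RR", "M.com", "SIGMOD", "p2"),
    ("CM", "I.com", "SIGMOD", "p2"),
]

def cube(rows):
    """count(distinct pubid) for every grouping of ATTRS; None means 'any value'."""
    groups = defaultdict(set)
    n = len(ATTRS)
    for row in rows:
        values, pubid = row[:n], row[n]
        for size in range(n + 1):
            for keep in combinations(range(n), size):
                key = tuple(values[i] if i in keep else None for i in range(n))
                groups[key].add(pubid)
    return {key: len(pubs) for key, pubs in groups.items()}

q1_cube = cube(ROWS)
print(q1_cube[(None, None, None)], q1_cube[("RR", None, None)], q1_cube[("CM", None, None)])
```

Each key of the result corresponds to one candidate predicate, so a single cube computation covers all conjunctive equality predicates over the chosen attributes.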
![Page 16: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/16.jpg)
16
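Sketch of Algorithm with Data Cube

Cube for q1 (com, 2000-04):

| name | inst | venue | q1(ф) |
|---|---|---|---|
| - | - | - | 1 |
| RR | - | - | 1 |
| - | M.com | - | 1 |
| … | … | … | … |

Cube for q2 (com, 2007-11):

| name | inst | venue | q2(ф) |
|---|---|---|---|
| - | - | - | 1 |
| CM | - | - | 1 |
| - | I.com | - | 1 |
| … | … | … | … |

Cube for q3 (edu, 2000-04):

| name | inst | venue | q3(ф) |
|---|---|---|---|
| - | - | - | 1 |
| JG | - | - | 1 |
| - | C.edu | - | 1 |
| … | … | … | … |

Cube for q4 (edu, 2007-11):

| name | inst | venue | q4(ф) |
|---|---|---|---|
| - | - | - | 1 |
| JG | - | - | 1 |
| - | C.edu | - | 1 |
| … | … | … | … |

Joined table:

| name | inst | year | E(D − Δф) (score) |
|---|---|---|---|
| - | - | - | |
| JG | - | - | |
| RR | - | - | |
| - | C.edu | - | |

1. (Outer-)join the cubes + compute the score
2. Run top-k
• All computation is done by the DBMS
• But:
− The cube algorithm matches the theory for some inputs (e.g. single table, DBLP examples)
− For other inputs it is a heuristic (recursion is necessary)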
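The two steps can be sketched in Python on top of precomputed cubes. Two assumptions are made here and are not taken from the slides: (1) each query value after intervention is the total minus the cube entry, q_i(D − Δф) = q_i(D) − cube_i[ф], which is the kind of input where the cube algorithm is said to match the theory; (2) for a "high" question, explanations are ranked by the smallest E(D − Δф). The cube contents and the names `scores` and `top_k` are hypothetical.

```python
from fractions import Fraction

ALL = (None, None, None)  # the "- - -" row: query value on the whole database

# Hypothetical cubes keyed by (name, inst, venue); None stands for "-".
cube1 = {ALL: 6, ("RR", None, None): 3, (None, "M.com", None): 4}
cube2 = {ALL: 3, ("CM", None, None): 1, (None, "I.com", None): 2}
cube3 = {ALL: 4, ("JG", None, None): 1, (None, "C.edu", None): 2}
cube4 = {ALL: 8, ("JG", None, None): 4, (None, "C.edu", None): 5}
CUBES = (cube1, cube2, cube3, cube4)

def scores(cubes):
    """Outer-join the cubes on the predicate key and compute E = (q1*q4)/(q2*q3) after removal."""
    keys = set().union(*cubes) - {ALL}
    table = {}
    for key in keys:
        # A key missing from a cube means the predicate removes nothing from that query.
        q1, q2, q3, q4 = (c[ALL] - c.get(key, 0) for c in cubes)
        if q2 and q3:  # skip predicates that would make a denominator zero
            table[key] = Fraction(q1 * q4, q2 * q3)
    return table

def top_k(table, k):
    """For a 'why high' question: the k predicates whose removal lowers E the most."""
    return sorted(table.items(), key=lambda kv: (kv[1], str(kv[0])))[:k]

print(top_k(scores(CUBES), 2))
```

In a DBMS the same computation would be an outer join of the four cube tables followed by an order-by with a limit, so no per-predicate loop is needed.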
![Page 17: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/17.jpg)
17
Experiment 1: Data Cube Optimization vs. Iterative Algo
Natality Dataset 2010 (from the National Center for Health Statistics, NCHS): single table with 233 attributes, ~4M entries, 2.89 GB size.
More experiments in the paper
Data size vs. time
![Page 18: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/18.jpg)
18
Experiment 2: Scalability of the Data Cube Optimization
| | E1 | E2 |
|---|---|---|
| User question | Why (q1/q2) low | Why (q1/q2)/(q3/q4) low |
| # Aggregate queries in the user question | 2 | 4 |
| # Data cubes computed | 2 | 4 |
| # Joins of the data cubes to compute final table | 1 | 3 |

[Plots for E1 and E2: data size vs. time; (max) no. of attributes in explanation predicates vs. time]
![Page 19: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/19.jpg)
19
Qualitative Evaluation (DBLP)
Q. Why is there a peak for #SIGMOD papers from industry during 2000-06, while #academia papers kept increasing?
Intuition:
1. If we remove these industrial labs and their senior researchers, the peak during 2000-04 is more flattened
2. If we remove these universities with relatively new but highly prolific DB groups, the curve for academia is less increasing
Hard due to lack of gold standard
![Page 20: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/20.jpg)
20
Qualitative Evaluation (DBLP)
Q. Why is #SIGMOD papers < #PODS papers in UK?
Intuition: If we remove these leading theoretical DB researchers or their universities/cities, the bar for UK will look different.
e.g. for UK:
• Originally: PODS = 62%, SIGMOD = 38%
• Removing all publications by Libkin: PODS = 46%, SIGMOD = 54%
Paper counts shown in the figure (P = PODS, S = SIGMOD; source: DBLP):
P = 32, S = 3; P = 24, S = 1; P = 9, S = 0; P = 15, S = 2
Not top explanations: Wenfei Fan, Peter Buneman, …
P = 15, S = 12; P = 6, S = 12
![Page 21: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/21.jpg)
21
Current/Future Work
• Optimize for arbitrary SPJUA queries and schemas
– Go beyond data cube
• Model increasing/decreasing trend by linear regression (E = slope)
• Ranking algorithm: simple, meaningful, diverse explanations
• Prototype with a GUI
![Page 22: A Formal Approach to Finding Explanations for Database Queries](https://reader037.vdocuments.site/reader037/viewer/2022102819/56814363550346895dafdf4b/html5/thumbnails/22.jpg)
22
Thank you
Questions?