(artificially) intelligent systems for data science · 2019. 3. 8. · group by a.area, d.nat = us...
TRANSCRIPT
-
(Artificially) Intelligent Systems for Data Science
Sanjay Krishnan Computer Science, University of Chicago
-
id Masters Area Admitted?
1 Y AI/ML Y
2 Y HCI N
3 Y Systems Y
4 N AI/ML N
Research Question: Are Admissions Fair to International Students?
id Gender Nat Ugrad
1 M US MIT
2 F CN Tsinghua
3 M US UCB
4 F UK Imperial
Area School Rate
AI/ML MIT 0.1
HCI UCB 0.25
Systems CMU 0.26
AI/ML U Chi 0.17
-
“Data” System
US Admission Rate: 0.16 Intl Admission Rate: 0.07
analyze_this(data)
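A minimal sketch of the kind of computation that could sit behind a call like `analyze_this(data)`, assuming it joins the admissions and demographics tables on `id` and averages the admitted flag for US and international applicants. The function name `admission_rates` and the Y/N-to-number mapping are illustrative assumptions. The four sample rows are the ones shown on the slide, so they do not reproduce the 0.16 and 0.07 figures.

```python
# Sample rows copied from the slide's admissions and demographics tables.
admissions = [
    {"id": 1, "Masters": "Y", "Area": "AI/ML", "Admitted": "Y"},
    {"id": 2, "Masters": "Y", "Area": "HCI", "Admitted": "N"},
    {"id": 3, "Masters": "Y", "Area": "Systems", "Admitted": "Y"},
    {"id": 4, "Masters": "N", "Area": "AI/ML", "Admitted": "N"},
]
demographics = [
    {"id": 1, "Gender": "M", "Nat": "US", "Ugrad": "MIT"},
    {"id": 2, "Gender": "F", "Nat": "CN", "Ugrad": "Tsinghua"},
    {"id": 3, "Gender": "M", "Nat": "US", "Ugrad": "UCB"},
    {"id": 4, "Gender": "F", "Nat": "UK", "Ugrad": "Imperial"},
]


def admission_rates(admissions, demographics):
    """Join on id, then average the admitted flag for US vs. Intl applicants."""
    nat_by_id = {row["id"]: row["Nat"] for row in demographics}
    totals = {"US": [0, 0], "Intl": [0, 0]}  # group -> [admitted, applicants]
    for row in admissions:
        group = "US" if nat_by_id[row["id"]] == "US" else "Intl"
        totals[group][0] += row["Admitted"] == "Y"  # True counts as 1
        totals[group][1] += 1
    return {g: adm / n for g, (adm, n) in totals.items() if n}


rates = admission_rates(admissions, demographics)
```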
-
Today’s Databases Are “Passive”
“Data” System
US Admission Rate: 0.16 Intl Admission Rate: 0.07
-
US Admission Rate: 0.16 Intl Admission Rate: 0.07
analyze_this(data)
The Case For “Active” Databases
Data Input: Interface With Data Collection and Hardware
Analysis Input: Interface With Analyst
-
US Admission Rate: 0.16 Intl Admission Rate: 0.07
analyze_this(data)
The Case For “Active” Databases
Auto-tune for performance
Pre-compute quantities that might be useful
Identify bugs in analysis
-
System Controller
State x
Control u
Approximate Sequential Decision Making
Empirical solutions to black-box control
Predict Best Action
Cost(x,u)
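The loop on this slide (observe state x, pick control u, pay Cost(x,u)) can be sketched as an empirical controller that predicts the cheapest action from costs it has already seen. This is an illustration only: the class `EmpiricalController`, the running-mean estimate, the exploration rate, and the toy `system_cost` function are assumptions, not the method from the talk.

```python
import random
from collections import defaultdict


class EmpiricalController:
    """Keeps a running mean of observed Cost(x, u) and picks the cheapest control."""

    def __init__(self, controls, explore=0.1, seed=0):
        self.controls = list(controls)
        self.explore = explore          # probability of trying a random control
        self.rng = random.Random(seed)
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def predict_cost(self, x, u):
        n = self.count[(x, u)]
        # Untried pairs are predicted as 0.0, so they get tried once
        # (this assumes real costs are positive).
        return self.total[(x, u)] / n if n else 0.0

    def act(self, x):
        if self.rng.random() < self.explore:
            return self.rng.choice(self.controls)
        return min(self.controls, key=lambda u: self.predict_cost(x, u))

    def observe(self, x, u, cost):
        self.total[(x, u)] += cost
        self.count[(x, u)] += 1


def system_cost(x, u):
    """Hypothetical black-box system: an index lookup is cheaper than a scan."""
    return {"scan": 5.0, "index": 1.0}[u]


ctrl = EmpiricalController(["scan", "index"], explore=0.0)
for _ in range(5):
    u = ctrl.act("q1")
    ctrl.observe("q1", u, system_cost("q1", u))
best = ctrl.act("q1")
```

With exploration turned off, the controller tries each control once and then settles on the one with the lower observed cost.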
-
US Admission Rate: 0.16 Intl Admission Rate: 0.07
analyze_this(data)
The Case For “Active” Databases
Auto-tune for performance
Pre-compute quantities that might be useful
Identify bugs in analysis
-
Find Admission Rate, Sort By Biggest Difference With Historical Rate
SELECT AVG(a.Admitted) AS arate, AVG(a.Admitted - h.Rate) AS diff
FROM a, d, h
WHERE a.id = d.id AND a.Area = h.Area AND h.School = 'U Chi'
GROUP BY a.Area, d.Nat = 'US'
ORDER BY diff
Option 1. Scan admissions first and then look up demographics and historical.
Option 2. Build a lookup table on historical and admissions and then scan demographics.
Option 3. Sort historical by area/school first and then combine the info.
…
Database Self-Optimization
Find the best execution path for a SQL query.
id Masters Area Admitted?
1 Y AI/ML Y
2 Y HCI N
3 Y Systems Y
4 N AI/ML N
id Gender Nat Ugrad
1 M US MIT
2 F CN Tsinghua
3 M US UCB
4 F UK Imperial
Area School Rate
AI/ML MIT 0.1
HCI UCB 0.25
Systems CMU 0.26
AI/ML U Chi 0.17
-
Today's Query Optimization
Plan
agg(Join( {a, d, h} ))
10*rows_created + 0.3*mem_used + min(network, memory)
SELECT AVG(a.Admitted) AS arate, AVG(a.Admitted - h.Rate) AS diff
FROM a, d, h
WHERE a.id = d.id AND a.Area = h.Area AND h.School = 'U Chi'
GROUP BY a.Area, d.Nat = 'US'
ORDER BY diff
Cost Model
Option 1 Option 2
1200 120
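As a worked example of the cost expression on this slide, the sketch below evaluates `10*rows_created + 0.3*mem_used + min(network, memory)` for two candidate plans and picks the cheaper one. The per-plan statistics are hypothetical and are not the inputs behind the 1200 and 120 figures on the slide.

```python
def plan_cost(rows_created, mem_used, network, memory):
    """Hand-written cost formula, as written on the slide."""
    return 10 * rows_created + 0.3 * mem_used + min(network, memory)


# Hypothetical statistics for two candidate plans (not taken from the talk).
candidates = {
    "Option 1": dict(rows_created=40, mem_used=200, network=30, memory=20),
    "Option 2": dict(rows_created=5, mem_used=100, network=10, memory=25),
}

costs = {name: plan_cost(**stats) for name, stats in candidates.items()}
cheapest = min(costs, key=costs.get)
```

A conventional optimizer trusts this estimate as is; the next slide's point is that the estimate can be wrong, which is why measuring actual runtime matters.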
-
Actual Runtime
Let the system experiment!
p1, p1’
The system strategically runs experiments where it believes the cost estimate is inaccurate.
-
Optimal Plans Are Composed of Optimal Sub-plans
Join( {a, d, h} )
Best( {a} )
Best( {d} )
Best( {h} )
Best( {a, d} )
Best( {d, h} )
Best( {a, h} )
Best( {a, d, h} )
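The decomposition above, where Best of a larger set is assembled from Best of its subsets, can be sketched as bottom-up dynamic programming over relation subsets. This is the classical exact approach, shown here only to make the structure concrete; the cardinalities in `rows` and the `join_cost` function are hypothetical.

```python
from itertools import combinations


def best_plans(relations, join_cost):
    """Bottom-up dynamic program: Best(S) is built from Best of smaller subsets."""
    # A single relation has no join cost; its plan is just its name.
    best = {frozenset([r]): (0.0, r) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, size)):
            for k in range(1, size):
                for left in map(frozenset, combinations(sorted(subset), k)):
                    right = subset - left
                    cost = best[left][0] + best[right][0] + join_cost(left, right)
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, (best[left][1], best[right][1]))
    return best


# Hypothetical row-count estimates for each relation and intermediate result.
rows = {
    frozenset(k): v
    for k, v in {"a": 1000, "d": 1000, "h": 10,
                 "ad": 1000, "ah": 50, "dh": 10000}.items()
}


def join_cost(left, right):
    """Hypothetical cost: product of the two input sizes."""
    return rows[left] * rows[right]


best = best_plans(["a", "d", "h"], join_cost)
total_cost, plan = best[frozenset("adh")]
```

Under these made-up sizes the cheapest plan joins a with h first and then brings in d. The table grows exponentially with the number of relations, which is the cost the learned approach in the following slides tries to avoid.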
-
Reduces Planning To Statistical Estimation
“Predict how valuable it is to make a certain subplan.”
approx_q(s,a)
Subprogram Sequences
Estimate Long Term Value
Fn Approximator
(AD, H(AD), 5.14) (AH, (AH)D, 10.96) … (DH, (DH)A, 1.45)
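The tuples above can be read as (current subplan, next plan, value) observations for `approx_q`. The sketch below is a stand-in under stated assumptions: a plain lookup table replaces the function approximator, lower values are treated as better (matching the arg min used later in the talk), and the helper `best_transition` is a hypothetical name.

```python
# Observations in the slide's (current subplan, next plan, value) form.
observations = [
    ("AD", "H(AD)", 5.14),
    ("AH", "(AH)D", 10.96),
    ("DH", "(DH)A", 1.45),
]


def fit_q(observations):
    """Average the observed values for each (current, next) pair."""
    sums, counts = {}, {}
    for current, nxt, value in observations:
        key = (current, nxt)
        sums[key] = sums.get(key, 0.0) + value
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


def approx_q(table, current, nxt):
    """Look up the estimated long-term value; unseen pairs are treated as worst."""
    return table.get((current, nxt), float("inf"))


def best_transition(table, candidates):
    """Greedy choice: the (current, next) pair with the smallest estimate."""
    return min(candidates, key=lambda pair: approx_q(table, *pair))


q_table = fit_q(observations)
choice = best_transition(q_table, [(c, n) for c, n, _ in observations])
```

A real approximator would generalize to subplans it has never seen, which a lookup table cannot do.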
-
Real Systems/Workloads
Join( {E, S, T} )
Best( {E} )
Best( {S} )
Best( {T} )
Best( {E, S} )
Best( {S, T} )
Best( {E, T} )
Best( {E, S, T} )
e: 142
e: 131
e: 136
e: 216
e: 341
e: 340
e: 540
𝜋*(s) = arg min q(current, next)
IMDB Workload    Planning Time         Execution Time
PostgreSQL       Up to 32.4x Faster    Avg. 1.12x Faster
Apache Spark     Up to 276x Faster     Avg. 3.02x Faster
-
US Admission Rate: 0.16 Intl Admission Rate: 0.07
analyze_this(data)
The Case For “Active” Databases
Auto-tune for performance
Pre-compute quantities that might be useful
Identify bugs in analysis
-
SELECT AVG(a.Admitted) AS arate, AVG(a.Admitted - h.Rate) AS diff
FROM a, d, h
WHERE a.id = d.id AND a.Area = h.Area AND h.School = 'U Chi'
GROUP BY a.Area, d.Nat = 'US'
ORDER BY diff
Opportunistic Materialization
Preemptively compute important results
-
Opportunistic Materialization
Preemptively compute important results
Res 1
Res 2
Res 3
Candidate Results
Creation/ Deletion System
Decision at 10:59pm Net Benefit 201
Decision at 11:19pm Net Benefit -100
Decision at 11:27pm Net Benefit -120
-
Reduces Planning To Statistical Estimation
“Predict how valuable it is to persist a result in the current state of the database.”
approx_q(s,a)
Completed Experiments
Estimate Long Term Value
Fn Approximator
(Q0, db_state_0, +5.14) (Q1, db_state_1, -1.02) …
(QN, db_state_N, +15.34)
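One way to picture the persist-or-not decision is a net-benefit rule: keep a candidate result only if the time it is expected to save exceeds what it costs to create and hold. The sketch below assumes that simple formula; the `Candidate` fields and every number in it are hypothetical, and in the talk's framing the value would come from the learned estimate rather than hand-set inputs.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    recompute_seconds: float  # time saved each time the result is reused
    expected_reuses: float    # predicted number of future reuses
    creation_seconds: float   # one-off cost of writing the result
    storage_penalty: float    # cost of holding it, in the same units


def net_benefit(c):
    """Assumed formula: expected savings minus creation and storage costs."""
    return c.recompute_seconds * c.expected_reuses - c.creation_seconds - c.storage_penalty


def choose_to_persist(candidates):
    """Persist only the candidates whose estimated net benefit is positive."""
    return [c.name for c in candidates if net_benefit(c) > 0]


# Hypothetical candidate results (not the values shown on the slide).
candidates = [
    Candidate("Res 1", recompute_seconds=12, expected_reuses=5,
              creation_seconds=10, storage_penalty=5),
    Candidate("Res 2", recompute_seconds=4, expected_reuses=1,
              creation_seconds=10, storage_penalty=2),
]
persisted = choose_to_persist(candidates)
```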
-
Simple Caching
Complete Foresight
US
10x overall runtime reduction
-
US Admission Rate: 0.16 Intl Admission Rate: 0.07
analyze_this(data)
The Case For “Active” Databases
Auto-tune for performance
Pre-compute quantities that might be useful
Identify bugs in analysis
-
What does this result tell us about the dataset?
US Admission Rate: 0.16 Intl Admission Rate: 0.07
id Masters Area Admitted?
1 Y AI/ML Y
2 Y HCI N
3 Y Systems Y
4 N AI/ML N
Actually…not much
US Admissions > Intl Admission in Each Area
US Admissions >> Intl Admission in a few Areas
US Admissions < Intl Admission in Each Area, but Intl is over-represented in selective areas
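The third scenario is an instance of Simpson's paradox: the aggregate rate can favor US applicants even when every area favors international applicants. The sketch below uses hypothetical applicant counts, chosen only to show the effect, and does not use the talk's data.

```python
# Hypothetical counts: (area, group) -> (admitted, applicants).
counts = {
    ("AI/ML", "US"): (10, 100),
    ("AI/ML", "Intl"): (108, 900),   # most Intl applicants target the selective area
    ("HCI", "US"): (270, 900),
    ("HCI", "Intl"): (32, 100),
}


def rate(pairs):
    """Pooled admission rate over a list of (admitted, applicants) pairs."""
    admitted = sum(a for a, _ in pairs)
    applicants = sum(n for _, n in pairs)
    return admitted / applicants


groups = ("US", "Intl")
areas = ("AI/ML", "HCI")

overall = {
    g: rate([v for (_, grp), v in counts.items() if grp == g]) for g in groups
}
per_area = {
    area: {g: rate([counts[(area, g)]]) for g in groups} for area in areas
}
```

Here `overall` gives 0.28 for US against 0.14 for Intl, while `per_area` shows Intl ahead in both areas (0.12 against 0.10, and 0.32 against 0.30).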
-
Automatically Synthesize Explanations
US Admission Rate: 0.16 Intl Admission Rate: 0.07
SELECT AVG(a.Admitted) AS arate
FROM a, d
WHERE a.id = d.id
GROUP BY d.Nat = 'US'
id Masters Area Admitted?
1 Y AI/ML Y
2 Y HCI N
3 Y Systems Y
4 N AI/ML N
US Admissions > Intl Admission in Each Area
US Admissions >> Intl Admission in a few Areas
US Admissions < Intl Admission in Each Area, but Intl is over-represented in selective areas
Automatic Search For Confounding Variables
-
[Figure: P[admit], US > Not US; US vs. Not US compared where US.Area = Intl.Area]
Want analyses to be homogeneous
-
Automatically Synthesize Explanations
US Admission Rate: 0.16 Intl Admission Rate: 0.07
SELECT AVG(a.Admitted) AS arate
FROM a, d
WHERE a.id = d.id
GROUP BY d.Nat = 'US'
id Masters Area Admitted?
1 Y AI/ML Y
2 Y HCI N
3 Y Systems Y
4 N AI/ML N
Quickly enumerate predicates that cause disagreements
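A brute-force sketch of enumerating predicates that disagree with the aggregate: for each candidate attribute, split the rows by value and flag any (attribute, value) predicate where the US-minus-Intl gap has the opposite sign to the overall gap. The helper names, the synthetic rows, and the exhaustive scan are assumptions for illustration; the talk's point is doing this quickly, which this sketch does not attempt.

```python
from collections import defaultdict


def rate_gap(rows):
    """US admission rate minus Intl admission rate; None if a group is empty."""
    stats = {"US": [0, 0], "Intl": [0, 0]}  # group -> [admitted, applicants]
    for r in rows:
        stats[r["group"]][0] += r["admitted"]
        stats[r["group"]][1] += 1
    if not stats["US"][1] or not stats["Intl"][1]:
        return None
    return stats["US"][0] / stats["US"][1] - stats["Intl"][0] / stats["Intl"][1]


def disagreeing_predicates(rows, attributes):
    """Flag (attribute, value) predicates whose gap flips sign vs. the overall gap."""
    overall = rate_gap(rows)
    if overall is None:
        return []
    flagged = []
    for attr in attributes:
        strata = defaultdict(list)
        for r in rows:
            strata[r[attr]].append(r)
        for value in sorted(strata):
            gap = rate_gap(strata[value])
            if gap is not None and gap * overall < 0:
                flagged.append((attr, value))
    return flagged


def make_rows(area, group, admitted, total):
    """Hypothetical applicants: the first `admitted` rows are admitted."""
    return [
        {"area": area, "group": group, "admitted": int(i < admitted),
         "masters": "Y" if i % 2 == 0 else "N"}
        for i in range(total)
    ]


rows = (
    make_rows("AI/ML", "US", 1, 10) + make_rows("AI/ML", "Intl", 11, 90)
    + make_rows("HCI", "US", 27, 90) + make_rows("HCI", "Intl", 4, 10)
)
flagged = disagreeing_predicates(rows, ["masters", "area"])
```

In this synthetic data, splitting by `masters` agrees with the aggregate, while both `area` values are flagged, pointing at area as the variable to control for.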
-
[Figure: US vs. Not US, comparison over pairs (u, i)]
-
US Not US
US.Area = Intl.Area
-
Automatically Synthesize Explanations
US Admission Rate: 0.16 Intl Admission Rate: 0.07
SELECT AVG(a.Admitted) AS arate
FROM a, d
WHERE a.id = d.id
GROUP BY d.Nat = 'US'
id Masters Area Admitted?
1 Y AI/ML Y
2 Y HCI N
3 Y Systems Y
4 N AI/ML N
Fit
Check
Remove
-
[Figure: runtime in seconds (log scale, 1 to 100000) for queries Q10a, Q16b, Q21c, Q31a; series: Naïve, Opt, Opt+ML]
1 GB Dataset, Data Science Benchmark
-
US Admission Rate: 0.16 Intl Admission Rate: 0.07
analyze_this(data)
The Case For “Active” Databases
Auto-tune for performance
Pre-compute quantities that might be useful
Identify bugs in analysis