(artificially) intelligent systems for data science · 2019. 3. 8. · group by a.area, d.nat = us...

(Artificially) Intelligent Systems for Data Science

Sanjay Krishnan
Computer Science, University of Chicago

id  Masters  Area     Admitted?
1   Y        AI/ML    Y
2   Y        HCI      N
3   Y        Systems  Y
4   N        AI/ML    N

Research Question: Are Admissions Fair to International Students?

id  Gender  Nat  Ugrad
1   M       US   MIT
2   F       CN   Tsinghua
3   M       US   UCB
4   F       UK   Imperial

Area     School  Rate
AI/ML    MIT     0.1
HCI      UCB     0.25
Systems  CMU     0.26
AI/ML    U Chi   0.17

“Data” System

US Admission Rate: 0.16
Intl Admission Rate: 0.07

analyze_this(data)

Today’s Databases Are “Passive”

“Data” System



analyze_this(data)

The Case For “Active” Databases

Data Input: Interface With Data Collection and Hardware

Analysis Input: Interface With Analyst


analyze_this(data)


Auto-tune for performance
Pre-compute quantities that might be useful
Identify bugs in analysis

System Controller

State x

Control u

Approximate Sequential Decision Making
Empirical solutions to black-box control

Predict Best Action

Cost(x,u)





Find Admission Rate, Sort By Biggest Difference With Historical Rate

SELECT AVG(a.Admitted) AS arate, AVG(a.Admitted - h.Rate) AS diff
FROM a, d, h
WHERE a.id = d.id AND a.Area = h.Area AND h.School = 'U Chi'
GROUP BY a.Area, d.Nat = 'US'
ORDER BY diff
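A minimal sketch of running this query, assuming Python's built-in sqlite3 module and the toy tables from the slides. Storing Admitted as 1/0 is an assumption (the slides show Y/N and do not say how it is stored), and the query reads the historical rate from table h.

```python
import sqlite3

# Toy tables from the slides. Encoding Admitted as 1/0 is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, Masters TEXT, Area TEXT, Admitted INTEGER);
CREATE TABLE d (id INTEGER, Gender TEXT, Nat TEXT, Ugrad TEXT);
CREATE TABLE h (Area TEXT, School TEXT, Rate REAL);
INSERT INTO a VALUES (1,'Y','AI/ML',1),(2,'Y','HCI',0),
                     (3,'Y','Systems',1),(4,'N','AI/ML',0);
INSERT INTO d VALUES (1,'M','US','MIT'),(2,'F','CN','Tsinghua'),
                     (3,'M','US','UCB'),(4,'F','UK','Imperial');
INSERT INTO h VALUES ('AI/ML','MIT',0.1),('HCI','UCB',0.25),
                     ('Systems','CMU',0.26),('AI/ML','U Chi',0.17);
""")

# Admission rate per (area, US / not US), sorted by the gap to the historical rate.
rows = conn.execute("""
SELECT a.Area, d.Nat = 'US' AS is_us,
       AVG(a.Admitted) AS arate,
       AVG(a.Admitted - h.Rate) AS diff
FROM a, d, h
WHERE a.id = d.id AND a.Area = h.Area AND h.School = 'U Chi'
GROUP BY a.Area, d.Nat = 'US'
ORDER BY diff
""").fetchall()

for area, is_us, arate, diff in rows:
    print(area, "US" if is_us else "Intl", arate, round(diff, 2))
```

On four rows the numbers are not meaningful; the point is only that the same result can be produced by several different join orders, which is what the options below enumerate.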

Option 1. Scan admissions first, then look up demographics and historical.
Option 2. Build a lookup table on historical and admissions, then scan demographics.
Option 3. Sort historical by area/school first, then combine the information.
…

Database Self-Optimization
Find the best execution path for a SQL query.

id  Masters  Area     Admitted?
1   Y        AI/ML    Y
2   Y        HCI      N
3   Y        Systems  Y
4   N        AI/ML    N

id  Gender  Nat  Ugrad
1   M       US   MIT
2   F       CN   Tsinghua
3   M       US   UCB
4   F       UK   Imperial

Area     School  Rate
AI/ML    MIT     0.1
HCI      UCB     0.25
Systems  CMU     0.26
AI/ML    U Chi   0.17

Today's Query Optimization

Plan

agg(Join( {a, d, h} ))

10*rows_created + 0.3*mem_used + min(network, memory)
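As a rough illustration, the hand-written cost formula above can be expressed as a function. The parameter names mirror the slide; their units are not given in the source, and the example inputs are hypothetical.

```python
def plan_cost(rows_created: float, mem_used: float,
              network: float, memory: float) -> float:
    """Fixed-weight cost formula as written on the slide."""
    return 10 * rows_created + 0.3 * mem_used + min(network, memory)


# Hypothetical inputs, for illustration only.
example = plan_cost(rows_created=100, mem_used=500, network=40, memory=60)
print(example)
```

The weights are constants chosen ahead of time, so the estimate cannot adapt when the real workload behaves differently from what the formula assumes.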



[Figure: Cost Model estimates (1200 for Option 1, 120 for Option 2) compared with Actual Runtime]

Let the system experiment!

p1, p1’

The system strategically runs experiments where it believes the cost estimate is inaccurate.

Optimal Plans Are Composed of Optimal Sub-plans

Join( {a, d, h} )

Best( {a} )

Best( {d} )

Best( {h} )

Best( {a, d} )

Best( {d, h} )

Best( {a, h} )

Best( {a, d, h} )
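The sub-plan structure above can be sketched as a small dynamic program over subsets of relations. This is a generic textbook-style sketch, not the system's implementation: the cost of a plan is assumed to be the total number of rows its joins produce, and `card` is a hypothetical cardinality estimator supplied by the caller.

```python
from itertools import combinations


def best_plan(relations, card):
    """Return (cost, plan) for joining all relations.

    card(subset) -> estimated row count of joining `subset` (hypothetical).
    A plan's cost is the sum of the row counts of every join it performs.
    """
    # Best({a}), Best({d}), Best({h}): scanning a base table costs nothing here.
    best = {frozenset([r]): (0.0, r) for r in relations}
    for size in range(2, len(relations) + 1):
        for combo in combinations(relations, size):
            subset = frozenset(combo)
            candidates = []
            # Split the subset in two and reuse the best plan for each side.
            for k in range(1, size):
                for left in combinations(combo, k):
                    left_set = frozenset(left)
                    right_set = subset - left_set
                    left_cost, left_plan = best[left_set]
                    right_cost, right_plan = best[right_set]
                    cost = left_cost + right_cost + card(subset)
                    candidates.append((cost, (left_plan, right_plan)))
            best[subset] = min(candidates, key=lambda c: c[0])
    return best[frozenset(relations)]


# Hypothetical cardinality estimates for the three tables a, d, h.
sizes = {frozenset("ad"): 4, frozenset("ah"): 6,
         frozenset("dh"): 16, frozenset("adh"): 2}
cost, plan = best_plan(["a", "d", "h"], lambda s: sizes[s])
print(cost, plan)
```

Because Best({a, d, h}) is built only from the stored Best(...) entries of smaller subsets, each sub-plan is costed once and reused.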

Reduces Planning To Statistical Estimation

“Predict how valuable it is to make a certain subplan.”

approx_q(s,a)

Subprograms Sequences
Estimate Long Term Value
Fn Approximator

(AD, H(AD), 5.14) (AH, (AH)D, 10.96) … (DH, (DH)A, 1.45)

Real Systems/Workloads

Join( {E, S, T} )

Best( {E} )

Best( {S} )

Best( {T} )

Best( {E, S} )

Best( {S, T} )

Best( {E, T} )

Best( {E, S, T} )

e: 142

e: 131

e: 136 e: 216

e: 341

e: 340

e: 540

𝜋*(s) = arg min q(current, next)
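A minimal sketch of following this policy: at each step, pick the next relation whose estimated long-term cost q(current, next) is lowest. The q-table below is hypothetical and stands in for a learned function approximator; the state encoding (the set of relations joined so far) is also an assumption.

```python
def greedy_plan(relations, q):
    """Repeatedly apply pi*(s) = argmin over next of q(s, next)."""
    state = ()  # relations joined so far, in join order
    while len(state) < len(relations):
        remaining = [r for r in relations if r not in state]
        nxt = min(remaining, key=lambda r: q(state, r))
        state = state + (nxt,)
    return state


# Hypothetical q estimates keyed by (set of joined relations, next relation).
q_table = {
    (frozenset(), "A"): 3.0,
    (frozenset(), "D"): 2.0,
    (frozenset(), "H"): 4.0,
    (frozenset("D"), "A"): 5.0,
    (frozenset("D"), "H"): 1.5,
    (frozenset("DH"), "A"): 1.45,
}


def q(state, nxt):
    # Unknown pairs are treated as infinitely expensive (an assumption).
    return q_table.get((frozenset(state), nxt), float("inf"))


order = greedy_plan(["A", "D", "H"], q)
print(order)
```

Planning then costs one q evaluation per candidate at each step, rather than an enumeration of all sub-plans.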

IMDB Workload   Planning Time        Execution Time
PostgreSQL      Up to 32.4x Faster   Avg. 1.12x Faster
Apache Spark    Up to 276x Faster    Avg. 3.02x Faster







Opportunistic Materialization
Preemptively compute important results

Res 1

Res 2

Res 3

Candidate Results

Creation/Deletion System

Decision at 10:59pm: Net Benefit 201
Decision at 11:19pm: Net Benefit -100
Decision at 11:27pm: Net Benefit -120

Reduces Planning To Statistical Estimation

“Predict how valuable it is to persist a result in the current state of the database.”

approx_q(s,a)

Completed Experiments
Estimate Long Term Value
Fn Approximator

(Q0, db_state_0, +5.14) (Q1, db_state_1, -1.02) …

(QN, db_state_N, +15.34)
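One way to picture the loop above: record the net benefit observed for each (result, database state) pair, estimate approx_q from those records, and persist a result only when the estimate is positive. In this sketch a plain per-key average stands in for the function approximator, and the class and method names are hypothetical.

```python
from collections import defaultdict


class MaterializationAdvisor:
    """Tabular stand-in for approx_q(s, a); not the system's actual model."""

    def __init__(self):
        self._observed = defaultdict(list)

    def record(self, result_id, db_state, net_benefit):
        # One completed experiment: (result, database state, observed net benefit).
        self._observed[(result_id, db_state)].append(net_benefit)

    def approx_q(self, result_id, db_state):
        samples = self._observed.get((result_id, db_state))
        if not samples:
            return 0.0  # no evidence yet: assume no benefit
        return sum(samples) / len(samples)

    def should_persist(self, result_id, db_state):
        return self.approx_q(result_id, db_state) > 0


advisor = MaterializationAdvisor()
advisor.record("Q0", "db_state_0", 5.14)
advisor.record("Q1", "db_state_1", -1.02)
print(advisor.should_persist("Q0", "db_state_0"))
print(advisor.should_persist("Q1", "db_state_1"))
```

A lookup table cannot generalize to unseen states; a learned approximator would replace `approx_q` while the decision rule stays the same.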

Simple Caching
Complete Foresight
US

10x overall runtime reduction





What does this result tell us about the dataset?



Actually… not much

US Admissions > Intl Admissions in Each Area

US Admissions >> Intl Admissions in a few Areas

US Admissions < Intl Admissions in Each Area, but Intl is over-represented in selective areas
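The last scenario is an instance of Simpson's paradox. A small worked example with hypothetical applicant counts (not data from the talk) shows how the aggregate rates can point one way while every area points the other.

```python
# (area, group) -> (applicants, admitted). All numbers are hypothetical.
counts = {
    ("AI/ML", "US"): (10, 1),
    ("AI/ML", "Intl"): (90, 18),
    ("Systems", "US"): (90, 45),
    ("Systems", "Intl"): (10, 6),
}


def rate(pairs):
    """Admission rate over a list of (applicants, admitted) pairs."""
    applicants = sum(n for n, _ in pairs)
    admitted = sum(k for _, k in pairs)
    return admitted / applicants


groups = ("US", "Intl")
areas = ("AI/ML", "Systems")

# Aggregate rate per group, ignoring area.
overall = {g: rate([v for (_, grp), v in counts.items() if grp == g])
           for g in groups}
# Rate per group within each area.
per_area = {a: {g: rate([counts[(a, g)]]) for g in groups} for a in areas}

print(overall)
print(per_area)
```

Here the Intl rate is higher within each area, yet the aggregate US rate is higher, because most Intl applicants in this made-up data apply to the more selective area.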

Automatically Synthesize Explanations


SELECT AVG(a.Admitted) AS arate
FROM a, d
WHERE a.id = d.id
GROUP BY d.Nat = 'US'



US Admissions > Intl Admission in Each Area
US Admissions >> Intl Admission in a few Areas
US Admissions < Intl Admission in Each Area, but Intl is over-represented in selective areas

Automatic Search For Confounding Variables

[Figure: P[admit] for US vs. Not US, compared where US.Area = Intl.Area]

Want analyses to be homogeneous






Quickly enumerate predicates that cause disagreements

[Figure: US vs. Not US, under the predicate US.Area = Intl.Area]
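A brute-force sketch of this idea: for every candidate predicate attribute = value, recompute the US-minus-Intl admission gap on the matching rows and report the predicates whose gap has the opposite sign from the whole dataset. The rows and attribute names are hypothetical, and this naive scan is only an illustration, not the fast enumeration the slide refers to.

```python
def admit_rate(rows, group):
    """Admission rate for one group, or None if the group is absent."""
    flags = [r["admitted"] for r in rows if r["group"] == group]
    return sum(flags) / len(flags) if flags else None


def disagreeing_predicates(rows, attributes):
    """Predicates (attr, value) where the US-vs-Intl gap flips sign."""
    overall = admit_rate(rows, "US") - admit_rate(rows, "Intl")
    found = []
    for attr in attributes:
        for value in sorted({r[attr] for r in rows}):
            subset = [r for r in rows if r[attr] == value]
            us = admit_rate(subset, "US")
            intl = admit_rate(subset, "Intl")
            if us is None or intl is None:
                continue  # cannot compare when one group is missing
            if (us - intl) * overall < 0:
                found.append((attr, value))
    return found


def make_rows(area, group, applicants, admitted):
    # Hypothetical applicants: the first `admitted` rows are admits.
    return [{"area": area, "group": group, "masters": "Y",
             "admitted": 1 if i < admitted else 0}
            for i in range(applicants)]


rows = (make_rows("AI/ML", "US", 10, 1) + make_rows("AI/ML", "Intl", 90, 18)
        + make_rows("Systems", "US", 90, 45) + make_rows("Systems", "Intl", 10, 6))
flips = disagreeing_predicates(rows, ["area", "masters"])
print(flips)
```

An attribute whose values all flip the comparison, as area does in this made-up data, is a candidate confounding variable.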






Fit

Check

Remove

[Figure: runtime in Seconds (log scale, 1 to 100000) for queries Q10a, Q16b, Q21c, Q31a on a 1 GB Dataset, Data Science Benchmark; series: Naïve, Opt, Opt+ML]




