(Artificially) Intelligent Systems for Data Science · 2019. 3. 8.

  • (Artificially) Intelligent Systems for Data Science

    Sanjay Krishnan Computer Science, University of Chicago

  • Research Question: Are Admissions Fair to International Students?

    id  Masters  Area     Admitted?
    1   Y        AI/ML    Y
    2   Y        HCI      N
    3   Y        Systems  Y
    4   N        AI/ML    N

    id  Gender  Nat  Ugrad
    1   M       US   MIT
    2   F       CN   Tsinghua
    3   M       US   UCB
    4   F       UK   Imperial

    Area     School  Rate
    AI/ML    MIT     0.1
    HCI      UCB     0.25
    Systems  CMU     0.26
    AI/ML    U Chi   0.17

  • “Data” System

    US Admission Rate: 0.16

    Intl Admission Rate: 0.07

    analyze_this(data)

  • Today’s Databases Are “Passive”

    “Data” System

    US Admission Rate: 0.16

    Intl Admission Rate: 0.07

  • US Admission Rate: 0.16

    Intl Admission Rate: 0.07

    analyze_this(data)

    The Case For “Active” Databases

    Data Input: Interface With Data Collection and Hardware

    Analysis Input: Interface With Analyst

  • US Admission Rate: 0.16

    Intl Admission Rate: 0.07

    analyze_this(data)

    The Case For “Active” Databases

    Auto-tune for performance

    Pre-compute quantities that might be useful

    Identify bugs in analysis

  • System Controller

    State x

    Control u

    Approximate Sequential Decision Making

    Empirical solutions to black-box control

    Predict Best Action

    Cost(x,u)

  • Find Admission Rate, Sort By Biggest Difference With Historical Rate

    SELECT AVG(a.Admitted) AS arate, AVG(a.Admitted - h.Rate) AS diff
    FROM a, d, h
    WHERE a.id = d.id AND a.Area = h.Area AND h.School = 'U Chi'
    GROUP BY a.Area, d.Nat = 'US'
    ORDER BY diff

    Option 1. Scan admissions first and then look up demographics and historical.

    Option 2. Build a lookup table on historical and admissions and then scan demographics.

    Option 3. Sort historical by area/school first and then combine the information.

    …

    Database Self-Optimization

    Find the best execution path for a SQL query.

    id Masters Area Admitted?

    1 Y AI/ML Y

    2 Y HCI N

    3 Y Systems Y

    4 N AI/ML N

    id Gender Nat Ugrad

    1 M US MIT

    2 F CN Tsinghua

    3 M US UCB

    4 F UK Imperial

    Area School Rate

    AI/ML MIT 0.1

    HCI UCB 0.25

    Systems CMU 0.26

    AI/ML U Chi 0.17

  • Today’s Query Optimization

    Plan

    agg(Join( {a, d, h} ))

    10*rows_created + 0.3*mem_used + min(network, memory)

    SELECT AVG(a.Admitted) AS arate, AVG(a.Admitted - h.Rate) AS diff
    FROM a, d, h
    WHERE a.id = d.id AND a.Area = h.Area AND h.School = 'U Chi'
    GROUP BY a.Area, d.Nat = 'US'
    ORDER BY diff

    Cost Model

    Option 1    Option 2
    1200        120

  • Actual Runtime

    Let the system experiment!

    p1, p1’

    The system strategically runs experiments where it believes the cost estimate is inaccurate.

  • Optimal Plans Are Composed of Optimal Sub-plans

    Join( {a, d, h} )

    Best( {a} )

    Best( {d} )

    Best( {h} )

    Best( {a, d} )

    Best( {d, h} )

    Best( {a, h} )

    Best( {a, d, h} )
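The sub-plan structure above can be sketched as a small dynamic program over subsets of relations. This is a minimal illustration, not the optimizer described in the talk: the row counts, the selectivities, and the cost (sum of estimated intermediate result sizes) are hypothetical assumptions, and `best_plans` and `join_rows` are illustrative names.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical statistics (assumptions, not from the slides).
ROWS = {"a": 1000, "d": 1000, "h": 4}
SELECTIVITY = {
    frozenset("ad"): Fraction(1, 1000),  # a.id = d.id
    frozenset("ah"): Fraction(1, 20),    # a.Area = h.Area
}  # d and h share no predicate, so joining them is a cross product


def join_rows(left, right, rows):
    """Estimated output size of joining two disjoint sets of relations."""
    sel = Fraction(1)
    for l in left:
        for r in right:
            sel *= SELECTIVITY.get(frozenset((l, r)), 1)
    return rows[left] * rows[right] * sel


def best_plans(relations):
    """Map each subset of relations to (cost, plan) of its cheapest join tree."""
    rows = {frozenset([r]): ROWS[r] for r in relations}
    best = {frozenset([r]): (0, r) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, size)):
            # Try every split of the subset into two smaller, already-solved parts.
            for k in range(1, size):
                for left in map(frozenset, combinations(sorted(subset), k)):
                    right = subset - left
                    out = join_rows(left, right, rows)
                    rows[subset] = out  # same estimate for every split
                    cost = best[left][0] + best[right][0] + out
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, (best[left][1], best[right][1]))
    return best


cost, plan = best_plans(["a", "d", "h"])[frozenset("adh")]
print(cost, plan)
```

Under these made-up statistics the cheapest tree joins a with h first and then brings in d, because Best( {a, h} ) is much smaller than Best( {a, d} ) or the d-h cross product.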

  • Reduces Planning To Statistical Estimation

    “Predict how valuable it is to make a certain subplan.”

    approx_q(s,a)

    Subprograms Sequences

    Estimate Long Term Value

    Fn Approximator

    (AD, H(AD), 5.14) (AH, (AH)D, 10.96) … (DH, (DH)A, 1.45)
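A rough sketch of how (state, action, value) tuples like those above could feed an estimate of q(s, a). A table of running means stands in here for the function approximator, which is a deliberate simplification; the class name `TabularQ` and the sample tuples are hypothetical, and lower values are treated as better (a cost), matching an arg-min policy.

```python
from collections import defaultdict


class TabularQ:
    """Running-mean estimate of q(state, action); a stand-in for a learned model."""

    def __init__(self):
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def update(self, state, action, observed_cost):
        """Record one completed experiment."""
        self.total[(state, action)] += observed_cost
        self.count[(state, action)] += 1

    def q(self, state, action):
        """Mean observed cost, or infinity for a pair that was never tried."""
        n = self.count[(state, action)]
        return self.total[(state, action)] / n if n else float("inf")

    def best_action(self, state, actions):
        """Greedy policy: pick the action with the lowest estimated cost."""
        return min(actions, key=lambda a: self.q(state, a))


# Hypothetical experience: from the partial plan "A", try joining D or H next.
approx_q = TabularQ()
for state, action, cost in [("A", "AD", 5.0), ("A", "AH", 11.0), ("A", "AD", 7.0)]:
    approx_q.update(state, action, cost)

print(approx_q.q("A", "AD"), approx_q.best_action("A", ["AD", "AH"]))
```

A real approximator would generalize to unseen (state, action) pairs from features; the table only remembers pairs it has already observed.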

  • Real Systems/Workloads

    Join( {E, S, T} )

    Best( {E} )

    Best( {S} )

    Best( {T} )

    Best( {E, S} )

    Best( {S, T} )

    Best( {E, T} )

    Best( {E, S, T} )

    e: 142

    e: 131

    e: 136 e: 216

    e: 341

    e: 340

    e: 540

    𝜋*(s) = arg min q(current, next)

    IMDB Workload   Planning Time        Execution Time
    Postgres SQL    Up to 32.4x Faster   Avg. 1.12x Faster
    Apache Spark    Up to 276x Faster    Avg. 3.02x Faster

  • SELECT AVG(a.Admitted) AS arate, AVG(a.Admitted - h.Rate) AS diff
    FROM a, d, h
    WHERE a.id = d.id AND a.Area = h.Area AND h.School = 'U Chi'
    GROUP BY a.Area, d.Nat = 'US'
    ORDER BY diff

    Opportunistic Materialization

    Preemptively compute important results

  • Opportunistic Materialization

    Preemptively compute important results

    Res 1

    Res 2

    Res 3

    Candidate Results

    Creation/Deletion System

    Decision at 10:59pm: Net Benefit 201

    Decision at 11:19pm: Net Benefit -100

    Decision at 11:27pm: Net Benefit -120
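One way to read the net-benefit decisions above is as a comparison between the time a stored result is expected to save and what it costs to create and keep. The sketch below is an assumption for illustration only: the formula, the parameter names, and the numbers are hypothetical and are not how the talk's system computes its net benefit.

```python
def net_benefit(seconds_saved_per_use, expected_uses, creation_seconds, storage_seconds):
    """Estimated time saved by reuse minus the cost of creating and keeping the result."""
    return seconds_saved_per_use * expected_uses - creation_seconds - storage_seconds


def decide(candidates):
    """Return the names of candidate results whose estimated net benefit is positive."""
    return [name for name, params in candidates.items() if net_benefit(**params) > 0]


# Hypothetical candidate results and cost estimates.
candidates = {
    "Res 1": dict(seconds_saved_per_use=30, expected_uses=4, creation_seconds=60, storage_seconds=10),
    "Res 2": dict(seconds_saved_per_use=30, expected_uses=1, creation_seconds=60, storage_seconds=10),
    "Res 3": dict(seconds_saved_per_use=5, expected_uses=10, creation_seconds=60, storage_seconds=10),
}

print(decide(candidates))
```

The hard part in practice is `expected_uses`, which depends on future queries; that is the quantity a learned value estimate would have to supply.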

  • Reduces Planning To Statistical Estimation

    “Predict how valuable it is to persist a result in the current state of the database.”

    approx_q(s,a)

    Completed Experiments

    Estimate Long Term Value

    Fn Approximator

    (Q0, db_state_0, +5.14) (Q1, db_state_1, -1.02) …

    (QN, db_state_N, +15.34)

  • Simple Caching

    Complete Foresight

    US

    10x overall runtime reduction

  • What does this result tell us about the dataset?

    US Admission Rate: 0.16

    Intl Admission Rate: 0.07

    id  Masters  Area     Admitted?
    1   Y        AI/ML    Y
    2   Y        HCI      N
    3   Y        Systems  Y
    4   N        AI/ML    N

    Actually… not much

    US Admissions > Intl Admission in Each Area

    US Admissions >> Intl Admission in a few Areas

    US Admissions < Intl Admission in Each Area, but Intl is over-represented in selective areas
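The third scenario above is an instance of Simpson's paradox, and a small worked example makes the arithmetic concrete. The applicant counts below are hypothetical, chosen only so that international applicants have the higher admission rate within each area while US applicants have the higher rate overall.

```python
# (area, nationality) -> (applicants, admitted); hypothetical counts.
counts = {
    ("AI/ML", "US"): (20, 1),     # selective area, few US applicants
    ("AI/ML", "Intl"): (80, 5),   # selective area, many Intl applicants
    ("HCI", "US"): (80, 24),
    ("HCI", "Intl"): (20, 7),
}


def rate(nat, area=None):
    """Admission rate for a nationality, optionally restricted to one area."""
    applicants = admitted = 0
    for (a, n), (applied, accepted) in counts.items():
        if n == nat and (area is None or a == area):
            applicants += applied
            admitted += accepted
    return admitted / applicants


for area in ("AI/ML", "HCI"):
    print(area, rate("US", area), rate("Intl", area))
print("overall", rate("US"), rate("Intl"))
```

Within AI/ML the rates are 1/20 versus 5/80, and within HCI 24/80 versus 7/20, so Intl is ahead in both. Overall the rates are 25/100 versus 12/100, so US is ahead, purely because most Intl applications went to the more selective area.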

  • Automatically Synthesize Explanations

    US Admission Rate: 0.16

    Intl Admission Rate: 0.07

    SELECT AVG(a.Admitted) AS arate
    FROM a, d
    WHERE a.id = d.id
    GROUP BY d.Nat = 'US'

    id Masters Area Admitted?

    1 Y AI/ML Y

    2 Y HCI N

    3 Y Systems Y

    4 N AI/ML N

    US Admissions > Intl Admission in Each Area

    US Admissions >> Intl Admission in a few Areas

    US Admissions < Intl Admission in Each Area, but Intl is over-represented in selective areas

    Automatic Search For Confounding Variables

  • [Figure: P[admit], US vs. Not US; second panel: US vs. Not US with US.Area = Intl.Area]

    Want analyses to be homogeneous

  • Automatically Synthesize Explanations

    US Admission Rate: 0.16

    Intl Admission Rate: 0.07

    SELECT AVG(a.Admitted) AS arate
    FROM a, d
    WHERE a.id = d.id
    GROUP BY d.Nat = 'US'

    id Masters Area Admitted?

    1 Y AI/ML Y

    2 Y HCI N

    3 Y Systems Y

    4 N AI/ML N

    Quickly enumerate predicates that cause disagreements
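A brute-force sketch of what enumerating such predicates could look like: for each candidate column and value, recompute the US vs. Intl gap on the matching rows and report the predicates where its sign flips relative to the whole dataset. The data, the column names, and the function `disagreeing_predicates` are hypothetical, and this exhaustive scan is a stand-in rather than the search procedure used in the talk.

```python
# Hypothetical counts: (area, masters, nat) -> (applicants, admitted).
CELLS = {}
for masters in ("Y", "N"):
    CELLS[("AI/ML", masters, "US")] = (20, 1)
    CELLS[("AI/ML", masters, "Intl")] = (80, 5)
    CELLS[("HCI", masters, "US")] = (80, 24)
    CELLS[("HCI", masters, "Intl")] = (20, 7)

# Expand the counts into one record per applicant.
ROWS = [
    {"area": area, "masters": masters, "nat": nat, "admitted": i < admitted}
    for (area, masters, nat), (applicants, admitted) in CELLS.items()
    for i in range(applicants)
]


def gap(rows):
    """US admission rate minus Intl admission rate on the given rows."""
    rates = {}
    for nat in ("US", "Intl"):
        group = [r["admitted"] for r in rows if r["nat"] == nat]
        rates[nat] = sum(group) / len(group)
    return rates["US"] - rates["Intl"]


def disagreeing_predicates(rows, columns):
    """List (column, value) predicates whose gap has the opposite sign to the overall gap."""
    overall = gap(rows)
    found = []
    for col in columns:
        for value in sorted({r[col] for r in rows}):
            subset = [r for r in rows if r[col] == value]
            # Skip subsets that lack one of the two groups.
            if {"US", "Intl"} <= {r["nat"] for r in subset} and gap(subset) * overall < 0:
                found.append((col, value))
    return found


print(disagreeing_predicates(ROWS, ["area", "masters"]))
```

With this data, conditioning on either area reverses the comparison while conditioning on masters does not, which flags area as the candidate confounder.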

  • [Figure: US vs. Not US comparison]

  • [Figure: US vs. Not US, with US.Area = Intl.Area]

  • Automatically Synthesize Explanations

    US Admission Rate: 0.16

    Intl Admission Rate: 0.07

    SELECT AVG(a.Admitted) AS arate
    FROM a, d
    WHERE a.id = d.id
    GROUP BY d.Nat = 'US'

    id Masters Area Admitted?

    1 Y AI/ML Y

    2 Y HCI N

    3 Y Systems Y

    4 N AI/ML N

    Fit

    Check

    Remove

  • [Figure: runtime in seconds (log scale, 1 to 100000) for queries Q10a, Q16b, Q21c, Q31a; 1 GB Dataset, Data Science Benchmark; series: Naïve, Opt, Opt+ML]
