
Page 1: DBMS + ML

Julian Oks, Josh Sennett
Jan. 29, 2020

Page 2:

Context + Problem Statement

Page 3:

Context: DBMS + ML

DBMS / RDBMS:

- Prevalent in all industries

- Efficient

- Highly reliable & available

- Consistent & Transactional

- Provide concurrency

- Declarative API (SQL)

- Support for versioning, auditing, encryption

Page 4:

Context: DBMS + ML

Machine Learning:

- Hardware is expensive and resource usage is non-uniform, but cloud computing makes it affordable
- Data science expertise is expensive too, but ML services and tools aim to make it accessible

- Becoming more mainstream; no longer exclusive to “unicorn” ML applications

- Growing focus on fairness, security, privacy, and auditability

“Typical applications of ML are built by smaller, less experienced teams, yet they have more stringent demands.”

Page 5:

Context: DBMS + ML

Challenges faced in employing machine learning:

- Many ML frameworks
- Many ML algorithms
- Heterogeneous hardware + runtime environments
- Complex APIs, typically requiring data science experts
- Complexity of model selection, training, deployment, and improvement
- Lack of security, auditability, versioning
- Majority of time is spent on data cleaning, preprocessing, and integration

Can DBMS + ML integration address these challenges?

Page 6:

Life without DBMS + ML integration

1. Join tables, export, create features, and do offline ML

Problem: slow, memory + runtime intensive, redundant work

2. Write super-complex SQL UDFs to implement ML models

Problem: large, complex models are hard and often impossible to implement in SQL. Writing models from scratch is expensive and does not take advantage of existing ML frameworks and algorithms

Page 7:

Recent trends in DBMS + ML

Page 8:

Recent trends in DBMS + ML

Page 9:

Problem Statement

Some Big Picture Questions:

- How do you make ML accessible to a typical database user?
- How do you provide flexibility to use the right ML model?
- How do you support different frameworks, cloud service providers, and hardware and runtime environments?
- Data is often split across tables; can we do prediction without needing to materialize joins across these tables?
- Can DBMS + ML efficiency match (or outperform) ML alone?

Tradeoff: simplicity vs. flexibility

Page 10:
Page 11:

Rafiki: Motivation

Building an ML application is complex, even when using cloud platforms and popular frameworks.

- Training:
  - Many models to choose from
  - Many (critical) hyperparameters to tune
- Inference:
  - Using a single model is faster, but less accurate than using a model ensemble
  - Need to select the right model(s) to trade off accuracy and latency


Page 12:

Rafiki: Approach

Training Service: automate model selection and hyperparameter tuning

- automated search to find the “best” model:
  - highest accuracy
  - lowest memory usage
  - lowest runtime

Inference Service: online ensemble modeling

- automated selection of the “best” model (or ensemble):
  - maximize accuracy
  - minimize excess latency (time exceeding SLO)


Page 14:

Rafiki: Training Service

Automate development and training of a new ML model:

- distributed model selection
- distributed hyper-parameter tuning

Page 15:

Rafiki: Training Service

Automate development and training of a new ML model:

- distributed model selection
- distributed hyper-parameter tuning

Rafiki parallelizes hyper-parameter tuning to reduce tuning time

Page 16:

Rafiki: Inference Service

Automate model selection and scheduling:

- Maximize accuracy
- Minimize latency exceeding SLO

These are typically competing objectives

Other optimizations (similar to Clipper):

- Parallel ensemble modeling
- Batch size selection (throughput vs latency)
- Parameter caching

Model ensembles improve accuracy, but are slower due to stragglers
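The batch-size tradeoff can be made concrete with a toy cost model. The constants and the linear latency model below are illustrative assumptions, not measurements from Rafiki:

```python
# Toy latency model for batch-size selection (illustrative constants,
# not Rafiki's measurements).

FIXED_OVERHEAD_MS = 10.0  # per-batch cost paid once (e.g., kernel launch)
PER_ITEM_MS = 2.0         # marginal cost per request in the batch

def batch_latency_ms(batch_size: int) -> float:
    """Latency every request in the batch experiences."""
    return FIXED_OVERHEAD_MS + PER_ITEM_MS * batch_size

def throughput_rps(batch_size: int) -> float:
    """Requests served per second at this batch size."""
    return batch_size / (batch_latency_ms(batch_size) / 1000.0)

# bigger batches amortize the fixed overhead: throughput rises...
assert throughput_rps(32) > throughput_rps(8) > throughput_rps(1)
# ...but each request waits for the whole batch: latency rises too
assert batch_latency_ms(32) > batch_latency_ms(8) > batch_latency_ms(1)
```

The scheduler must pick a batch size large enough to sustain throughput but small enough that batch latency stays under the SLO.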

Page 17:

Rafiki: Inference Service

How does Rafiki optimize model selection for its inference service?

- Pose as optimization problem
- Reinforcement Learning
- Objective is to maximize accuracy while minimizing excess latency (beyond SLO)
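The shape of that objective can be sketched as a scalar reward. The exact functional form and the weight `beta` below are our assumptions for illustration, not the paper's precise reward:

```python
# Sketch of the objective's shape: maximize accuracy, penalize only the
# latency that exceeds the SLO. `beta` and this linear form are assumptions.

def reward(accuracy: float, latency_ms: float, slo_ms: float, beta: float = 0.01) -> float:
    excess = max(0.0, latency_ms - slo_ms)  # only time past the SLO counts
    return accuracy - beta * excess

# below the SLO there is no latency penalty at all
assert reward(0.9, latency_ms=80, slo_ms=100) == 0.9
# past the SLO, reward decays linearly with the excess
assert abs(reward(0.9, latency_ms=150, slo_ms=100) - 0.4) < 1e-9
```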

Page 18:

Rafiki: How does this integrate with DBMS?

- REST API via SQL UDF

CREATE FUNCTION food_name(image_path) RETURNS text
AS
BEGIN
  ... -- CALL RAFIKI API -- ...
END;

SELECT food_name(image_path) AS name, count(*)
FROM foodlog
WHERE age > 52
GROUP BY name;

- Python SDK

Page 19:

Interfaces

Web Interface
Python UDF & Web API

Page 20:

Strengths & Weaknesses: Rafiki

Strengths:

- Allows users to specify an SLO that’s used for model selection and inference
- Easy to use for a large class of tasks (e.g., regression, classification, NLP)
- Automates and optimizes complex decisions in ML design and deployment

Weaknesses:

- Not very general: you have to use Rafiki’s limited set of models, model selection + tuning algorithms, and model ensemble strategy
- Could be very expensive, since you have to train many models to find the best
- Rafiki has to compete with other offerings for automated model tuning and model selection services

Page 21:

Questions: Rafiki

Page 22:
Page 23:

Motivation

DBMSs have many advantages:

● High performance
● Mature and reliable
● High availability, transactional, concurrent access, ...
● Encryption, auditing, familiar & prevalent, ...

Store and serve models in the DBMS

Question: Can in-RDBMS scoring of ML models match (outperform?) the performance of dedicated frameworks?

Answer: Yes!

Page 24:

Raven

Supports in-DB model inference.

Key Features:

● Unified IR enables advanced cross-optimizations between ML and DB operators

● Takes advantage of ML runtimes integrated into Microsoft SQL Server

Page 25:

Background: ONNX (Open Neural Network Exchange)

Standardized ML model representation

Enables portable models, cross-platform inference

Integrated ONNX Runtime into SQL Server

Page 26:

Background: MLflow Model Pipelines

Model Pipeline contains:

● Trained Model
● Preprocessing Steps
● Dependencies

Page 27:

Background: MLflow Model Pipelines

Model Pipeline contains:

● Trained Model
● Preprocessing Steps
● Dependencies

Pipeline is stored in the RDBMS

A SQL query can invoke a Model Pipeline

Page 28:

Defining a Model Pipeline

Page 29:

Using a Model Pipeline

Page 30:

Raven’s IR

Uses both ML and relational operators

Page 31:

Raven’s IR

Operator categories:

● Relational Algebra
● Linear Algebra
● Other ML and data featurizers (e.g., scikit-learn operations)
● UDFs, for when the static analyzer is unable to map operators

Page 32:

Cross-Optimization

Page 33:

Cross-Optimization

DBMS + ML optimizations:

● Predicate-based model pruning: conditions are pushed into the decision tree, allowing subtrees to be pruned
● Model-projection pushdown: unused features are discarded in the query plan
● NN translation: transform operations/models into equivalent NNs to leverage the optimized runtime
● Model / query splitting: the model can be partitioned
● Model inlining: ML operators transformed to relational ones (e.g., small decision trees can be inlined)

Other standard DB and compiler optimizations (constant folding, join elimination, …)
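Predicate-based model pruning can be illustrated in a few lines. The toy tree encoding and the predicate `age > 52` (echoing the earlier UDF query) are our own sketch of the idea, not Raven's implementation:

```python
# Illustrative sketch of predicate-based model pruning (not Raven's code):
# a WHERE clause like `age > 52` constrains what a decision-tree split on
# `age` can observe, so whole subtrees become unreachable and can be dropped.
# Convention: a row goes left when value <= threshold, right otherwise.

def prune(node, known_min):
    """Prune a toy tree of {'feature','threshold','left','right'} dicts,
    given the query guarantees feature 'age' > known_min."""
    if not isinstance(node, dict):  # leaf: a prediction value
        return node
    if node["feature"] == "age" and node["threshold"] <= known_min:
        # every qualifying row exceeds the threshold: drop the left subtree
        return prune(node["right"], known_min)
    return {**node,
            "left": prune(node["left"], known_min),
            "right": prune(node["right"], known_min)}

tree = {"feature": "age", "threshold": 40,
        "left": "low_risk",  # unreachable once we know age > 52
        "right": {"feature": "bmi", "threshold": 25,
                  "left": "low_risk", "right": "high_risk"}}

pruned = prune(tree, known_min=52)
assert pruned["feature"] == "bmi"  # the age split disappeared entirely
```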

Page 34:

Runtime Code Generation

A new SQL query is generated from the optimized IR, and invoked on the integrated SQL Server+ONNX Runtime engine.
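As a hedged illustration of this step, a tiny decision tree can be inlined into the generated SQL as a CASE expression and executed entirely inside the database. Here sqlite3 stands in for SQL Server, and the table, data, and model are invented for the example:

```python
# Toy illustration of model inlining + code generation: a two-leaf decision
# tree compiled to a SQL CASE expression, run fully inside the database.
# sqlite3 stands in for SQL Server; everything here is made up.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (age REAL, bmi REAL)")
conn.executemany("INSERT INTO patients VALUES (?, ?)",
                 [(30, 22.0), (60, 28.5), (70, 24.0)])

# the "model": if bmi <= 25 predict 'low_risk', else 'high_risk'
generated_sql = """
SELECT age,
       CASE WHEN bmi <= 25 THEN 'low_risk' ELSE 'high_risk' END AS pred
FROM patients
"""
rows = conn.execute(generated_sql).fetchall()
assert rows == [(30.0, "low_risk"), (60.0, "high_risk"), (70.0, "low_risk")]
```

Once the model is plain SQL, the database optimizer can apply its usual machinery (predicate pushdown, projection pruning) to the combined query.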

Page 35:

Raven: Full Pipeline

Page 36:

Query Execution

The generated SQL query is executed in one of three modes:

1. In-process execution (Raven): uses the integrated PREDICT statement
2. Out-of-process execution (Raven Ext): unsupported pipelines are executed as an external script
3. Containerized execution: if unable to run Raven Ext., run within Docker

Page 37:

Results

Effects of some optimizations:

Page 38:

Results: does Raven outperform ONNX?

Page 39:

Strengths & Weaknesses: Raven

Strengths:

- Brings the advantages of a DBMS to machine learning models
- Raven’s cross-optimizations and ONNX integration make inference faster
- Very generalized: natively supports many ML frameworks, runtimes, and specialized hardware, and provides containerized execution for all others

Weaknesses:

- Only compatible with SQL Server
- Limited to inference; it does not facilitate training
- Limits to static analysis (e.g., analysis of for loops & conditionals)

Page 40:

Questions: Raven

Page 41:
Page 42:

Motivation: a typical ML pipeline

Collect Data → Materialize Joins → ML / LA (Linear Algebra) Operations

Page 43:

Morpheus: Factorized ML

Idea: avoid materializing joins by pushing ML computations through joins.

Any linear algebra computation over the join output can be factorized in terms of the base tables.

Morpheus (2017): Factor operations over a “normalized matrix” using a framework of rewrite rules.


Page 44:

The Normalized Matrix

A logical data type that represents joined tables.

Consider a PK-FK join between two tables S, R.

The normalized matrix is the triple (S, K, R), where K is an indicator matrix.

The output of the join, T, is then T = [S, KR] (column-wise concatenation).

Key:
- S: left table
- R: right table
- K: indicator matrix (0/1s)
- T: join output
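A minimal numpy sketch of these definitions (the tables and keys are made up for illustration):

```python
# numpy sketch of the normalized matrix (S, K, R); values are invented.

import numpy as np

S = np.array([[1., 2.],    # left (fact) table: one row per tuple
              [3., 4.],
              [5., 6.]])
R = np.array([[10., 20.],  # right (dimension) table: one row per key
              [30., 40.]])
fk = np.array([0, 1, 0])   # foreign key of each S-row into R

K = np.zeros((len(S), len(R)))  # 0/1 indicator matrix
K[np.arange(len(S)), fk] = 1.0

# the materialized join is the column-wise concatenation T = [S, KR]
T = np.hstack([S, K @ R])
assert T.shape == (3, 4)
assert np.allclose(T[2], [5., 6., 10., 20.])  # row 2 joins R's row 0
```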

Page 45:

Rewrite Rules

● Element-wise operators: f(T) → (f(S), K, f(R))
● Aggregators:
  ○ rowSum(T) → rowSum(S) + K rowSum(R)
  ○ colSum(T) → [colSum(S), colSum(K) R]
  ○ sum(T) → sum(S) + colSum(K) rowSum(R)
● Left Matrix Multiplication: TX → S X[1 : dS, ] + K (R X[dS + 1 : d, ])
● Matrix Inversion:

Key:
- S: left table
- R: right table
- K: indicator matrix (0/1s)
- T: join output
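These identities are easy to check numerically. The following sketch verifies the aggregation and left-multiplication rules on a small made-up example (a sanity check, not a proof):

```python
# Numerically checking the aggregation and left matrix multiplication
# rewrite rules on invented data.

import numpy as np

S = np.arange(1., 7.).reshape(3, 2)            # left table, d_S = 2 features
R = np.array([[10., 20., 30.],
              [40., 50., 60.]])                # right table, 3 features
K = np.array([[1., 0.], [0., 1.], [1., 0.]])   # indicator matrix
T = np.hstack([S, K @ R])                      # join output, d = 5 features

# rowSum(T) = rowSum(S) + K rowSum(R)
assert np.allclose(T.sum(axis=1), S.sum(axis=1) + K @ R.sum(axis=1))
# colSum(T) = [colSum(S), colSum(K) R]
assert np.allclose(T.sum(axis=0),
                   np.hstack([S.sum(axis=0), K.sum(axis=0) @ R]))
# sum(T) = sum(S) + colSum(K) rowSum(R)
assert np.isclose(T.sum(), S.sum() + K.sum(axis=0) @ R.sum(axis=1))
# TX = S X[1:dS, ] + K (R X[dS+1:d, ])  -- factorized left matrix multiply
X = np.random.default_rng(0).standard_normal((5, 2))
dS = S.shape[1]
assert np.allclose(T @ X, S @ X[:dS] + K @ (R @ X[dS:]))
```

The payoff: the factorized right-hand sides never materialize T, which matters when K is tall (many fact rows) and R is wide.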

Page 46:

Rewrite Rules Applied to Logistic Regression

Key:
- S: left table
- R: right table
- K: indicator matrix (0/1s)
- T: join output

Page 47:

Performance

Key:
- F: Factorized
- M: Materialized
- FR: feature ratio
- TR: tuple ratio

Page 48:

Performance

Key:
- Domain Size: # of unique values

Page 49:

MorpheusFI: Quadratic Feature Interactions for Morpheus

Limitation: Factorized Linear Algebra is restricted to linearity over feature vectors

Page 50:

MorpheusFI

Add quadratic feature interactions into factorized ML by adding two non-linear interaction operators:

● self-interaction within a matrix
● cross-interaction between matrices participating in a join
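The idea can be sketched in plain numpy: degree-2 interaction features of T = [S, KR] decompose into self-interactions within each base matrix plus cross-interactions between them, so they can be computed without materializing T. This decomposition is our illustration, not MorpheusFI's API:

```python
# Degree-2 interactions of the join output decompose into self- and
# cross-interaction blocks over the base matrices (illustrative sketch).

import numpy as np

def self_interactions(M):
    """All pairwise products of a matrix's own columns (i <= j)."""
    i, j = np.triu_indices(M.shape[1])
    return M[:, i] * M[:, j]

def cross_interactions(A, B):
    """Products of every column of A with every column of B."""
    return (A[:, :, None] * B[:, None, :]).reshape(len(A), -1)

S = np.array([[1., 2.], [3., 4.]])
R = np.array([[5.], [6.]])
K = np.eye(2)               # trivial indicator, for brevity
T = np.hstack([S, K @ R])

full = self_interactions(T)
parts = np.hstack([self_interactions(S),
                   cross_interactions(S, K @ R),
                   self_interactions(K @ R)])
# same interaction values, up to column ordering
assert np.allclose(np.sort(full, axis=1), np.sort(parts, axis=1))
```

The cross-interaction block is where the non-linearity crosses the join boundary, which is exactly what plain Morpheus could not factorize.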

Page 51:

A new abstraction: the Interacted Normalized Matrix, with the following relationships:

Page 52:

Formal proofs of algebraic rewrite rules

Page 53:

Rewrite rules are extremely complex

Page 54:

Results: Matrix multiplication

Key:
- LMM: Left matrix mult
- RMM: Right matrix mult
- p: # of joined tables
- q: # of sparse dimension tables

Page 55:

Results: time to convergence

Page 56:

Strengths & Weaknesses: Morpheus / MorpheusFI

Strengths:

- Automatically rewrite any LA computation over a join’s output as LA operations over the base tables
- Decouples factorization and execution; backend-agnostic
- In many cases, it can dramatically improve runtime
- MorpheusFI extends Morpheus to support quadratic feature interactions

Weaknesses:

- It cannot be generalized to support higher degree interactions
- At the time of the publication, only a simple heuristics-based approach to optimizing the execution plan
- Only supports ML models that can be expressed in linear algebraic terms

Page 57:

Questions: Morpheus / MorpheusFI

Page 58:

Discussion

Page 59:

Commonalities

Overall objectives:

- Empower database users to use ML from their DBMS
- Avoid the high cost of doing “offline” ML
- Aim for flexibility + generalizability
- Improve efficiency

Optimization via Translation:

- Raven: translate data and ML operations into a unified IR
- MorpheusFI: translate ML models into LA operations


Page 60:

Differences

Implementation:

- Raven: major modifications to DBMS engine
- Rafiki: cloud application, many interfaces
- MorpheusFI: lightweight Python library

Generalizability:

- MorpheusFI: works for any ML model built with LA operators with linear or quadratic feature interactions
- Raven: native support for many popular models and frameworks, and out-of-process execution for all others
- Rafiki: only supports a limited set of models + runtimes


Page 61:

Differences

Inference vs Training:

- Raven: training in the cloud, inference in the DBMS
- Rafiki: training & inference in the cloud, with a DBMS interface


Page 62:

Questions & Discussion

Page 63:

Questions & Discussion

● How will ML be affected by stricter data governance? How can DBs play a role?
● What challenges remain (technical or not) for applying ML in an enterprise setting?
● What challenges can DBs solve?
● What’s the role of DBs in ML?
● Reasons not to invest in DBMS + ML?