DBMS + ML
Josh Sennett, Julian Oks
Jan. 29, 2020
TRANSCRIPT
Context + Problem Statement
Context: DBMS + ML
DBMS / RDBMS
- Prevalent in all industries
- Efficient
- Highly reliable & available
- Consistent & Transactional
- Provide concurrency
- Declarative API (SQL)
- Support for versioning, auditing, encryption
Context: DBMS + ML
Machine Learning
- Hardware is expensive and resource usage is non-uniform, but cloud computing makes it affordable
- Data science expertise is expensive too, but ML services and tools aim to make it accessible
- Becoming more mainstream; no longer exclusive to “unicorn” ML applications
- Growing focus on fairness, security, privacy, and auditability
“Typical applications of ML are built by smaller, less experienced teams, yet they have more stringent demands.”
Context: DBMS + ML
Challenges faced in employing machine learning:
- Many ML frameworks
- Many ML algorithms
- Heterogeneous hardware + runtime environments
- Complex APIs, typically requiring data science experts
- Complexity of model selection, training, deployment, and improvement
- Lack of security, auditability, versioning
- Majority of time is spent on data cleaning, preprocessing, and integration
Can DBMS + ML integration address these challenges?
Life without DBMS + ML integration
1. Join tables, export, create features, and do offline ML
Problem: slow, memory + runtime intensive, redundant work
2. Write super-complex SQL UDFs to implement ML models
Problem: large, complex models are hard and often impossible to implement in SQL. Writing models from scratch is expensive and does not take advantage of existing ML frameworks and algorithms
Recent trends in DBMS + ML
Problem Statement
Some Big Picture Questions:
- How do you make ML accessible to a typical database user?
- How do you provide flexibility to use the right ML model?
- How do you support different frameworks, cloud service providers, and hardware and runtime environments?
- Data is often split across tables; can we do prediction without needing to materialize joins across these tables?
- Can DBMS + ML efficiency match (or outperform) ML alone?
Tradeoff: simplicity vs. flexibility
Rafiki: Motivation
Building an ML application is complex, even when using cloud platforms and popular frameworks.
- Training:
  - Many models to choose from
  - Many (critical) hyperparameters to tune
- Inference:
  - Using a single model is faster, but less accurate than using a model ensemble
  - Need to select the right model(s) to trade off accuracy and latency
Rafiki
Rafiki: Approach
Training Service: automate model selection and hyperparameter tuning
- automated search to find the “best” model:
  - highest accuracy
  - lowest memory usage
  - lowest runtime
Inference Service: online ensemble modeling
- automated selection of the “best” model (or ensemble):
  - maximize accuracy
  - minimize excess latency (time exceeding SLO)
Rafiki: Training Service
Automate development and training of a new ML model:
- distributed model selection
- distributed hyper-parameter tuning
Rafiki parallelizes hyper-parameter tuning to reduce tuning time
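As a toy illustration of parallelized tuning in the spirit of Rafiki's training service, the sketch below has a pool of workers evaluate random hyper-parameter configurations concurrently. The search space, synthetic scoring function, and worker count are illustrative assumptions, not Rafiki's actual tuning algorithm.

```python
# Toy parallel hyper-parameter search: workers score configurations
# concurrently; the config with the best score wins. All numbers are
# synthetic stand-ins for real training runs.
from concurrent.futures import ThreadPoolExecutor
import random

def evaluate(config):
    # Stand-in for "train a model, return validation accuracy";
    # this synthetic score peaks at lr=0.1, batch=64.
    lr, batch = config
    return 1.0 - abs(lr - 0.1) - abs(batch - 64) / 256

random.seed(0)
space = [(random.choice([0.001, 0.01, 0.1, 1.0]),
          random.choice([16, 32, 64, 128])) for _ in range(16)]

with ThreadPoolExecutor(max_workers=4) as pool:   # 4 "tuning workers"
    scores = list(pool.map(evaluate, space))

best_config = space[max(range(len(space)), key=scores.__getitem__)]
```

Distributing `evaluate` across machines instead of threads gives the same structure at cluster scale, which is why tuning time drops roughly with the number of workers.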
Rafiki: Inference Service
Automate model selection and scheduling:
- Maximize accuracy
- Minimize latency exceeding SLO
These are typically competing objectives
Other optimizations (similar to Clipper):
- Parallel ensemble modeling
- Batch size selection (throughput vs latency)
- Parameter caching
Model ensembles improve accuracy, but are slower due to stragglers
Rafiki: Inference Service
How does Rafiki optimize model selection for its inference service?
- Pose as optimization problem
- Reinforcement Learning
- Objective is to maximize accuracy while minimizing excess latency (beyond SLO)
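To make the objective concrete, here is a toy sketch that scores candidate ensembles with a simple linear penalty on latency beyond the SLO. The candidate models, numbers, and penalty form are illustrative assumptions; Rafiki's actual reward and RL formulation differ in detail.

```python
# Toy objective: maximize accuracy while penalizing only the latency
# that exceeds the SLO. Penalty form and all numbers are assumed.
def objective(accuracy, latency_ms, slo_ms, penalty_per_ms=0.01):
    excess = max(0.0, latency_ms - slo_ms)   # only time beyond the SLO counts
    return accuracy - penalty_per_ms * excess

# Hypothetical candidates: bigger ensembles are more accurate but slower.
candidates = [
    {"models": ("resnet",),                    "accuracy": 0.90, "latency_ms": 40},
    {"models": ("resnet", "vgg"),              "accuracy": 0.93, "latency_ms": 120},
    {"models": ("resnet", "vgg", "inception"), "accuracy": 0.94, "latency_ms": 260},
]

slo_ms = 150
best = max(candidates,
           key=lambda c: objective(c["accuracy"], c["latency_ms"], slo_ms))
# With a 150 ms SLO the two-model ensemble wins: the three-model
# ensemble's 110 ms of excess latency outweighs its 0.01 accuracy gain.
```

The two objectives compete exactly as the slide says: under a looser SLO the larger ensemble would win instead.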
Rafiki: How does this integrate with DBMS?
- REST API via SQL UDF
CREATE FUNCTION food_name(image_path) RETURNS text AS
BEGIN
    ... -- CALL RAFIKI API --
END;

SELECT food_name(image_path) AS name, count(*)
FROM foodlog
WHERE age > 52
GROUP BY name;
- Python SDK
Interfaces
Web Interface Python UDF & Web API
Strengths & Weaknesses: Rafiki
Strengths:
- Allows users to specify an SLO that’s used for model selection and inference
- Easy to use for a large class of tasks (e.g., regression, classification, NLP)
- Automates and optimizes complex decisions in ML design and deployment
Weaknesses:
- Not very general: you have to use Rafiki’s limited set of models, model selection + tuning algorithms, and model ensemble strategy
- Could be very expensive, since you have to train many models to find the best one
- Rafiki has to compete with other offerings for automated model tuning and model selection services
Questions: Rafiki
Motivation
DBMSs have many advantages:
● High Performance
● Mature and Reliable
● High availability, Transactional, Concurrent Access, ...
● Encryption, Auditing, Familiar & Prevalent, ...
Store and serve models in the DBMS
Question: Can in-RDBMS scoring of ML models match (outperform?) the performance of dedicated frameworks?
Answer: Yes!
Raven
Supports in-DB model inference
Key Features:
● Unified IR enables advanced cross-optimizations between ML and DB operators
● Takes advantage of ML runtimes integrated into Microsoft SQL Server
Background: ONNX (Open Neural Network Exchange)
Standardized ML model representation
Enables portable models, cross-platform inference
Integrated ONNX Runtime into SQL Server
Background: MLflow Model Pipelines
Model Pipeline contains:
● Trained Model
● Preprocessing Steps
● Dependencies
Pipeline is stored in the RDBMS
A SQL query can invoke a Model Pipeline
Defining a Model Pipeline
Using a Model Pipeline
Raven’s IR
Uses both ML and relational operators
Raven’s IR
Operator categories:
● Relational Algebra
● Linear Algebra
● Other ML and data featurizers (e.g., scikit-learn operations)
● UDFs, for when the static analyzer is unable to map operators
Cross-Optimization
Cross-Optimization
DBMS + ML optimizations:
● Predicate-based model pruning: conditions are pushed into the decision tree, allowing subtrees to be pruned
● Model-projection pushdown: unused features are discarded in the query plan
● NN translation: transform operations/models into equivalent NNs to leverage an optimized runtime
● Model / query splitting: the model can be partitioned
● Model inlining: ML operators transformed into relational ones (e.g., small decision trees can be inlined)

Other standard DB and compiler optimizations (constant folding, join elimination, ...)
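To illustrate two of these rewrites, the toy sketch below applies predicate-based pruning and model inlining to a hand-written decision tree. The tree, feature names, and helper functions are assumptions for illustration; Raven performs such rewrites on its unified IR, not on Python dicts.

```python
# Toy decision tree as nested dicts (features and thresholds assumed).
tree = {"feat": "age", "thresh": 50,
        "lo": {"leaf": 0},                         # age <= 50
        "hi": {"feat": "bmi", "thresh": 30,        # age > 50
               "lo": {"leaf": 0}, "hi": {"leaf": 1}}}

def prune(node, feat, lower_bound):
    """Predicate-based model pruning: if the query guarantees
    feat > lower_bound, branches requiring feat <= thresh are dead."""
    if "leaf" in node:
        return node
    lo = prune(node["lo"], feat, lower_bound)
    hi = prune(node["hi"], feat, lower_bound)
    if node["feat"] == feat and lower_bound >= node["thresh"]:
        return hi                                  # the <= branch is unreachable
    return {**node, "lo": lo, "hi": hi}

def tree_to_sql(node):
    """Model inlining: emit a small tree as a relational CASE expression."""
    if "leaf" in node:
        return str(node["leaf"])
    return (f"CASE WHEN {node['feat']} <= {node['thresh']} "
            f"THEN {tree_to_sql(node['lo'])} ELSE {tree_to_sql(node['hi'])} END")

pruned = prune(tree, "age", 52)    # e.g. the query has WHERE age > 52
sql = tree_to_sql(pruned)          # "CASE WHEN bmi <= 30 THEN 0 ELSE 1 END"
```

After pruning, the whole `age` split disappears and the remaining stump inlines into a single CASE expression the relational optimizer can handle like any other projection.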
Runtime Code Generation
A new SQL query is generated from the optimized IR, and invoked on the integrated SQL Server+ONNX Runtime engine.
Raven: Full Pipeline
Query Execution
The generated SQL query is executed in one of three ways:
1. In-process execution (Raven): uses the integrated PREDICT statement
2. Out-of-process execution (Raven Ext.): unsupported pipelines are executed as an external script
3. Containerized execution: if unable to run Raven Ext., run within Docker
Results
Effects of some optimizations:
Results: does Raven outperform ONNX?
Strengths & Weaknesses: Raven
Strengths:
- Brings the advantages of a DBMS to machine learning models
- Raven’s cross-optimizations and ONNX integration make inference faster
- Very general: natively supports many ML frameworks, runtimes, and specialized hardware, and provides containerized execution for all others

Weaknesses:
- Only compatible with SQL Server
- Limited to inference; it does not facilitate training
- Limits to static analysis (e.g., analysis of for loops & conditionals)
Questions: Raven
Motivation: a typical ML pipeline
Collect Data → Materialize Joins → ML / LA (Linear Algebra) Operations
Morpheus: Factorized ML
Idea: avoid materializing joins by pushing ML computations through joins
Any Linear Algebra computation over the join output can be factorized in terms of
the base tables
Morpheus (2017): Factor operations over a “normalized matrix” using a framework of rewrite rules.
Morpheus
The Normalized Matrix
A logical data type that represents joined tables
Consider a PK-FK join between two tables S, R
The normalized matrix is the triple (S, K, R)
Where K is an indicator matrix, with K[i, j] = 1 if row i of S joins with row j of R (and 0 otherwise)
The output of the join, T, is then T = [S, KR] (column-wise concatenation)
Key
S: left table
R: right table
K: indicator matrix (0/1s)
T: join output
Rewrite Rules
● Element-wise operators: f(T) → (f(S), K, f(R))
● Aggregators:
○ rowSum(T) → rowSum(S) + K rowSum(R)
○ colSum(T) → [colSum(S), colSum(K)R]
○ sum(T) → sum(S) + colSum(K) rowSum(R)
● Left Matrix Multiplication: TX → SX[1 : dS, ] + K(RX[dS + 1 : d, ])
● Matrix Inversion:
Key
S: left table
R: right table
K: indicator matrix (0/1s)
T: join output
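The rewrite rules can be sanity-checked numerically: on random base tables, every factorized expression below matches the same computation over the materialized join output T. The toy sizes and NumPy encoding are assumptions for illustration, not Morpheus's implementation.

```python
import numpy as np

# Build a PK-FK join as a normalized matrix (S, K, R) and check the
# rewrite rules against the materialized output T = [S, KR].
rng = np.random.default_rng(0)
nS, dS, nR, dR = 6, 3, 4, 2
S = rng.random((nS, dS))                      # left table features
R = rng.random((nR, dR))                      # right table features
fk = rng.integers(0, nR, size=nS)             # foreign keys from S into R
K = np.zeros((nS, nR))
K[np.arange(nS), fk] = 1.0                    # indicator matrix (0/1s)

T = np.hstack([S, K @ R])                     # materialized join output

X = rng.random((dS + dR, 5))
# rowSum(T) = rowSum(S) + K rowSum(R)
assert np.allclose(S.sum(axis=1) + K @ R.sum(axis=1), T.sum(axis=1))
# colSum(T) = [colSum(S), colSum(K) R]
assert np.allclose(np.concatenate([S.sum(axis=0), K.sum(axis=0) @ R]),
                   T.sum(axis=0))
# sum(T) = sum(S) + colSum(K) rowSum(R)
assert np.isclose(S.sum() + K.sum(axis=0) @ R.sum(axis=1), T.sum())
# TX = S X[1:dS, ] + K (R X[dS+1:d, ])
assert np.allclose(S @ X[:dS] + K @ (R @ X[dS:]), T @ X)
```

Note that the factorized right-hand sides never form T: when the fact table is much larger than the dimension table (high tuple ratio), that is exactly where the speedups come from.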
Rewrite Rules Applied to Logistic Regression
Key
S: left table
R: right table
K: indicator matrix (0/1s)
T: join output
Performance
Key
F: Factorized
M: Materialized
FR: feature ratio
TR: tuple ratio
Performance
Key
Domain Size: # of unique values
MorpheusFI: Quadratic Feature Interactions for Morpheus
Limitation: Factorized Linear Algebra is restricted to linearity over feature vectors
MorpheusFI
Add quadratic feature interactions into factorized ML by adding two non-linear interaction operators:
● self-interaction within a matrix
● cross-interaction between matrices participating in a join
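A toy sketch of the two operators: the semantics below are assumed from the slide's wording, not MorpheusFI's exact definitions. It shows why they suffice for quadratic features: every pairwise product of the join output's columns lands in either a self-interaction of S, a self-interaction of KR, or a cross-interaction between them.

```python
import numpy as np

def self_interact(M):
    """Self-interaction: all products of column pairs (i <= j) within M."""
    i, j = np.triu_indices(M.shape[1])
    return M[:, i] * M[:, j]

def cross_interact(A, B):
    """Cross-interaction: products of every column of A with every column of B."""
    return (A[:, :, None] * B[:, None, :]).reshape(A.shape[0], -1)

rng = np.random.default_rng(0)
S, R = rng.random((5, 3)), rng.random((4, 2))
K = np.eye(4)[rng.integers(0, 4, size=5)]     # indicator matrix (0/1 rows)
KR = K @ R
T = np.hstack([S, KR])                        # join output, d = 5 columns

# The d(d+1)/2 quadratic features of T decompose into three factorizable parts:
parts = [self_interact(S), cross_interact(S, KR), self_interact(KR)]
assert self_interact(T).shape[1] == sum(p.shape[1] for p in parts)
# Same features, merely reordered (compare each row as a sorted multiset):
assert np.allclose(np.sort(self_interact(T), axis=1),
                   np.sort(np.hstack(parts), axis=1))
```

MorpheusFI's contribution is computing these parts without materializing KR or T; the sketch above materializes them only to verify the decomposition.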
A new abstraction: Interacted Normalized Matrix with the following relationships:
Formal proofs of algebraic rewrite rules
Rewrite rules are extremely complex
Results: Matrix multiplication
Key
LMM: Left matrix mult
RMM: Right matrix mult
p: # of joined tables
q: # of sparse dimension tables
Results: time to convergence
Strengths & Weaknesses: Morpheus / MorpheusFI
Strengths:
- Automatically rewrites any LA computation over a join’s output as LA operations over the base tables
- Decouples factorization and execution; backend-agnostic
- In many cases, it can dramatically improve runtime
- MorpheusFI extends Morpheus to support quadratic feature interactions
Weaknesses:
- It cannot be generalized to support higher-degree interactions
- At the time of publication, offered only a simple heuristics-based approach to optimizing the execution plan
- Only supports ML models that can be expressed in linear algebraic terms
Questions: Morpheus / MorpheusFI
Discussion
Commonalities
Overall objectives:
- Empower database users to use ML from their DBMS
- Avoid the high cost of doing “offline” ML
- Aim for flexibility + generalizability
- Improve efficiency
Optimization via Translation:
- Raven: Translate data and ML operations into a unified IR
- MorpheusFI: Translate ML models into LA operations
Differences
Implementation:
- Raven: major modifications to the DBMS engine
- Rafiki: cloud application, many interfaces
- MorpheusFI: lightweight Python library
Generalizability:
- MorpheusFI: works for any ML model built with LA operators with linear or quadratic feature interactions
- Raven: native support for many popular models and frameworks, and out-of-process execution for all others
- Rafiki: only supports a limited set of models + runtimes
Differences
Inference vs Training:
- Raven: training in the cloud, inference in the DBMS
- Rafiki: training & inference in the cloud, with a DBMS interface
Questions & Discussion
● How will ML be affected by stricter data governance? How can DBs play a role?
● What challenges remain (technical or not) for applying ML in an enterprise setting?
● What challenges can DBs solve?
● What's the role of DBs in ML?
● Reasons not to invest in DBMS + ML?