DBMS + ML
Josh Sennett, Julian Oks
Jan. 29, 2020
TRANSCRIPT
Context + Problem Statement
Context: DBMS + ML
DBMS / RDBMS
- Prevalent in all industries
- Efficient
- Highly reliable & available
- Consistent & Transactional
- Provide concurrency
- Declarative API (SQL)
- Support for versioning, auditing, encryption
Context: DBMS + ML
Machine Learning
- Hardware is expensive and resource usage is non-uniform, but cloud computing makes it affordable
- Data science expertise is expensive too, but ML services and tools aim to make it accessible
- Becoming more mainstream; no longer exclusive to “unicorn” ML applications
- Growing focus on fairness, security, privacy, and auditability
“Typical applications of ML are built by smaller, less experienced teams, yet they have more stringent demands.”
Context: DBMS + ML
Challenges faced in employing machine learning:
- Many ML frameworks
- Many ML algorithms
- Heterogeneous hardware + runtime environments
- Complex APIs, typically requiring data science experts
- Complexity of model selection, training, deployment, and improvement
- Lack of security, auditability, versioning
- Majority of time is spent on data cleaning, preprocessing, and integration
Can DBMS + ML integration address these challenges?
Life without DBMS + ML integration
1. Join tables, export, create features, and do offline ML
Problem: slow, memory + runtime intensive, redundant work
2. Write super-complex SQL UDFs to implement ML models
Problem: large, complex models are hard and often impossible to implement in SQL. Writing models from scratch is expensive and does not take advantage of existing ML frameworks and algorithms
Recent trends in DBMS + ML
Problem Statement
Some Big Picture Questions:
- How do you make ML accessible to a typical database user?
- How do you provide flexibility to use the right ML model?
- How do you support different frameworks, cloud service providers, and hardware and runtime environments?
- Data is often split across tables; can we do prediction without needing to materialize joins across these tables?
- Can DBMS + ML efficiency match (or outperform) ML alone?
Tradeoff: simplicity vs. flexibility
Rafiki: Motivation
Building an ML application is complex, even when using cloud platforms and popular frameworks.
- Training:
  - Many models to choose from
  - Many (critical) hyperparameters to tune
- Inference:
  - Using a single model is faster, but less accurate than using a model ensemble
  - Need to select the right model(s) to trade off accuracy and latency
Rafiki
Rafiki: Approach
Training Service: automate model selection and hyperparameter tuning
- automated search to find the “best” model:
  - highest accuracy
  - lowest memory usage
  - lowest runtime
Inference Service: online ensemble modeling
- automated selection of the “best” model (or ensemble):
  - maximize accuracy
  - minimize excess latency (time exceeding SLO)
Rafiki: Training Service
Automate development and training of a new ML model:
- distributed model selection
- distributed hyper-parameter tuning
Rafiki parallelizes hyper-parameter tuning to reduce tuning time
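As a toy illustration of parallelized tuning in the spirit of Rafiki's training service, the sketch below has a pool of workers evaluate random hyper-parameter configurations concurrently. The search space, synthetic scoring function, and worker count are illustrative assumptions, not Rafiki's actual tuning algorithm.

```python
# Toy parallel hyper-parameter search: workers score configurations
# concurrently; the config with the best score wins. All numbers are
# synthetic stand-ins for real training runs.
from concurrent.futures import ThreadPoolExecutor
import random

def evaluate(config):
    # Stand-in for "train a model, return validation accuracy";
    # this synthetic score peaks at lr=0.1, batch=64.
    lr, batch = config
    return 1.0 - abs(lr - 0.1) - abs(batch - 64) / 256

random.seed(0)
space = [(random.choice([0.001, 0.01, 0.1, 1.0]),
          random.choice([16, 32, 64, 128])) for _ in range(16)]

with ThreadPoolExecutor(max_workers=4) as pool:   # 4 "tuning workers"
    scores = list(pool.map(evaluate, space))

best_config = space[max(range(len(space)), key=scores.__getitem__)]
```

Distributing `evaluate` across machines instead of threads gives the same structure at cluster scale, which is why tuning time drops roughly with the number of workers.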
Rafiki: Inference Service
Automate model selection and scheduling:
- Maximize accuracy
- Minimize latency exceeding SLO
These are typically competing objectives
Other optimizations (similar to Clipper):
- Parallel ensemble modeling
- Batch size selection (throughput vs latency)
- Parameter caching
Model ensembles improve accuracy, but are slower due to stragglers
Rafiki: Inference Service
How does Rafiki optimize model selection for its inference service?
- Pose as optimization problem
- Reinforcement Learning
- Objective is to maximize accuracy while minimizing excess latency (beyond SLO)
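To make the objective concrete, here is a toy sketch that scores candidate ensembles with a simple linear penalty on latency beyond the SLO. The candidate models, numbers, and penalty form are illustrative assumptions; Rafiki's actual reward and RL formulation differ in detail.

```python
# Toy objective: maximize accuracy while penalizing only the latency
# that exceeds the SLO. Penalty form and all numbers are assumed.
def objective(accuracy, latency_ms, slo_ms, penalty_per_ms=0.01):
    excess = max(0.0, latency_ms - slo_ms)   # only time beyond the SLO counts
    return accuracy - penalty_per_ms * excess

# Hypothetical candidates: bigger ensembles are more accurate but slower.
candidates = [
    {"models": ("resnet",),                    "accuracy": 0.90, "latency_ms": 40},
    {"models": ("resnet", "vgg"),              "accuracy": 0.93, "latency_ms": 120},
    {"models": ("resnet", "vgg", "inception"), "accuracy": 0.94, "latency_ms": 260},
]

slo_ms = 150
best = max(candidates,
           key=lambda c: objective(c["accuracy"], c["latency_ms"], slo_ms))
# With a 150 ms SLO the two-model ensemble wins: the three-model
# ensemble's 110 ms of excess latency outweighs its 0.01 accuracy gain.
```

The two objectives compete exactly as the slide says: under a looser SLO the larger ensemble would win instead.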
Rafiki: How does this integrate with DBMS?
- REST API via SQL UDF
CREATE FUNCTION food_name(image_path) RETURNS text AS
BEGIN
    ... -- CALL RAFIKI API --
END;

SELECT food_name(image_path) AS name, count(*)
FROM foodlog
WHERE age > 52
GROUP BY name;
- Python SDK
Interfaces
Web Interface Python UDF & Web API
Strengths & Weaknesses: Rafiki
Strengths:
- Allows users to specify an SLO that’s used for model selection and inference
- Easy to use for a large class of tasks (e.g., regression, classification, NLP)
- Automates and optimizes complex decisions in ML design and deployment
Weaknesses:
- Not very general: you have to use Rafiki’s limited set of models, model selection + tuning algorithms, and model ensemble strategy
- Could be very expensive, since you have to train many models to find the best one
- Rafiki has to compete with other offerings for automated model tuning and model selection services
Questions: Rafiki
Motivation
DBMSs have many advantages:
● High Performance
● Mature and Reliable
● High availability, Transactional, Concurrent Access, ...
● Encryption, Auditing, Familiar & Prevalent, ...
Store and serve models in the DBMS
Question: Can in-RDBMS scoring of ML models match (outperform?) the performance of dedicated frameworks?
Answer: Yes!
Raven
Supports in-DB model inference
Key Features:
● Unified IR enables advanced cross-optimizations between ML and DB operators
● Takes advantage of ML runtimes integrated into Microsoft SQL Server
Background: ONNX (Open Neural Network Exchange)
Standardized ML model representation
Enables portable models, cross-platform inference
Integrated ONNX Runtime into SQL Server
Background: MLflow Model Pipelines
Model Pipeline contains:
● Trained Model
● Preprocessing Steps
● Dependencies
Pipeline is stored in the RDBMS
A SQL query can invoke a Model Pipeline
Defining a Model Pipeline
Using a Model Pipeline
Raven’s IR
Uses both ML and relational operators
Raven’s IR
Operator categories:
● Relational Algebra
● Linear Algebra
● Other ML and data featurizers (e.g., scikit-learn operations)
● UDFs, for when the static analyzer is unable to map operators
Cross-Optimization
Cross-Optimization
DBMS + ML optimizations:
● Predicate-based model pruning: conditions are pushed into the decision tree, allowing subtrees to be pruned
● Model-projection pushdown: unused features are discarded in the query plan
● NN translation: transform operations/models into equivalent NNs to leverage an optimized runtime
● Model / query splitting: the model can be partitioned
● Model inlining: ML operators transformed into relational ones (e.g., small decision trees can be inlined)

Other standard DB and compiler optimizations (constant folding, join elimination, ...)
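To illustrate two of these rewrites, the toy sketch below applies predicate-based pruning and model inlining to a hand-written decision tree. The tree, feature names, and helper functions are assumptions for illustration; Raven performs such rewrites on its unified IR, not on Python dicts.

```python
# Toy decision tree as nested dicts (features and thresholds assumed).
tree = {"feat": "age", "thresh": 50,
        "lo": {"leaf": 0},                         # age <= 50
        "hi": {"feat": "bmi", "thresh": 30,        # age > 50
               "lo": {"leaf": 0}, "hi": {"leaf": 1}}}

def prune(node, feat, lower_bound):
    """Predicate-based model pruning: if the query guarantees
    feat > lower_bound, branches requiring feat <= thresh are dead."""
    if "leaf" in node:
        return node
    lo = prune(node["lo"], feat, lower_bound)
    hi = prune(node["hi"], feat, lower_bound)
    if node["feat"] == feat and lower_bound >= node["thresh"]:
        return hi                                  # the <= branch is unreachable
    return {**node, "lo": lo, "hi": hi}

def tree_to_sql(node):
    """Model inlining: emit a small tree as a relational CASE expression."""
    if "leaf" in node:
        return str(node["leaf"])
    return (f"CASE WHEN {node['feat']} <= {node['thresh']} "
            f"THEN {tree_to_sql(node['lo'])} ELSE {tree_to_sql(node['hi'])} END")

pruned = prune(tree, "age", 52)    # e.g. the query has WHERE age > 52
sql = tree_to_sql(pruned)          # "CASE WHEN bmi <= 30 THEN 0 ELSE 1 END"
```

After pruning, the whole `age` split disappears and the remaining stump inlines into a single CASE expression the relational optimizer can handle like any other projection.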
Runtime Code Generation
A new SQL query is generated from the optimized IR, and invoked on the integrated SQL Server+ONNX Runtime engine.
Raven: Full Pipeline
Query Execution
The generated SQL query is executed in one of three ways:
1. In-process execution (Raven): uses the integrated PREDICT statement
2. Out-of-process execution (Raven Ext.): unsupported pipelines are executed as an external script
3. Containerized execution: if unable to run Raven Ext., run within Docker
Results
Effects of some optimizations:
Results: does Raven outperform ONNX?
Strengths & Weaknesses: Raven
Strengths:
- Brings the advantages of a DBMS to machine learning models
- Raven’s cross-optimizations and ONNX integration make inference faster
- Very general: natively supports many ML frameworks, runtimes, and specialized hardware, and provides containerized execution for all others

Weaknesses:
- Only compatible with SQL Server
- Limited to inference; it does not facilitate training
- Limits to static analysis (e.g., analysis of for loops & conditionals)
Questions: Raven
Motivation: a typical ML pipeline
Collect Data → Materialize Joins → ML / LA (Linear Algebra) Operations
Morpheus: Factorized ML
Idea: avoid materializing joins by pushing ML computations through joins
Any Linear Algebra computation over the join output can be factorized in terms of
the base tables
Morpheus (2017): Factor operations over a “normalized matrix” using a framework of rewrite rules.
Morpheus
The Normalized Matrix
A logical data type that represents joined tables
Consider a PK-FK join between two tables S, R
The normalized matrix is the triple (S, K, R)
Where K is an indicator matrix, with K[i, j] = 1 if row i of S joins with row j of R (and 0 otherwise)
The output of the join, T, is then T = [S, KR] (column-wise concatenation)
Key
S: left table
R: right table
K: indicator matrix (0/1s)
T: join output
Rewrite Rules
● Element-wise operators: f(T) → (f(S), K, f(R))
● Aggregators:
○ rowSum(T) → rowSum(S) + K rowSum(R)
○ colSum(T) → [colSum(S), colSum(K)R]
○ sum(T) → sum(S) + colSum(K) rowSum(R)
● Left Matrix Multiplication: TX → SX[1 : dS, ] + K(RX[dS + 1 : d, ])
● Matrix Inversion:
Key
S: left table
R: right table
K: indicator matrix (0/1s)
T: join output
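The rewrite rules can be sanity-checked numerically: on random base tables, every factorized expression below matches the same computation over the materialized join output T. The toy sizes and NumPy encoding are assumptions for illustration, not Morpheus's implementation.

```python
import numpy as np

# Build a PK-FK join as a normalized matrix (S, K, R) and check the
# rewrite rules against the materialized output T = [S, KR].
rng = np.random.default_rng(0)
nS, dS, nR, dR = 6, 3, 4, 2
S = rng.random((nS, dS))                      # left table features
R = rng.random((nR, dR))                      # right table features
fk = rng.integers(0, nR, size=nS)             # foreign keys from S into R
K = np.zeros((nS, nR))
K[np.arange(nS), fk] = 1.0                    # indicator matrix (0/1s)

T = np.hstack([S, K @ R])                     # materialized join output

X = rng.random((dS + dR, 5))
# rowSum(T) = rowSum(S) + K rowSum(R)
assert np.allclose(S.sum(axis=1) + K @ R.sum(axis=1), T.sum(axis=1))
# colSum(T) = [colSum(S), colSum(K) R]
assert np.allclose(np.concatenate([S.sum(axis=0), K.sum(axis=0) @ R]),
                   T.sum(axis=0))
# sum(T) = sum(S) + colSum(K) rowSum(R)
assert np.isclose(S.sum() + K.sum(axis=0) @ R.sum(axis=1), T.sum())
# TX = S X[1:dS, ] + K (R X[dS+1:d, ])
assert np.allclose(S @ X[:dS] + K @ (R @ X[dS:]), T @ X)
```

Note that the factorized right-hand sides never form T: when the fact table is much larger than the dimension table (high tuple ratio), that is exactly where the speedups come from.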
Rewrite Rules Applied to Logistic Regression
Key
S: left table
R: right table
K: indicator matrix (0/1s)
T: join output
Performance
Key
F: Factorized
M: Materialized
FR: feature ratio
TR: tuple ratio
Performance
Key
Domain Size: # of unique values
MorpheusFI: Quadratic Feature Interactions for Morpheus
Limitation: Factorized Linear Algebra is restricted to linearity over feature vectors
MorpheusFI
Add quadratic feature interactions into factorized ML by adding two non-linear interaction operators:
● self-interaction within a matrix
● cross-interaction between matrices participating in a join
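A toy sketch of the two operators: the semantics below are assumed from the slide's wording, not MorpheusFI's exact definitions. It shows why they suffice for quadratic features: every pairwise product of the join output's columns lands in either a self-interaction of S, a self-interaction of KR, or a cross-interaction between them.

```python
import numpy as np

def self_interact(M):
    """Self-interaction: all products of column pairs (i <= j) within M."""
    i, j = np.triu_indices(M.shape[1])
    return M[:, i] * M[:, j]

def cross_interact(A, B):
    """Cross-interaction: products of every column of A with every column of B."""
    return (A[:, :, None] * B[:, None, :]).reshape(A.shape[0], -1)

rng = np.random.default_rng(0)
S, R = rng.random((5, 3)), rng.random((4, 2))
K = np.eye(4)[rng.integers(0, 4, size=5)]     # indicator matrix (0/1 rows)
KR = K @ R
T = np.hstack([S, KR])                        # join output, d = 5 columns

# The d(d+1)/2 quadratic features of T decompose into three factorizable parts:
parts = [self_interact(S), cross_interact(S, KR), self_interact(KR)]
assert self_interact(T).shape[1] == sum(p.shape[1] for p in parts)
# Same features, merely reordered (compare each row as a sorted multiset):
assert np.allclose(np.sort(self_interact(T), axis=1),
                   np.sort(np.hstack(parts), axis=1))
```

MorpheusFI's contribution is computing these parts without materializing KR or T; the sketch above materializes them only to verify the decomposition.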
A new abstraction: Interacted Normalized Matrix with the following relationships:
Formal proofs of algebraic rewrite rules
Rewrite rules are extremely complex
Results: Matrix multiplication
Key
LMM: Left matrix mult
RMM: Right matrix mult
p: # of joined tables
q: # of sparse dimension tables
Results: time to convergence
Strengths & Weaknesses: Morpheus / MorpheusFI
Strengths:
- Automatically rewrites any LA computation over a join’s output as LA operations over the base tables
- Decouples factorization and execution; backend-agnostic
- In many cases, it can dramatically improve runtime
- MorpheusFI extends Morpheus to support quadratic feature interactions
Weaknesses:
- It cannot be generalized to support higher-degree interactions
- At the time of publication, offered only a simple heuristics-based approach to optimizing the execution plan
- Only supports ML models that can be expressed in linear algebraic terms
Questions: Morpheus / MorpheusFI
Discussion
Commonalities
Overall objectives:
- Empower database users to use ML from their DBMS
- Avoid the high cost of doing “offline” ML
- Aim for flexibility + generalizability
- Improve efficiency
Optimization via Translation:
- Raven: Translate data and ML operations into a unified IR
- MorpheusFI: Translate ML models into LA operations
Differences
Implementation:
- Raven: major modifications to the DBMS engine
- Rafiki: cloud application, many interfaces
- MorpheusFI: lightweight Python library
Generalizability:
- MorpheusFI: works for any ML model built with LA operators with linear or quadratic feature interactions
- Raven: native support for many popular models and frameworks, and out-of-process execution for all others
- Rafiki: only supports a limited set of models + runtimes
Differences
Inference vs Training:
- Raven: training in the cloud, inference in the DBMS
- Rafiki: training & inference in the cloud, with a DBMS interface
Questions & Discussion
● How will ML be affected by stricter data governance? How can DBs play a role?
● What challenges remain (technical or not) for applying ML in an enterprise setting?
● What challenges can DBs solve?
● What's the role of DBs in ML?
● Reasons not to invest in DBMS + ML?