talking to the machines: monitoring production machine learning … · 2020. 9. 16. ·...

22
CONFIDENTIAL 2019 Talking to the Machines: Monitoring Production Machine Learning Systems Ting-Fang Yen, Director of Research, DataVisor September 12, 2019

Upload: others

Post on 30-Dec-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019

Talking to the Machines: Monitoring Production Machine Learning Systems

Ting-Fang Yen, Director of Research, DataVisor

September 12, 2019

Page 2: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019 2

Machine Learning Applications Everywhere

Page 3: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019

Production ML Pipeline is Complex

3

Sculley, D., et al. "Hidden technical debt in machine learning systems." NIPS. 2015.

“The required surrounding infrastructure is vast and complex.”

Page 4: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019 4

Machine Learning Pipeline at DataVisor

Real-time fraud detection service powered by unsupervised machine learning (UML) technologyBillions of application-level events processed daily

Correlation engine of billions of users

Temporal Eventseq

Velocity / freq

Spatial / geo

Domainattributes

Graphattributes

Data Collection and Storage

Data Processing

Feature Generation

UML Clustering Analysis

Raw Data

Case Management UI / API

Supervised Machine Learning

Automated Rule Engine

GIN

DataVisor Global Intelligence Network

Page 5: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019 5

What Can Go Wrong?

- Misconfigurations / Buggy pipeline- Input and output location- Module dependencies- Version control

- Resource Management- Out of memory- Out of disk- Job scheduling- Other job failures

- Data Issues- Data upload problems- Data format or parsing error

- Bugs in code- Runtime exceptions

- Model accuracy- …

Page 6: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019 6

Monitor Production ML Pipelines

Page 7: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019 7

Monitor Production ML Pipelines

Page 8: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019 8

Making Job Scheduling/Dependencies Simpler

DataVisor SparkGen: One job per cluster• High utilization

• No inter-job interference• Maximize job concurrency• Low maintenance

Time

Time

Sequential

One Job Per Cluster

A

D

C F G

B E

J

K

I

M

L

H

A

B

C

D

E

F G J

H

K M

I L

A B C D E F G H I LJ K M

Multiple Clusters

Time

A B

C

D E

F G

H

I

LJ

K

M

Page 9: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019 9

What About Model Prediction Quality?

Common approaches for measuring model prediction quality• Labeled data

• Holdout testing• Business metrics• Manual review

Can we come up with other metrics for evaluating machine learning models?

ML Service

ModelData

Predictions

95DataData ???

Page 10: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019 10

Defining Model Quality…

• Comparing with another model (the “surrogate”) • For example, a “simplified” model is used for better runtime performance. Run the

“full” model to monitor performance of “simplified” model

• Goodness of clustering• E.g., Silhouette index, prediction boundaries

• Quantify model “drift”• Model-specific metrics?• Meta-data about model internals and output

• Detections triggered by each feature• Number of other users a given user is associated with • Detection volume by attack type / category• Prediction score distribution• …

Page 11: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019 11

Our Approach

An anomaly detection problem• Assume model at deployment time is “good”

• Monitor for subsequent changes in model output

Page 12: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019 12

Time Series Decomposition

Seasonal decomposition of time series by Loess (STL)• Additive decomposition: Yt = Tt + St + Rt

References:• Cleveland et al. “STL: a seasonal-trend

decomposition” Journal of official statistics 6(1) pp. 3-73, 1990

• Vallis et al. "A Novel Technique for Long-Term Anomaly Detection in the Cloud" HotCloud,2014.

Trend (Tt)

Seasonal (St)

Residual (Rt)

Page 13: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019 13

Time Series Decomposition – with Trend Replacement

References:Hochenbaum et al. “Automatic Anomaly Detection in the Cloud Via Statistical Learning.” arXiv preprint arXiv:1704.07706, 2017

Residual (Rt) = Original (Yt) – Trend (Tt) – Seasonal (St)

Page 14: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019 14

Time Series Anomaly Detection: Residual Component

MAD statistic: Median absolute deviation• Median value of the difference between each data point and the median

• A robust measure of data variability

Anomalies: Distance to median > X * MAD

Median valueX*MAD

Page 15: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019 15

Breakouts: Gradual Changes

Could indicate subtle changes in model behavior

Apply heuristics on ”trend” component• Derivative of signal has the same sign over multiple,

consecutive data points

• Recent rate of change is large

Page 16: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019 16

Putting it Together

Page 17: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019 17

Investigating Anomalies

Group together “similar” time series• (Pearson) Correlation distance

• Clustering

• Rank time series in the same cluster as anomalous time series

Page 18: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019 18

Investigating Anomalies (cont’d)

• Multidimensional scaling (MDS) for plotting

Page 19: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019 19

Investigating Anomalies (cont’d)

Store results as Parquet filesLoad as PySpark dataframe (supports SQL-like operations)

Lowers technical barrier for debugging machine learning models

Page 20: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019 20

Lessons Learned

• Build resilient, modular pipelines• Decouple model decisions from data engineering• Decouple data from code, features from data

• Establish process for incident response, assign module owners• Encourage cross-team communication

• Explore alternative model evaluation metrics that fit your application/business goals

Page 21: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019

Thank you!

https://www.datavisor.com

[email protected]

Page 22: Talking to the Machines: Monitoring Production Machine Learning … · 2020. 9. 16. · Intelligence Network. CONFIDENTIAL 2019 5 What Can Go Wrong? ... • E.g., Silhouette index,

CONFIDENTIAL 2019

“By 2021, 50% of enterprises will have added unsupervised machine learning to their fraud detection solution suites. “

- Begin Investing Now in Enhanced Machine Learning Capabilities for Fraud Detection’, Gartner Research

22