analytics for large-scale time series and event data

1

Building Anomaly Detection For Large Scale Analytics

Ira Cohen, Chief Data Scientist

16th May, 2016

2

Outline

Anomaly detection? Why do I need it?

Design principles for Anomaly Detection

What is anomaly detection?

Anomaly Detection Methods

The Anodot System

3

Why Anomaly Detection?

4

Detecting the Unknowns Saves Time + Money

Industrial IoT: Proactive Maintenance. Detecting issues in factories/machines

Web Services: Detecting business incidents + unknown business opportunities

Machine Learning: Closing the “Machine Learning” loop. Tracking and detecting “unknowns” not modeled during training

Security: Detection of unknown breach/attack patterns

5

Business Incidents - More go undetected as the business grows


6

Detecting Business Incidents: Metric Driven Detection

Business: e.g., Purchases per product, Conversions per campaign…

Business Generation (leads, visitors, usage, engagements): Per Geo, user segment, page, browser, device…

App (performance, errors, usability): Per class, method, feature…

Infra utilization/state (middleware, network, system): Per host, database, switch…

7

Detecting Business Incidents: Metric Driven Detection

Drop in # of visitors

Decrease in ad conversion on Android

Price glitch – increase in purchases / decrease in revenue

8

Manual Detection of Business Incidents

Dashboards

Setting alerts with thresholds

9

Manual Solutions: Drowning in a “Sea of Data”

MISSED INCIDENTS

FALSE ALARMS

GENUINE ALERTS

Too many parameters to set thresholds

Too much data to analyze in real time

10

What is Anomaly Detection?

11

Find the Anomaly

12

Anomaly Detection


• Ill posed problem

• What is an anomaly?

13

Anomaly Detection in Time Series Signals

An unexpected change in the temporal pattern of one or more time series signals.

14

Anomaly detection: Design Principles

15

Anomaly Detection: Design Considerations

Timeliness: Real time vs. Retroactive Detection

Scale: 100’s vs. Millions of metrics

Rate of change: Adaptive vs. Offline learning

Conciseness: Univariate vs. Multivariate methods

Well defined incidents? Supervised vs. Unsupervised methods

16

Timeliness: Real time vs. Retroactive Detection

Real time decision making

• Reduction in visitors/revenues: check for bugs
• Increase in product purchase: increase inventory
• Increase in ad conversion w/o increase in impressions: check for fraud

Non-real time decision making

• Capacity Planning
• Marketing budget allocations
• Data Cleaning
• Scheduled Maintenance

17

Timeliness: Real time vs. Retroactive Detection

Real time decision making

• Online learning: cannot iterate over the data
• More prone to False Positives
• Scales more easily

Non-real time decision making

• Batch learning: can iterate over the data
• Easier to remove False Positives
• Poor scaling

18

Rate of change

Constant change

• Most common case
• Requires adaptive algorithms

Very slow change

• “Closed” systems – e.g., airplanes, large machinery
• Learn once and apply the model for a long time

19

Conciseness of Anomalies

Univariate Anomaly Detection

• Learn normal model for each metric
• Anomaly detection at the metric level
• Easier to scale
• Causes anomaly storms: can’t see the forest for the trees
• Easier to model many types of behaviors

Multivariate Anomaly Detection

• Learn single model for all metrics
• Anomaly detection of complete incident
• Hard to scale
• Hard to interpret the anomaly
• Often requires metric behaviour to be homogeneous

Hybrid approach

• Learn normal model for each metric
• Combine anomalies to single incidents if metrics are related
• Scalable
• Can combine multiple types of metric behaviours
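
The "combine anomalies to single incidents" step of the hybrid approach can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say how relatedness is determined, so here it is a hypothetical hand-written set of metric pairs, and two anomalies are merged only when their metrics are related and their time ranges overlap. The metric names and timestamps are invented.

```python
from collections import defaultdict


def group_into_incidents(anomalies, related):
    """Merge per-metric anomalies into incidents.

    anomalies: list of (metric_name, start_time, end_time) tuples.
    related:   set of frozenset({metric_a, metric_b}) pairs (assumed given).
    Returns a sorted list of incidents, each a sorted list of metric names.
    """
    parent = list(range(len(anomalies)))  # union-find forest over anomalies

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, (m1, s1, e1) in enumerate(anomalies):
        for j in range(i + 1, len(anomalies)):
            m2, s2, e2 = anomalies[j]
            overlaps = s1 <= e2 and s2 <= e1
            if overlaps and frozenset((m1, m2)) in related:
                parent[find(i)] = find(j)  # same incident

    incidents = defaultdict(list)
    for i, (metric, _, _) in enumerate(anomalies):
        incidents[find(i)].append(metric)
    return sorted(sorted(metrics) for metrics in incidents.values())


# Hypothetical anomalies detected independently by univariate models.
anomalies = [
    ("checkout.errors", 100, 160),
    ("checkout.latency", 110, 170),
    ("revenue", 120, 200),
    ("disk.usage", 115, 130),
]
related = {
    frozenset(("checkout.errors", "checkout.latency")),
    frozenset(("checkout.latency", "revenue")),
}
print(group_into_incidents(anomalies, related))
```

Four separate alerts collapse into two incidents: the three related checkout/revenue metrics form one, and the unrelated disk metric stays on its own.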

20

Well defined incidents?

Yes - Supervised methods

• Requires a well defined set of incidents to identify
• Learning a model to classify samples as normal or abnormal
• Requires labeled examples of anomalies
• Cannot detect new types of incidents

No - Unsupervised methods

• Learning a normal model only
• Statistical test to detect anomalies
• Can detect any type of anomaly, known or unknown

Semi-Supervised methods

• Use a few labelled examples to improve detection of unsupervised methods.
• Or – use unsupervised detection for unknown cases, supervised detection to classify already known cases.

21

Anomaly Detection Methods

22

Unsupervised Anomaly Detection

General scheme

Step 1: Model the normal behavior of the metric(s) using a statistical model.

Step 2: Devise a statistical test to determine if samples are explained by the model.

Step 3: Apply the test for each sample. Flag as anomaly if it does not pass the test.

23

Very Simple Model

[Figure: Normal distribution centered at μ; 68% of samples fall within 1σ, 95.4% within 2σ, 99.7% within 3σ]

Assume normal behavior is the Normal distribution

Estimate the average and standard deviation over all samples

Test: any sample with |x - average| > 3 * standard deviation is abnormal
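
A minimal sketch of this three-sigma test, using only the Python standard library. The function name, the `k` parameter and the sample data are illustrative assumptions, not part of the slides.

```python
import statistics


def three_sigma_anomalies(samples, k=3.0):
    """Return indices of samples more than k standard deviations from the mean."""
    average = statistics.fmean(samples)
    stddev = statistics.pstdev(samples)  # estimated over all samples
    return [i for i, x in enumerate(samples) if abs(x - average) > k * stddev]


# Hypothetical metric: 30 readings around 10, followed by one spike.
readings = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.3, 9.7, 10.1, 9.9] * 3 + [25.0]
print(three_sigma_anomalies(readings))  # [30]
```

Note that the spike itself is included when estimating the average and standard deviation, so a large enough outlier widens the band it is tested against.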

24

A single model does not fit them all!

Smooth (stationary)

Irregular sampling

Multi Modal

Sparse

Discrete

“Step”

25

Metric types distribution

Based on 50,000,000 metrics sampled from dozens of companies

Nearly constant, 2%

Discrete, 15%

Sparse, 3%

Multi Modal, 5%

Smooth, 38%

Irregular sampling, 37%

All Industries

26

Example: The importance of modeling seasonality

Single seasonal pattern

27


Multiple seasonal patterns (“Amplitude modulation”)

28


Multiple seasons – Additive signals

29

Seasonality Distribution

Season: 3 hours, 2%

Season: 12 hours, 1%

Season: 2 hours, 1%

Season: 1 hour, 1%

Season: 6 hours, 0.5%

Season: 4 hours, 0.2%

Season: 5 hours, 0.1%

Season: 24 hours, 69%

Season: Weekly, 26%

Note: Only 14% of the metrics have a season

30

Example Methods to detect seasonality

Finding maxima in the auto-correlation of the signal

• Computationally expensive
• More robust to gaps

Finding maximum(s) in the Fourier transform of the signal

• Challenging to detect low frequency seasons
• Challenging to discover multiple seasons
• Sensitive to missing data

Exhaustive search based on a cost function

• Computationally expensive
• Robust to gaps
• Challenging to discover multiple seasons
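
As an illustration of the auto-correlation approach, here is a minimal sketch that picks the lag with the strongest local peak in the sample auto-correlation. It assumes a regularly sampled signal with no gaps and a single dominant season; the function names and the synthetic hourly signal are assumptions for the example. The direct computation is quadratic in the signal length, which matches the "computationally expensive" remark.

```python
import math


def autocorrelation(signal, lag):
    """Sample auto-correlation of the signal at the given lag (biased estimator)."""
    n = len(signal)
    mean = sum(signal) / n
    variance = sum((x - mean) ** 2 for x in signal)
    if variance == 0:
        return 0.0
    covariance = sum(
        (signal[i] - mean) * (signal[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


def dominant_season(signal, min_lag=2, max_lag=None):
    """Return the lag of the highest local maximum of the auto-correlation."""
    max_lag = max_lag or len(signal) // 2
    acf = {lag: autocorrelation(signal, lag) for lag in range(min_lag - 1, max_lag + 2)}
    peaks = [
        lag
        for lag in range(min_lag, max_lag + 1)
        if acf[lag] > acf[lag - 1] and acf[lag] >= acf[lag + 1]
    ]
    if not peaks:
        return None  # no seasonal pattern found
    return max(peaks, key=lambda lag: acf[lag])


# Hypothetical metric: ten days of hourly samples with a daily cycle.
hourly = [math.sin(2 * math.pi * t / 24) for t in range(240)]
print(dominant_season(hourly))  # 24
```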

31

Real time detection @ scale = Online learning algorithms

1. Initialize model

2. For each new sample test if anomaly

3. Update model parameters with each new sample

32

Example Online Models/Algorithms

• Simple Moving Average

• Double/Triple exponential (Holt-Winters)

• Kalman Filters + ARIMA and variations

• Single exponential forgetting

33

Example: Simple exponential forgetting (Normal distribution model)

Define alpha – forgetting factor

Compute initial average, sumOfSquares using initial samples

For each new sample, x[t]:
    If |x[t] - average[t-1]| > 3 * Stddev[t-1]:
        Flag x[t] as an anomalous sample
    average[t] = alpha * x[t] + (1 - alpha) * average[t-1]
    sumOfSquares[t] = alpha * x[t]^2 + (1 - alpha) * sumOfSquares[t-1]
    Stddev[t] = sqrt(sumOfSquares[t] - average[t]^2)
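
The pseudocode above can be sketched in Python as follows. Two assumptions are made that the slide leaves open: `sumOfSquares` is initialized as the mean of the squared initial samples (so that it is on the same scale as the exponentially weighted update), and the variance is clamped at zero to guard against floating-point rounding. The class name, the default `alpha` and the sample values are illustrative.

```python
import math


class ExponentialForgettingDetector:
    """Online three-sigma detector with exponentially forgotten mean and variance."""

    def __init__(self, initial_samples, alpha=0.05, k=3.0):
        n = len(initial_samples)
        self.alpha = alpha  # forgetting factor
        self.k = k          # number of standard deviations allowed
        self.average = sum(initial_samples) / n
        # Assumption: start from the mean of squares of the initial samples.
        self.sum_of_squares = sum(x * x for x in initial_samples) / n

    @property
    def stddev(self):
        # Clamp at zero: rounding can make the difference slightly negative.
        return math.sqrt(max(self.sum_of_squares - self.average ** 2, 0.0))

    def update(self, x):
        """Test x against the previous model state, then update the model."""
        is_anomaly = abs(x - self.average) > self.k * self.stddev
        a = self.alpha
        self.average = a * x + (1 - a) * self.average
        self.sum_of_squares = a * x * x + (1 - a) * self.sum_of_squares
        return is_anomaly


detector = ExponentialForgettingDetector([10, 11, 9, 10, 12, 8, 10, 11, 9, 10])
flags = [detector.update(x) for x in [10, 11, 9, 30, 10]]
print(flags)  # [False, False, False, True, False]
```

Each sample is tested against the state from the previous step and only then folded into the model, so the detector needs constant memory per metric.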

34

Update rate with online models: Avoiding pitfalls

What should be the learning rate?

Too Slow

Too Fast

35


What should be the learning rate?

“Al Dente”

Auto tuning required!

36


How to update a model when there is an anomaly?

Strategy A: Update as usual

Most of the anomaly is missed

37


How to update a model when there is an anomaly?

Strategy B: Adapt the learning rate

Full anomaly captured
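
A minimal sketch contrasting the two strategies on the exponential forgetting model. The slides do not specify how the learning rate is adapted; this example assumes the simplest possible rule, which is to switch to a much smaller forgetting factor (`anomaly_alpha`, a hypothetical parameter) while a sample is flagged. The values of `alpha`, `anomaly_alpha` and the data are chosen only to make the effect visible.

```python
import math


def detect(initial, stream, alpha, anomaly_alpha, k=3.0):
    """Exponential forgetting detector whose update rate depends on the flag.

    With anomaly_alpha == alpha this is Strategy A (update as usual);
    with anomaly_alpha << alpha the model barely moves during an anomaly.
    """
    n = len(initial)
    average = sum(initial) / n
    mean_sq = sum(x * x for x in initial) / n
    flags = []
    for x in stream:
        stddev = math.sqrt(max(mean_sq - average ** 2, 0.0))
        is_anomaly = abs(x - average) > k * stddev
        flags.append(is_anomaly)
        a = anomaly_alpha if is_anomaly else alpha  # adapt the learning rate
        average = a * x + (1 - a) * average
        mean_sq = a * x * x + (1 - a) * mean_sq
    return flags


initial = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
stream = [10, 30, 30, 30, 30, 10]  # hypothetical anomaly lasting four samples

strategy_a = detect(initial, stream, alpha=0.2, anomaly_alpha=0.2)
strategy_b = detect(initial, stream, alpha=0.2, anomaly_alpha=0.01)
print(strategy_a)  # [False, True, False, False, False, False]
print(strategy_b)  # [False, True, True, True, True, False]
```

With the usual update, the first anomalous sample inflates the average and standard deviation so much that the rest of the anomaly passes the test. With the reduced rate, all four anomalous samples are flagged.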

38

Batch models

1. Collect historical samples

2. Segment samples to similarly behaving segments

3. Cluster segments according to some similarity measure

4. Mark as anomalies segments that are in small or no clusters
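
The four batch steps can be sketched as follows. Every concrete choice here is an assumption made for illustration: fixed-length segments, a (mean, standard deviation) feature pair per segment, Euclidean distance as the similarity measure, and a simple greedy "leader" clustering instead of the methods named on the next slide.

```python
import math
import statistics


def segment(samples, length):
    """Step 2: cut the history into consecutive fixed-length segments."""
    return [samples[i:i + length] for i in range(0, len(samples) - length + 1, length)]


def features(seg):
    """Summarize a segment by its mean and standard deviation."""
    return (statistics.fmean(seg), statistics.pstdev(seg))


def leader_cluster(points, radius):
    """Step 3: greedy clustering. A point joins the first cluster whose
    leader lies within `radius`; otherwise it starts a new cluster."""
    leaders, members = [], []
    for index, point in enumerate(points):
        for c, leader in enumerate(leaders):
            if math.dist(point, leader) <= radius:
                members[c].append(index)
                break
        else:
            leaders.append(point)
            members.append([index])
    return members


def anomalous_segments(samples, length, radius, min_cluster_size=2):
    """Step 4: return indices of segments that fall in small clusters."""
    clusters = leader_cluster([features(s) for s in segment(samples, length)], radius)
    return sorted(i for c in clusters if len(c) < min_cluster_size for i in c)


# Step 1: hypothetical history with one unusual stretch (segment index 5).
history = [10, 11, 9, 10] * 5 + [50, 52, 49, 51] + [10, 11, 9, 10] * 2
print(anomalous_segments(history, length=4, radius=2.0))  # [5]
```

Because the clustering needs the whole history at once, this sketch cannot flag a sample as it arrives, which is the real time versus batch trade-off described earlier.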

39

Example Batch Anomaly Detection Methods

Multi-modal distributions:

• Gaussian models
• Generalized mixture models

One-class SVM

PCA

Clustering methods (K-Means, DBSCAN, Mean-Shift)

MOST COMMON IN USE

Hidden Markov Models

40

Anomaly detection methods - examples

NAME            ADAPTIVE?  REALTIME?  SCALABLE?  UNI/MULTIVARIATE
Holt-Winters    Yes        Yes        Yes        Univariate
ARIMA + Kalman  Yes        Yes        Yes        Both
HMM             No         Yes        No         Multivariate
GMM             No         No         No         Both
DBSCAN          No         No         No         Multivariate
K-Means         No         No         No         Multivariate

41

Large scale anomaly detection – the Anodot system

42

Automatic Anomaly Detection in five Steps: The Anodot Way

1. Metrics Collection – Universal, scale to millions

2. Normal behavior learning

3. Abnormal behavior learning

4. Behavioral Topology Learning

5. Feedback Based Learning

43

Large Scale Anomaly Detection System Architecture

[Architecture diagram. Components shown: Customer DS, Agent, Anodotd, REST, WebApp, Kafka Events Queue, Aggregator, Online Base Line Learning, Anomaly Grouping, Signals Correlation Map, Real-Time Rollups Store, Cassandra, Elasticsearch, DWH S3, HADOOP, HIVE, Offline Learning, Management & Portal, Anodot-Web, User Mgmt, RDBMS]

44


Thank you
