Analytics for large-scale time series and event data

Building Anomaly Detection For Large Scale Analytics. Ira Cohen, Chief Data Scientist. 16th May, 2016

Upload: anodot

Post on 13-Feb-2017


Page 1: Analytics for large-scale time series and event data

Building Anomaly Detection For Large Scale Analytics

Ira Cohen, Chief Data Scientist

16th May, 2016

Page 2

Outline

• Anomaly detection? Why do I need it?
• What is anomaly detection?
• Design principles for Anomaly Detection
• Anomaly Detection Methods
• The Anodot System

Page 3

Why Anomaly Detection?

Page 4

Detecting the Unknowns Saves Time + Money

• Industrial IoT: Proactive maintenance. Detecting issues in factories/machines.
• Web Services: Detecting business incidents + unknown business opportunities.
• Machine Learning: Closing the "Machine Learning" loop. Tracking and detecting "unknowns" not modeled during training.
• Security: Detection of unknown breach/attack patterns.

Page 5

Business Incidents - More go undetected as the business grows

[Figure: cost labels $$, $$$ and $$$$]

Page 6

Detecting Business Incidents: Metric Driven Detection

• Business: e.g., purchases per product, conversions per campaign…
• Business generation (leads, visitors, usage, engagements): per geo, user segment, page, browser, device…
• App (performance, errors, usability): per class, method, feature…
• Infra utilization/state (middleware, network, system): per host, database, switch…

Page 7

Detecting Business Incidents: Metric Driven Detection

• Drop in # of visitors
• Decrease in ad conversion on Android
• Price glitch – increase in purchases / decrease in revenue

Page 8

Manual Detection of Business Incidents

• Dashboards
• Setting alerts with thresholds

Page 9

Manual Solutions: Drowning in a "Sea of Data"

[Figure labels: MISSED INCIDENTS, FALSE ALARMS, GENUINE ALERTS]

• Too many parameters to set thresholds
• Too much data to analyze in real time

Page 10

What is Anomaly Detection?

Page 11

Find the Anomaly

Page 12

Anomaly Detection

• Ill-posed problem
• What is an anomaly?

Page 13

Anomaly Detection in Time Series Signals

An unexpected change in the temporal pattern of one or more time series signals.

Page 14

Anomaly detection: Design Principles

Page 15

Anomaly Detection: Design Considerations

• Timeliness: Real time vs. retroactive detection
• Scale: 100s vs. millions of metrics
• Rate of change: Adaptive vs. offline learning
• Conciseness: Univariate vs. multivariate methods
• Well defined incidents? Supervised vs. unsupervised methods

Page 16

Timeliness: Real time vs. Retroactive Detection

Real time decision making
• Reduction in visitors/revenues → check for bugs
• Increase in product purchase → increase inventory
• Increase in ad conversion w/o increase in impressions → check for fraud

Non-real time decision making
• Capacity planning
• Marketing budget allocations
• Data cleaning
• Scheduled maintenance

Page 17

Timeliness: Real time vs. Retroactive Detection

Real time decision making
• Online learning: cannot iterate over the data
• More prone to false positives
• Scales more easily

Non-real time decision making
• Batch learning: can iterate over the data
• Easier to remove false positives
• Poor scaling

Page 18

Rate of change

Constant change
• Most common case
• Requires adaptive algorithms

Very slow change
• "Closed" systems – e.g., airplanes, large machinery
• Learn once and apply the model for a long time

Page 19

Conciseness of Anomalies

Univariate Anomaly Detection
• Learn a normal model for each metric
• Anomaly detection at the metric level
• Easier to scale
• Causes anomaly storms: can't see the forest for the trees
• Easier to model many types of behaviors

Multivariate Anomaly Detection
• Learn a single model for all metrics
• Anomaly detection of the complete incident
• Hard to scale
• Hard to interpret the anomaly
• Often requires metric behaviour to be homogeneous

Hybrid approach
• Learn a normal model for each metric
• Combine anomalies into single incidents if metrics are related
• Scalable
• Can combine multiple types of metric behaviours
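The "combine anomalies into single incidents" step of the hybrid approach can be sketched roughly as below. This is an illustration only, not the Anodot implementation: the `Anomaly` record, the `related` set of metric pairs, and the rule "same incident if the metrics are related and the time ranges overlap" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anomaly:
    """Hypothetical per-metric anomaly produced by a univariate detector."""
    metric: str
    start: int  # first anomalous timestamp
    end: int    # last anomalous timestamp


def overlaps(a, b):
    """True if the two anomalies share at least one timestamp."""
    return a.start <= b.end and b.start <= a.end


def group_into_incidents(anomalies, related):
    """Merge anomalies into incidents using union-find.

    Two anomalies are linked when they overlap in time and their metrics
    are the same or listed as a related pair (assumed grouping rule).
    """
    parent = list(range(len(anomalies)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, a in enumerate(anomalies):
        for j in range(i + 1, len(anomalies)):
            b = anomalies[j]
            linked = (a.metric == b.metric
                      or frozenset((a.metric, b.metric)) in related)
            if linked and overlaps(a, b):
                parent[find(i)] = find(j)

    incidents = {}
    for i, a in enumerate(anomalies):
        incidents.setdefault(find(i), []).append(a)
    return list(incidents.values())


# Assumed example: visitors and revenue are known to be related metrics.
related = {frozenset(("visitors", "revenue"))}
anomalies = [
    Anomaly("visitors", 100, 130),
    Anomaly("revenue", 110, 140),
    Anomaly("cpu.host7", 115, 120),
]
incidents = group_into_incidents(anomalies, related)
```

Here the visitors and revenue anomalies end up in one incident, while the unrelated host metric stays on its own. How the "related" relation is learned is outside this sketch.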

Page 20

Well defined incidents?

Yes - Supervised methods
• Requires a well defined set of incidents to identify
• Learning a model to classify samples as normal or abnormal
• Requires labeled examples of anomalies
• Cannot detect new types of incidents

No - Unsupervised methods
• Learning a normal model only
• Statistical test to detect anomalies
• Can detect any type of anomaly, known or unknown

Semi-Supervised methods
• Use a few labelled examples to improve detection of unsupervised methods.
• Or – use unsupervised detection for unknown cases, supervised detection to classify already known cases.

Page 21

Anomaly Detection Methods

Page 22

Unsupervised Anomaly Detection

General scheme

Step 1: Model the normal behavior of the metric(s) using a statistical model.
Step 2: Devise a statistical test to determine if samples are explained by the model.
Step 3: Apply the test for each sample. Flag as anomaly if it does not pass the test.

Page 23

Very Simple Model

[Figure: Normal distribution centered at μ; 68% of samples lie within ±1σ, 95.4% within ±2σ, and 99.7% within ±3σ]

• Assume normal behavior is the Normal distribution
• Estimate the average and standard deviation over all samples
• Test: any sample with |x - average| > 3 * standard deviation is abnormal
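A minimal sketch of this test, using only the Python standard library. The function name, the parameter `k`, and the sample values are made up for illustration.

```python
import statistics


def three_sigma_anomalies(samples, k=3.0):
    """Return indices of samples more than k standard deviations from the mean.

    The mean and standard deviation are estimated over all samples, as on
    the slide, so a large outlier also inflates the estimate it is tested
    against.
    """
    mean = statistics.fmean(samples)
    std = statistics.pstdev(samples)
    return [i for i, x in enumerate(samples) if abs(x - mean) > k * std]


# Made-up metric: values around 10 with one spike at the end.
samples = [10.0, 10.4, 9.7, 10.1, 9.9, 10.2, 9.8, 10.0, 10.3, 9.6,
           10.1, 9.9, 10.2, 9.8, 10.0, 25.0]
flagged = three_sigma_anomalies(samples)
```

With these values only the final spike (index 15) is flagged. With very few samples, a single outlier can never exceed three standard deviations under this estimate, one reason the model is "very simple".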

Page 24

A single model does not fit them all!

[Figure panels: Smooth (stationary), Irregular sampling, Multi Modal, Sparse, Discrete, "Step"]

Page 25

Metric types distribution

Based on 50,000,000 metrics sampled from dozens of companies (all industries)

Smooth: 38%
Irregular sampling: 37%
Discrete: 15%
Multi Modal: 5%
Sparse: 3%
Nearly constant: 2%

Page 26

Example: The importance of modeling seasonality

Single seasonal pattern

Page 27

Example: The importance of modeling seasonality

Multiple seasonal patterns (“Amplitude modulation”)

Page 28

Example: The importance of modeling seasonality

Multiple seasons – Additive signals

Page 29

Seasonality Distribution

Season: 24 hours, 69%
Season: Weekly, 26%
Season: 3 hours, 2%
Season: 12 hours, 1%
Season: 2 hours, 1%
Season: 1 hour, 1%
Season: 6 hours, 0.5%
Season: 4 hours, 0.2%
Season: 5 hours, 0.1%

Note: Only 14% of the metrics have a season

Page 30

Example Methods to detect seasonality

Finding maximums in autocorrelation of signal
• Computationally expensive
• More robust to gaps

Finding maximum(s) in Fourier transform of signal
• Challenging to detect low frequency seasons
• Challenging to discover multiple seasons
• Sensitive to missing data

Exhaustive search based on cost function
• Computationally expensive
• Robust to gaps
• Challenging to discover multiple seasons
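The autocorrelation method can be sketched as follows. This is a simplified version under stated assumptions: evenly spaced samples with no gaps, a single season, and a plain "largest local maximum" rule. The function names and the synthetic signal are illustrative only.

```python
import math


def autocorrelation(x, lag):
    """Biased sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var


def detect_season(x, max_lag):
    """Return the lag of the strongest autocorrelation peak, or None.

    Only local maxima at lag >= 2 are considered, so the trivial maximum
    at lag 0 is ignored.
    """
    acf = [autocorrelation(x, lag) for lag in range(max_lag + 1)]
    peaks = [lag for lag in range(2, max_lag)
             if acf[lag] > acf[lag - 1] and acf[lag] >= acf[lag + 1]]
    if not peaks:
        return None
    return max(peaks, key=lambda lag: acf[lag])


# Synthetic hourly signal with a 24-sample season, ten seasons long.
signal = [math.sin(2 * math.pi * t / 24) for t in range(24 * 10)]
period = detect_season(signal, max_lag=60)
```

The cost of this direct computation grows with `len(x) * max_lag`, which is where the "computationally expensive" remark comes from.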

Page 31

Real time detection @ scale = Online learning algorithms

1. Initialize model
2. For each new sample, test if it is an anomaly
3. Update model parameters with each new sample

Page 32

Example Online Models/Algorithms

• Simple Moving Average
• Single exponential forgetting
• Double/Triple exponential (Holt-Winters)
• Kalman Filters + ARIMA and variations

Page 33

Example: Simple exponential forgetting (Normal distribution model)

Define alpha – forgetting factor
Compute initial average, sumOfSquares using initial samples
For each new sample, x[t]:
    If |x[t] - average[t-1]| > 3 * Stddev[t-1]:
        Flag x[t] as an anomalous sample
    average[t] = alpha*x[t] + (1-alpha)*average[t-1]
    sumOfSquares[t] = alpha*x[t]^2 + (1-alpha)*sumOfSquares[t-1]
    Stddev[t] = sqrt(sumOfSquares[t] - average[t]^2)
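A runnable sketch of the pseudocode above. Two details are assumptions not stated on the slide: `sumOfSquares` is initialized as the mean of squares of the first `n_init` samples (so it is on the same scale as the exponentially weighted update), and the variance is clamped at zero to guard against floating-point round-off.

```python
import math


def exp_forgetting_detector(samples, alpha=0.1, n_init=10, k=3.0):
    """Flag anomalies with an exponentially forgetting Normal model.

    Returns the indices of samples more than k standard deviations away
    from the running average computed before that sample arrived.
    """
    init = samples[:n_init]
    average = sum(init) / len(init)
    sum_sq = sum(v * v for v in init) / len(init)  # assumed: mean of squares
    std = math.sqrt(max(sum_sq - average ** 2, 0.0))

    anomalies = []
    for t in range(n_init, len(samples)):
        x = samples[t]
        # Test against the model from the previous step.
        if abs(x - average) > k * std:
            anomalies.append(t)
        # Update as usual, whether or not the sample was flagged.
        average = alpha * x + (1 - alpha) * average
        sum_sq = alpha * x * x + (1 - alpha) * sum_sq
        std = math.sqrt(max(sum_sq - average ** 2, 0.0))
    return anomalies


# Made-up data: a stable metric around 10 with one spike at index 13.
samples = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10, 10, 11, 9, 30, 10]
flagged = exp_forgetting_detector(samples)
```

Only the spike at index 13 is flagged. The flagged sample still updates the model here, so the estimated standard deviation jumps right after the spike.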

Page 34

Update rate with online models: Avoiding pitfalls

What should be the learning rate?

Too Slow

Too Fast

Page 35

Update rate with online models: Avoiding pitfalls

What should be the learning rate?

“Al Dente”

Auto tuning required!

Page 36

Update rate with online models: Avoiding pitfalls

How to update a model when there is an anomaly?

Strategy A: Update as usual

Result: most of the anomaly is missed

Page 37

Update rate with online models: Avoiding pitfalls

How to update a model when there is an anomaly?

Result: full anomaly captured

Strategy B: Adapt the learning rate
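One way to read Strategy B is sketched below, reusing the exponential forgetting model. The slides do not say how the rate is adapted; using a much smaller forgetting factor (`anomaly_alpha`, a hypothetical parameter) for samples flagged as anomalous is an assumption made for this example. Setting `anomaly_alpha` equal to `alpha` reproduces Strategy A.

```python
import math


def detect_with_adaptive_rate(samples, alpha=0.1, anomaly_alpha=0.001,
                              n_init=10, k=3.0):
    """Exponential forgetting detector that slows learning during anomalies."""
    init = samples[:n_init]
    average = sum(init) / len(init)
    mean_sq = sum(v * v for v in init) / len(init)
    std = math.sqrt(max(mean_sq - average ** 2, 0.0))

    anomalies = []
    for t in range(n_init, len(samples)):
        x = samples[t]
        is_anomaly = abs(x - average) > k * std
        if is_anomaly:
            anomalies.append(t)
        # Assumed adaptation rule: learn far more slowly from anomalies.
        a = anomaly_alpha if is_anomaly else alpha
        average = a * x + (1 - a) * average
        mean_sq = a * x * x + (1 - a) * mean_sq
        std = math.sqrt(max(mean_sq - average ** 2, 0.0))
    return anomalies


# Made-up data: stable baseline followed by a five-sample incident.
baseline = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
incident = [30, 30, 30, 30, 30]

flagged_a = detect_with_adaptive_rate(baseline + incident, anomaly_alpha=0.1)
flagged_b = detect_with_adaptive_rate(baseline + incident)
```

On this data Strategy A flags only the first incident sample, because the model absorbs the jump immediately, while the adaptive variant flags all five. The reduced rate is kept above zero so that a permanent level change is still learned eventually.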

Page 38

Batch models

1. Collect historical samples
2. Segment samples into similarly behaving segments
3. Cluster segments according to some similarity measure
4. Mark as anomalies segments that are in small or no clusters
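The four batch steps can be sketched as below. Several simplifications are assumed: segments are fixed-width windows rather than behaviour-based segments, each segment is described by its mean and standard deviation, and clustering is a basic greedy "leader" scheme rather than any of the named methods such as K-Means or DBSCAN. All names and thresholds are illustrative.

```python
import math
import statistics


def segment(samples, width):
    """Step 2 (simplified): split the history into fixed-width windows."""
    return [samples[i:i + width]
            for i in range(0, len(samples) - width + 1, width)]


def features(seg):
    """Describe a segment by its mean and standard deviation."""
    return (statistics.fmean(seg), statistics.pstdev(seg))


def leader_cluster(points, radius):
    """Step 3: put each point in the first cluster whose leader is within radius."""
    leaders, members = [], []
    for idx, p in enumerate(points):
        for c, leader in enumerate(leaders):
            if math.dist(p, leader) <= radius:
                members[c].append(idx)
                break
        else:
            # No existing cluster is close enough: start a new one.
            leaders.append(p)
            members.append([idx])
    return members


def anomalous_segments(samples, width, radius, min_cluster_size):
    """Step 4: return indices of segments that fall in small clusters."""
    segs = segment(samples, width)
    clusters = leader_cluster([features(s) for s in segs], radius)
    return sorted(i for c in clusters if len(c) < min_cluster_size for i in c)


# Step 1: made-up history with one unusual window (segment index 5).
history = [10, 11, 9, 10] * 5 + [40, 42, 39, 41] + [10, 11, 9, 10] * 2
bad = anomalous_segments(history, width=4, radius=2.0, min_cluster_size=2)
```

The whole history is needed before anything can be flagged, which is why this style fits retroactive rather than real-time detection.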

Page 39

Example Batch Anomaly Detection Methods

Multi-modal distributions:
• Gaussian models
• Generalized mixture models

One-sided (one-class) SVM

PCA

Clustering methods (K-Means, DBSCAN, Mean-Shift)

MOST COMMON IN USE

Hidden Markov Models

Page 40

Anomaly detection methods - examples

NAME            ADAPTIVE?  REAL TIME?  SCALABLE?  UNI/MULTIVARIATE
Holt-Winters    Yes        Yes         Yes        Univariate
ARIMA + Kalman  Yes        Yes         Yes        Both
HMM             No         Yes         No         Multivariate
GMM             No         No          No         Both
DBSCAN          No         No          No         Multivariate
K-Means         No         No          No         Multivariate

Page 41

Large scale anomaly detection – the Anodot system

Page 42

Automatic Anomaly Detection in Five Steps: The Anodot Way

1. Metrics collection – universal, scale to millions
2. Normal behavior learning
3. Abnormal behavior learning
4. Behavioral topology learning
5. Feedback-based learning

Page 43

Large Scale Anomaly Detection System Architecture

[Architecture diagram. Component labels: Customer DS, Agent, Anodotd, REST, Kafka Events Queue, Online Base Line Learning, Aggregator, Anomaly Grouping, Signals Correlation Map, Real-Time Rollups Store, Cassandra, Elasticsearch, DWH S3, Hadoop, Hive, Offline Learning, WebApp, Management & Portal, Anodot-Web, User Mgmt, RDBMS]

Page 44

[email protected]

Thank you