Using Data Science to Automate Event Correlation - June 2016 - Dan Turchin - BigPanda

REDUCING ALERT NOISE WITH DATA SCIENCE

PREPARED FOR THE FUTURE STATE OF OPERATIONS MEETUP

DAN TURCHIN | @DTURCHIN | BIGPANDA | JUNE 2016

Upload: dan-turchin

Post on 22-Jan-2018




OBJECTIVES

1. Discuss why data’s eating the world

2. Share how data science is solving the noisy alert problem

3. Discuss the state of innovation… and our role in it

4. Learn from each other


SO WHO’S THE SHORT, NERDY DUDE?


BUT FIRST…

HTTPS://POLLEV.COM/DANTURCHIN744


DATA IS EATING THE WORLD

DATA SCIENCE

Using all available data to make better business decisions.

MACHINE LEARNING

Automating the use of statistics to infer future behavior from past results.


DATA SCIENCE + MACHINE LEARNING: CASE STUDY

WHY DON’T UPS TRUCKS MAKE LEFT TURNS?

• Fuel efficiency

• Maintenance records

• Accident reports

• Driver health data

• On-time deliveries

• Package returns

• Customer surveys

• Objective: improve service and reduce costs

• Hypotheses: minimize miles traveled, avoid rush hour

• Collect and analyze data

• Conclusion: only right turns!


DATA SCIENCE IS IMPACTING EVERY INDUSTRY

…AND IT OPS DESERVES CREDIT (AND BLAME)

“Applications and services are now critical for customer satisfaction. IT is no longer just a cost center. There are more hosts, applications and infrastructure are more complex, and expectations around availability and quality are more aggressive. More data is needed to deliver the same quality of service and often that data isn’t being collected or is hard to find. Legacy approaches to monitoring no longer work.”

JAMES TURNBULL, THE ART OF MONITORING


THE STATE OF MONITORING… IS POOR

• 80% AGREE THAT MONITORING IS STRATEGIC.

• 12% ARE SATISFIED WITH THEIR STRATEGY.

• 75% RECEIVE MORE THAN 50 ALERTS PER DAY.

• 31% OF THOSE WITH MTTR GREATER THAN 24 HOURS ARE SATISFIED WITH THEIR MONITORING STRATEGIES, VS. 63% WITH LOWER MTTR.

http://bit.ly/BP_SoM


… AND NOISE LEVELS ARE INCREASING

[Chart: alerts per month (y-axis, 0 to 18,000) by year (x-axis, 2000 to 2020)]


…BUT HEADCOUNT ISN’T

2000

• 5 incidents per engineer per day

• 96 minutes per incident

2020

• 400 incidents per engineer per day

• 1.2 minutes per incident
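The per-incident times above are consistent with dividing one working day evenly across that day's incidents. The slide does not state the length of the working day; the sketch below assumes eight hours (480 minutes), and the names `WORKDAY_MINUTES` and `minutes_per_incident` are illustrative only.

```python
# Assumption: an 8-hour working day. The slide does not state this,
# but 5 x 96 and 400 x 1.2 both come to 480 minutes.
WORKDAY_MINUTES = 8 * 60


def minutes_per_incident(incidents_per_day: float) -> float:
    """Time available per incident if the whole day is split evenly."""
    return WORKDAY_MINUTES / incidents_per_day


for year, incidents in [(2000, 5), (2020, 400)]:
    print(year, incidents, "incidents/day ->", minutes_per_incident(incidents), "minutes each")
```

Under that assumption, an 80-fold increase in incidents per engineer cuts the time available for each one by the same factor.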


IS THERE A BETTER WAY TO FIX PROBLEMS FASTER?

DETECT

INVESTIGATE

PREVENT

FIX


THREE POSSIBLE APPROACHES…

PEOPLE BOTS AUTOMATION


WHAT’S THE BEST WAY TO AUTOMATE EVENT CORRELATION?

HEURISTICS NLP

ASSISTED UNASSISTED ASSISTED UNASSISTED

• Optimal for: dynamic models where new inputs affect outputs

• Examples

  • Air pollution

  • InfoSec

  • IT Ops

• Optimal for: static models where known inputs have predictable outputs

• Examples

  • Migration patterns

  • Molecular sequencing

  • Mine detection


COR·RE·LA·TION ˌkôrəˈlāSH(ə)n

The extent to which two variables have a linear relationship.
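One common way to put a number on "the extent to which two variables have a linear relationship" is the Pearson correlation coefficient. The slides do not name a specific measure, so treat the sketch below as one possible reading; the function name `pearson` and the sample alert counts are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-minute alert counts from two monitors.
cpu_alerts = [1, 3, 2, 5, 4]
query_errors = [2, 6, 4, 10, 8]
r = pearson(cpu_alerts, query_errors)
```

A value near +1 or -1 indicates a strong linear relationship and a value near 0 indicates little or none. Two alerts can still be related without their counts moving together linearly.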

ARE THESE EVENTS RELATED… OR CORRELATED?

[Figure: alert events plotted across a 10-minute window]

HEURISTICS-BASED CORRELATION


ARE THESE EVENTS RELATED… OR CORRELATED?

[Figure: alert events plotted across a 10-minute window]

“WHENEVER THERE’S A CPU ISSUE IT’S FOLLOWED BY A QUERY ERROR AND A DISK I/O ISSUE WITHIN 5 MINUTES WHEN HOSTS ARE IN THE SAME CLUSTER.”

• CPU SPIKE

• LONG QUERY EXECUTION

• DISK I/O BUFFER

• SAME CLUSTER

HEURISTICS-BASED CORRELATION
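The quoted rule of thumb can be written down as a small check. This is a minimal sketch and not any product's implementation; the event tuple layout, the event kind names (`cpu_spike`, `long_query`, `disk_io`) and the function name `find_related` are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Hypothetical event kinds expected to follow a CPU spike.
FOLLOW_UPS = {"long_query", "disk_io"}


def find_related(events, window=timedelta(minutes=5)):
    """Group each CPU spike with its follow-up events.

    events: list of (timestamp, cluster, kind) tuples.
    A group is returned only when both follow-up kinds occur in the
    same cluster within `window` after the CPU spike.
    """
    groups = []
    for ts, cluster, kind in events:
        if kind != "cpu_spike":
            continue
        followers = [
            e for e in events
            if e[1] == cluster
            and e[2] in FOLLOW_UPS
            and timedelta(0) <= e[0] - ts <= window
        ]
        if {e[2] for e in followers} == FOLLOW_UPS:
            groups.append([(ts, cluster, kind)] + sorted(followers))
    return groups


t0 = datetime(2016, 6, 1, 12, 0)
sample = [
    (t0, "db-east", "cpu_spike"),
    (t0 + timedelta(minutes=2), "db-east", "long_query"),
    (t0 + timedelta(minutes=4), "db-east", "disk_io"),
    (t0 + timedelta(minutes=1), "db-west", "disk_io"),  # different cluster
]
related = find_related(sample)
```

Here the three `db-east` events form one group, and the `db-west` event is left out because it fails the same-cluster condition.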


SAMPLE HEURISTIC

DEFINITION: tag(s) + time window + filter = matching events

EXAMPLE: cluster + 30 minutes + source_system=api.* AND cluster NOT IN [“stage-*”] = matching events

All the alerts in an incident correlated by this rule will have the same cluster, the time between the creation of the first and most recent alert will be no more than 30 minutes, and all matching alerts will meet the filter conditions.
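The example rule can be sketched in a few lines to show how tag, time window and filter combine. This is an illustration under stated assumptions, not BigPanda's implementation: the `Alert` fields, the function names, and the reading of `api.*` and `stage-*` as glob patterns are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from fnmatch import fnmatch


@dataclass
class Alert:
    source_system: str
    cluster: str
    created: datetime


def passes_filter(alert):
    # source_system=api.* AND cluster NOT IN ["stage-*"],
    # with both patterns treated as globs (an assumption).
    return fnmatch(alert.source_system, "api.*") and not fnmatch(alert.cluster, "stage-*")


def correlate(alerts, window=timedelta(minutes=30)):
    """Group filtered alerts into incidents by cluster tag and time window."""
    incidents = []        # each incident is a list of alerts
    open_by_cluster = {}  # cluster -> incident currently accepting alerts
    for alert in sorted(filter(passes_filter, alerts), key=lambda a: a.created):
        incident = open_by_cluster.get(alert.cluster)
        # Open a new incident when none exists for this cluster, or when this
        # alert falls more than `window` after the incident's first alert.
        if incident is None or alert.created - incident[0].created > window:
            incident = []
            incidents.append(incident)
            open_by_cluster[alert.cluster] = incident
        incident.append(alert)
    return incidents


t0 = datetime(2016, 6, 1, 12, 0)
sample_alerts = [
    Alert("api.checkout", "prod-1", t0),
    Alert("api.search", "prod-1", t0 + timedelta(minutes=10)),
    Alert("api.checkout", "stage-1", t0 + timedelta(minutes=5)),  # excluded: staging cluster
    Alert("db.primary", "prod-1", t0 + timedelta(minutes=6)),     # excluded: not an api.* source
    Alert("api.checkout", "prod-1", t0 + timedelta(minutes=45)),  # outside the 30-minute window
]
sample_incidents = correlate(sample_alerts)
```

With this sample data the first two `prod-1` alerts share an incident, and the alert at 45 minutes starts a new one because it falls outside the window measured from the first alert.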


WHO CARES?

What should I work on next?

What’s about to break?

How does that impact the business?

PRIORITIZE

INVESTIGATE

PREVENT


YOUR CUSTOMERS CARE.

BEFORE

DURING

AFTER


WHAT’S AHEAD?


THE EXISTENTIAL THREAT…

…IS REAL


THREE… TWO… ONE…


WHY ARE WE HERE?

WHAT CAN ONLY WE DO?

CAN MACHINES THINK? SHOULD THEY?


AS PROBLEMS GET HARDER…

…WE RISE TO THE CHALLENGE


WILL WE AGAIN?

CONTAINERS CLOUDS MICROSERVICES

CI/CD DEVOPS



FURTHER LEARNING

O’REILLY DATA SHOW

THIS WEEK IN DATA


“THE BEST WAY OUT IS ALWAYS THROUGH.”

-ROBERT FROST

DAN TURCHIN | BIGPANDA

@DTURCHIN | (650)533-0918