TRANSCRIPT
Copyright © 2013 Splunk Inc.
Ron Naken Principal Engineer #splunkconf
Data Science
Legal Notices: During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.
Splunk, Splunk>, Splunk Storm, Listen to Your Data, SPL and The Engine for Machine Data are trademarks and registered trademarks of Splunk Inc. in the United States and other countries. All other brand names, product names, or trademarks belong to their respective owners.
©2013 Splunk Inc. All rights reserved.
2
About Me
! Developer and Security Professional for 14+ years
! Best known for integration work with AS/400, NetApp, and ServiceNow
! Studied psychology at the University of California, Irvine, and now helps customers to envision creative ways to apply operational intelligence, using mathematics to evoke "human thought" from data
3
About My Company
splunk> take the SH out of IT…
4
Agenda
Introduction: Data Science
! Why Learn It?
! What Should We Focus on During this Course?
! What is It?
Abnormal Behavior
! Detect Abnormal Behavior
! Calculate Dynamic Thresholds
Standardizing (ab)normal
! Correlate Seemingly Unrelated Data Sources
! Calculate Probability Without Complex Formulas
5
Agenda
Temporal Proximity
! Automatically Correlate Issues to Root-cause
! Splunk Just Did Your Job for You!
Relative Volume
! Detect Abnormal Data Volumes
! Find Questions When Our Data is Full of Answers
6
Introduction: Data Science – Why do we want to learn about data science?
It allows Splunk to THINK LIKE A HUMAN
and because Splunk can do our thinking, Splunk can do our work:
! Automatically correlate root-cause to incidents
! Automatically find abnormal errors or warnings
! Automatically find people doing abnormal things
This means we can ask Splunk to think things through for us – for example, deciding when CPU looks abnormal, even though normal CPU levels may vary by hour of the day or day of the week
* This is one example we will see later in the chapter
7
Introduction: Data Science – Keep this in mind as we go through the course:
Splunk is EASY:
! If something seems difficult or complex, we're probably overthinking it
! The only complexity lies in remembering not to overthink a problem
Statistics is EASY:
! Because Splunk does it for you!
! So while we're going to cover a moderate amount of it, just focus on understanding the benefits of what the formulas accomplish
8
Introduction: Data Science – What is data science?
Among its many meanings, we will focus on the following:
…build on techniques and theories – from many areas of study (e.g. mathematics, statistics, pattern recognition, etc.) – to extract meaning from data…
9
Abnormal Behavior
Introduction: Statistics
What is normal?...
http://www.mathsisfun.com/data/standard-deviation.html
11
Introduction: Statistics – Standard deviation can be used to detect 'normality':
Standard deviation (σ) = √variance
Variance (σ²) = sum of squared distances from the mean, divided by n:
σ² = ( ∑ i² ) / n, summed over i ∈ S, where i = distance from mean
Two types of standard deviation and variance:
Population – dataset represents the entire relevant 'world': stdevp(), varp()
Sample – dataset is a 'sample' from a larger relevant 'world': stdev(), var()
Sample-based variance and sample-based standard deviation divide by (n – 1) in place of (n)
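The definitions above can be sketched in Python (a quick illustration, not part of the original deck); the sample data here is the set of dog heights, in mm, from the mathsisfun.com page cited earlier:

```python
import math

def variance(data, sample=False):
    """Mean of the squared distances from the mean.
    Sample variance divides by (n - 1) instead of n."""
    n = len(data)
    mean = sum(data) / n
    squared_distances = sum((x - mean) ** 2 for x in data)
    return squared_distances / (n - 1 if sample else n)

def stdev(data, sample=False):
    """Standard deviation is the square root of the variance."""
    return math.sqrt(variance(data, sample))

heights = [600, 470, 170, 430, 300]           # dog heights (mm) from the cited page
print(variance(heights))                      # population variance (varp): 21704.0
print(round(stdev(heights), 1))               # population stdev (stdevp): 147.3
print(round(stdev(heights, sample=True), 1))  # sample stdev (stdev): 164.7
```

Splunk's stdevp()/varp() correspond to sample=False here, and stdev()/var() to sample=True.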
12
Introduction: Statistics
What is normal?...
http://www.mathsisfun.com/data/standard-deviation.html
[Chart: normal curve annotated at mean, mean + σ, and mean - σ]
13
(ab)normal Human Behavior
14
sourcetype=mmo:actions | bucket _time span=10m | stats count AS c BY _time toon | eventstats avg(c) AS mean stdevp(c) AS sdev | where c > (mean + sdev) | stats count sparkline BY toon | sort - count
Calculating abnormal heights within a population of dogs is the same as calculating "abnormal" human behavior
Look how simple the search is that identifies "botting" in a popular MMORPG (Massively Multiplayer Online Role-Playing Game)
This search exemplifies the simplicity with which Splunk can detect abnormal behavior. It makes the assumption that "abnormal" is defined as anything outside of 1 standard deviation. Later in the chapter, we will investigate z-values, which further simplify the data and allow us to make some universal assumptions.
(ab)normal Machine Behavior
15
sourcetype="WMI:CPUTime" | bucket _time span=10m | stats max(PercentProcessorTime) AS cpu_max BY _time date_wday date_hour | eventstats avg(cpu_max) AS max_avg stdevp(cpu_max) AS sd BY date_wday date_hour | eval cpu_ceil=max_avg + sd | eval earliest=relative_time(now(), "-1d@d") | eval latest=relative_time(now(), "@d") | where cpu_max > cpu_ceil AND _time > earliest AND _time < latest | fields - earliest latest
Calculating abnormal CPU utilization for a Windows machine is also the same
This example takes the maximum CPU utilization for every 10-minute window of a day and compares it against what we expect to be "normal" during the specific hour of the given day; we will see how to exclude today's data from the baseline in a later section on z-values and storage
Calculate max CPU utilization for each 10-minute window for each day of the week. Note the split by _time, day, and hour.
Add standard deviation and mean to the dataset. Note the split here is by hour and day, not including _time.
The table represents each time the CPU went to abnormal levels – this is our upcoming example; the chart uses the same search, with a modified "where" clause and "fields" command
Standardizing (ab)normal
Introduction: Statistics – Normalization: z-value | z-score | standard score
! Much of the mathematics assumes a normal distribution, where data points form a bell curve or Gaussian curve
– We can normalize data (normal or not) into a standard normal distribution or z-distribution
! z-value | z-score | standard score
– Unitless – compare across technologies
– Empirical rule – the 68-95-99.7 rule
– Access to a z-table for pre-calculated probability
– z = (x – u) / σ
(data point – mean) / (standard deviation)
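As a quick sketch (in Python; the CPU figures match the upcoming slide's example, while the bandwidth figures are invented for illustration), the formula turns any measurement into a unitless score:

```python
def z_score(x, mean, sd):
    """z = (x - u) / sigma: how many standard deviations x lies from the mean."""
    return (x - mean) / sd

# Unitless, so values from different technologies can be compared directly.
cpu_z = z_score(90, 75, 13.4)    # CPU utilization (%): ~1.119
mbps_z = z_score(480, 300, 120)  # bandwidth (MBps), invented numbers: 1.5
print(mbps_z > cpu_z)            # True -> the bandwidth reading is the bigger outlier
```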
18
CPU to Bandwidth: Malware
19
EXAMPLE Z-CORRELATION SEARCH: CPU % utilization in contrast to network volume with worm spewing on the wire
In this example, we compare CPU utilization (%) to bandwidth consumption (MBps); by converting the distributions to z-values, we have a normalized, unitless measure to compare
Introduction: Statistics – Terminology often used:
! nth Percentile – n% of the data points fall beneath this data point. For instance, the Empirical Rule (68-95-99.7) states that 95% of data points fall within 2 standard deviations of the mean; only about 2.5% lie above +2σ, so 2 standard deviations above the mean sits at roughly the 97.5th percentile of a standard normal distribution
– 95th percentile: 95% of the data points are below this value
! Quartiles Q1, Q2, Q3 – represent the most common percentiles
– Q1 = First quartile = 25th percentile
– Q2 = Second quartile = 50th percentile
– Q3 = Third quartile = 75th percentile
If you score 2100 on the SAT, did you do well? While this may represent the 96th percentile for 2012, how does it compare for this year?
20
Introduction: Statistics
21
With our data standardized into z-values, forming a standard normal distribution, the z-table provides a universal matrix of probability – the chance a value will fall below a given point
! Download one from the internet and use it as a lookup table!
What is the chance my CPU will be above 90%? u (mean) = 75, σ (standard deviation) = 13.4
z = (90 – 75) / 13.4 = 1.119
* closest match = 1.12
How to use a z-table: match x.x in the left column, and the 2nd decimal place in the top row
86.86% chance to be below 90%
1 - .8686 = .1314 = 13.14% chance to be above 90%!
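The z-table lookup can also be done in code. A standard normal CDF built from math.erf (a sketch, not part of the deck) reproduces the slide's numbers without a printed table:

```python
import math

def phi(z):
    """Standard normal CDF: the probability a z-value falls below z.
    Stands in for a printed z-table lookup."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

z = (90 - 75) / 13.4   # ~1.119, matching the slide's example
below = phi(z)         # ~0.8686 -> chance CPU is below 90%
above = 1 - below      # ~0.1314 -> chance CPU is above 90%
```

phi(1.12) evaluates to about 0.8686, matching the z-table row the slide looks up.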
Introduction: Statistics – For the following example, which exam shows better performance?
Math: 91% History: 62%
When we are graded on a scale from 1-100, this is an easy answer; however, if we are graded on a curve, we don't know how well we did without more information
RESULTS: Math = (91 – 90) / 1.1 : z = .909 History = (62 – 60) / 1.5 : z = 1.33
u=90, σ=1.1 u=60, σ=1.5
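A quick Python check of the arithmetic (illustrative only) confirms the curve-graded result:

```python
def z_score(x, mean, sd):
    return (x - mean) / sd

math_z = z_score(91, 90, 1.1)     # ~0.909 standard deviations above the class mean
history_z = z_score(62, 60, 1.5)  # ~1.333 standard deviations above the class mean
# On a curve, the 62% History score is the stronger performance:
print(history_z > math_z)  # True
```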
22
Introduction: Statistics
Can I do this without a z-table lookup or complex math?
z-Table
24
Introduction: Statistics
25
Empirical Rule – applies to any normal distribution
! 68% of data points are within 1 standard deviation
! 95% of data points are within 2 standard deviations
! 99.7% of data points are within 3 standard deviations
This becomes important once our data is normalized to z-values, because we now have a standard normal distribution and can make assumptions
! Only 5% of the data points lie more than 2 standard deviations from the mean
! Only .3% of the data points lie more than 3 standard deviations from the mean
percentile – x% of data points fall below the value
* To create general alerts that find the "needle in the haystack", or rare outliers, it is simple to look for z-values, based on the volume of an event, that are near or greater than 3 standard deviations
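The Empirical Rule can be verified directly with math.erf (a quick Python check, not from the deck): the fraction of a normal distribution within k standard deviations of the mean is erf(k/√2):

```python
import math

def within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

print(round(within(1), 3))   # 0.683  -> the "68" of 68-95-99.7
print(round(within(2), 3))   # 0.954  -> the "95"
print(round(within(3), 4))   # 0.9973 -> the "99.7"
```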
Needle in a Haystack
26
EXAMPLE ERROR ANALYSIS: Alert on needle in haystack from an overgeneralized search; use the Empirical Rule to find outliers
Temporal Proximity
Temporal Proximity – There are many ways we can ask Splunk to slice-and-dice, correlate, or map time; in this chapter, we're going to investigate a technique that is so easy, "Even a caveman can do it!"
! Given events that contain a timestamp, the _time field represents an EPOCH timestamp – seconds since 01/01/1970 – of when the event occurred
– We can slice this into time windows and use it for correlation between events
! Here are some examples:
– | eval minute_window = round(_time / 60, 0)
– | eval ten_minute_window = round(_time / (60 * 10), 0)
– | eval hour_window = round(_time / (60 * 60), 0)
– | eval day_window = round(_time / (60 * 60 * 24), 0)
! Using one of these windows is like saying, "Show me everything that happened on Tuesday"; we can say, "correlate all errors that occurred within this ten_minute_window", or "on this day_window", etc.
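The window arithmetic is plain integer math; this Python sketch, with hypothetical epoch timestamps, mimics the round(_time / seconds, 0) pattern from the examples above:

```python
def time_window(epoch, seconds):
    """Mimic SPL's `eval window = round(_time / seconds, 0)`."""
    return round(epoch / seconds)

# Two hypothetical events 90 seconds apart share a ten-minute (600 s) window,
# so a stats split by this value would correlate them:
a = time_window(1376900000, 600)
b = time_window(1376900090, 600)
print(a == b)  # True -> same window, eligible for correlation
```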
28
Time-Correlated Errors
29
(sourcetype="cdn:generic" complete_percent < 100) OR `errors` | eval epoch=round(_time/(60 * 5), 0) | eval correlated_issues=if(sourcetype == "cdn:generic", null(), sourcetype + " | " + _raw) | eval error_time=if(sourcetype == "cdn:generic", strftime(_time, "%m-%d-%y %H:%M"), null()) | stats list(error_time) AS error_time list(product) AS product list(correlated_issues) AS correlated_issues BY epoch | search error_time=* product=* correlated_issues=* | sort - epoch | fields - epoch
Search for all incomplete CDN downloads and infrastructure errors for the time period. Calculate a 5-minute EPOCH time window on each event.
Combine the events using the STATS command and a SPLIT BY our EPOCH time window.
WOAH! Splunk just did my job for me! The two left columns represent a point in time where our CDN did not complete delivery to a customer; the right column represents the back-end infrastructure issues that occurred during the same time window
Relative Volume
Relative Volume – We already discussed how we could use z-values to define abnormality; this applies to data volumes as well. In this chapter, we will discuss other methods that can be used to identify abnormal volumes of data
! z-values – these provide a normalized method of identifying outliers in data volumes
! Relative ratios – these can help us visually identify abnormal data volumes
! Cluster command – a search command that groups similar events and calculates quantity
– Use the t=<0–1> parameter to determine sensitivity
– Use match=(termlist | termset | ngramset)
! Abnormal behavior in our data can help us identify the right questions to ask
32
z-value: 3 Week Volume vs. Today
33
`infrastructure_data` earliest=-1mon@d latest=@h | bucket _time span=1h | eval day=if(_time >= relative_time(now(), "-7d@d"), "this", "past") | stats count AS c BY _time date_wday date_mday date_hour sourcetype day | eval c_tmp=c | eval c=if(day == "this", null(), c) | eventstats mean(c) AS m stdevp(c) AS sd BY date_wday date_hour sourcetype | rename c_tmp AS c | where (day = "this") AND (date_mday = tonumber(strftime(now(), "%d"))) | eval z=(c - m)/sd | xyseries _time sourcetype z | eval Hour=strftime(_time, "%H") | fields - _time
This search calculates z-values for data-source volumes, contrasting the past 7 days to the rest of the past month.
NOTE: The last 7 days are excluded from baseline calculations. This represents a clear issue; remember that z-values are standardized, and according to the Empirical Rule, 99.7% of our data points should be within 3 standard deviations
Relative Ratios
35
`infrastructure_data` earliest=-1d@d latest=@h | where date_hour < tonumber(strftime(now(), "%H")) | eval day=if(_time < relative_time(now(), "@d"), "yesterday", "today") | stats count AS c BY date_hour sourcetype day | eval yesterday=if(day == "yesterday", c, 0) | eval today=if(day == "today", c, 0) | stats max(yesterday) AS yesterday max(today) AS today BY date_hour sourcetype | eval ratio=if(today >= yesterday, today/yesterday, -yesterday/today) | fields - yesterday today | rename date_hour AS Hour | xyseries Hour sourcetype ratio | sort + Hour
Relative ratios can help to identify abnormal behavior in the volumes of our data sources. The following search exemplifies the magnitude of volume changes when plotted on a chart.
Columns above 0 represent the magnitude of additional volume for today, whereas columns below 0 represent the magnitude of yesterday's data compared to today.
Yesterday’s volume was considerably higher than today’s for a number of devices; this is a possible indicator there was an outage today
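The signed ratio from the search's eval can be sketched in Python (illustrative values; note that, as written, both counts are assumed to be non-zero):

```python
def relative_ratio(today, yesterday):
    """Signed magnitude of the volume change, as in the search's eval:
    positive when today's volume is higher, negative when yesterday's was."""
    if today >= yesterday:
        return today / yesterday
    return -yesterday / today

print(relative_ratio(300, 100))  # 3.0  -> today has 3x yesterday's volume
print(relative_ratio(100, 400))  # -4.0 -> yesterday had 4x today's (possible outage)
```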
The Cluster Command
37
`infra_ops` | cluster t=.8 field=_raw match=termset | table _raw cluster_count | sort + cluster_count
The cluster command groups similar events and allows for a quick-and-dirty discovery of rare events
Summary – We learned how to:
! Identify abnormal human behavior
! Identify abnormal machine behavior and calculate dynamic thresholds
! Correlate abnormality across dissimilar data types using z-values
! Automatically correlate root-cause to incidents
In summary, we learned:
! How to make Splunk think like a human
! How to find questions when our data is full of answers
Most importantly, we learned:
! Splunk is EASY
! Statistics is HARD, but it's EASY in Splunk!
38
Next Steps
39
1. Download the .conf2013 Mobile App – if not iPhone, iPad or Android, use the Web App
2. Take the survey & WIN A PASS FOR .CONF2014… or one of these bags!
3. Go to "How to Use Dynamic Drilldown" – Nolita 2, Level 4 – Today, 3-4pm
THANK YOU
Introduction: Statistics – Resistance to Outliers
median – the value separating the higher and lower halves of a sample
! If the set contains an even number of values, we normally average the two middle values.
mode – the most common value in a sample
! sample set: 1, 2, 2, 2, 3, 20
! median = 2, mode = 2, mean (average) = 5
– Sometimes it makes sense to use median in place of mean, in order to account for outliers. This can be important when the sample is small
SIDE NOTE: mode can be important in retail analytics to understand things like which shirt/pant size is the most common (restocking)
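Python's statistics module (a quick check of the sample set above, added for illustration) confirms the numbers:

```python
import statistics

sample = [1, 2, 2, 2, 3, 20]
print(statistics.median(sample))  # 2.0 -- average of the two middle values
print(statistics.mode(sample))    # 2   -- the most common value
print(statistics.mean(sample))    # 5   -- the outlier (20) drags the mean upward
```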
41