Time Series Analysis for Network Security


DESCRIPTION

How Endgame is using the scientific computing stack in Python to find anomalies in network flow data.

TRANSCRIPT


Time Series Analysis for Network Security

Phil Roth, Data Scientist @ Endgame

mrphilroth.com


First, an introduction. My history of Python scientific computing, in function calls:


os.path.walk

Physics Undergraduate @ PSU

AMANDA Neutrino Telescope


pylab.plot

Physics Graduate Student @ UMD

IceCube Neutrino Telescope


numpy.fft.fft

Radar Scientist @ User Systems, Inc.

Various Radar Simulations


pandas.io.parsers.read_csv

Side Projects

Scraping data from the web


sklearn.linear_model.LogisticRegression

Side Projects

Machine learning competitions


(the rest of this talk…)

Data Scientist @ Endgame

Time Series Anomaly Detection


Problem: Highlight when recorded metrics deviate from normal patterns.

For example: a high number of connections might be an indication of a brute force attack.

For example: a large volume of outgoing data might be an indication of an exfiltration event.


Solution: Build a system that can track and store historical records of any metric. Develop an algorithm that will detect irregular behavior with minimal false positives.


Gathering Data: kairos, kafka-python, pyspark

Building Models: classification, ewma, arima


Data flow: data sources feed Kafka Topics (distributed message passing). The real-time stream is written to Redis (an in-memory key-value data store), while batch/historical data lands in HDFS (a large-scale distributed data store).


kairos

A Python interface to backend storage databases (redis in my case, others available) tailored for time series storage.

Takes care of expiring data and different types of time series (series, histogram, count, gauge, set).

Open sourced by Agora Games.

https://github.com/agoragames/kairos


kairos

Example code:

from redis import Redis
from kairos import Timeseries

intervals = {"days"   : {"step" : 60,   "steps" : 2880},
             "months" : {"step" : 1800, "steps" : 4032}}

rclient = Redis("localhost", 6379)
ktseries = Timeseries(rclient, type="histogram", intervals=intervals)

ktseries.insert(metric_name, metric_value, timestamp)


kafka-python

A Python interface to Apache Kafka, where Kafka is publish-subscribe messaging rethought as a distributed commit log.

Allows me to subscribe to the events as they come in real time.

https://github.com/mumrah/kafka-python


kafka-python

Example code:

from kafka.client import KafkaClient
from kafka.consumer import SimpleConsumer

kclient = KafkaClient("localhost:9092")
kconsumer = SimpleConsumer(kclient, "timevault", "rawmsgs")

for message in kconsumer :
    insert_to_kairos(message)


pyspark

A Python interface to Apache Spark, where Spark is a fast and general engine for large-scale data processing.

Allows me to backfill the time series with historical data when I add or modify metrics.

http://spark.apache.org/


pyspark

Example code:

from pyspark import SparkContext, SparkConf

spark_conf = (SparkConf()
              .setMaster("localhost")
              .setAppName("timevault-update"))
sc = SparkContext(conf=spark_conf)

rdd = (sc.textFile(hdfs_files)
         .map(insert_to_kairos)
         .count())


pyspark

Example code:

from json import loads
import timevault as tv
from functools import partial
from pyspark import SparkContext, SparkConf

spark_conf = (SparkConf()
              .setMaster("localhost")
              .setAppName("timevault-update"))
sc = SparkContext(conf=spark_conf)

rdd = (sc.textFile(tv.conf.hdfs_files)
         .map(loads)
         .flatMap(tv.flatten_message)
         .flatMap(partial(tv.emit_metrics, metrics=tv.metrics_to_emit))
         .filter(lambda tup : tup[2] < float(tv.conf.limit_time))
         .mapPartitions(partial(tv.insert_to_kairos, conf=tv.conf))
         .count())


the end result

from pandas import DataFrame, to_datetime

series = ktseries.series(metric_name, "months", transform=transform)
ts, fields = zip(*series.items())
df = DataFrame({"data" : fields}, index=to_datetime(ts, unit="s"))
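From there the usual pandas tooling applies directly. A minimal sketch (hypothetical; it assumes the transform above reduces each histogram bucket to a single number, and it uses the 2014-era resample API):

# continuing from the df built above: resample the buckets to hourly totals
hourly = df["data"].resample("H", how="sum")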


building models

The first naïve model is simply the mean and standard deviation across all time.

blue: actual number of connections; green: prediction window; red: actual value exceeded standard deviation limit
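The slide shows this as a plot; as a rough sketch (not the talk's code; the column names and threshold are assumed to match the later examples), the first model amounts to:

import numpy as np

def mean_std_outliers(tsdf, stdlimit=5) :
    # single mean and standard deviation computed across all time
    mean = tsdf["conns"].mean()
    std = tsdf["conns"].std()

    # the prediction window is a constant band; flag anything outside it
    tsdf["conns_binpred"] = mean
    tsdf["conns_binstd"] = std
    tsdf["conns_stds"] = (tsdf["conns"] - mean) / std
    tsdf["conns_outlier"] = tsdf["conns_stds"].abs() > stdlimit
    return tsdf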


building models

The second, slightly less naïve model fits a sine curve to the whole series.

blue: actual number of connections; green: prediction window; red: actual value exceeded standard deviation limit
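A comparable sketch of the second model (again not the talk's code; it borrows the daily sine form that appears later in the classification section, with assumed column names and threshold):

import numpy as np
from scipy.optimize import leastsq

def sine_fit_outliers(tsdf, stdlimit=5) :
    # seconds into the day for each sample
    t = (np.asarray(tsdf.index.hour) * 3600 +
         np.asarray(tsdf.index.minute) * 60 +
         np.asarray(tsdf.index.second))
    y = tsdf["conns"].values.astype(float)

    # constant level modulated by a sine with a one-day period
    def fitfunc(p, x) :
        return p[0] * (1 - p[1] * np.sin(2 * np.pi / (24 * 3600) * (x + p[2])))

    def residuals(p, y, x) :
        return y - fitfunc(p, x)

    p0 = np.array([y.mean(), 1.0, 0.0])
    plsq, _ = leastsq(residuals, p0, args=(y, t))

    # use the fitted curve as the prediction and the residual spread as the band
    pred = fitfunc(plsq, t)
    resid_std = (y - pred).std()
    tsdf["conns_binpred"] = pred
    tsdf["conns_stds"] = (y - pred) / resid_std
    tsdf["conns_outlier"] = np.abs(tsdf["conns_stds"]) > stdlimit
    return tsdf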


classification

Both naïve models left a lot to be desired. Two simple classifications would help us treat different types of time series appropriately:

Does this metric show a weekly pattern (i.e. different behavior on weekends versus weekdays)?

Does this metric show a daily pattern?


classification weekly

Fit a sine curve to the weekday and weekend periods. Use the ratio of the levels of those fits to determine whether weekdays will be divided from weekends.


classification weekly

import numpy as np
from scipy.optimize import leastsq

def fitfunc(p, x) :
    return (p[0] * (1 - p[1] * np.sin(2 * np.pi / (24 * 3600) * (x + p[2]))))

def residuals(p, y, x) :
    return y - fitfunc(p, x)

def fit(tsdf) :
    # average the metric by time of day, then fit the daily sine to it
    tsgb = tsdf.groupby(tsdf.timeofday).mean()
    p0 = np.array([tsgb["conns"].mean(), 1.0, 0.0])
    plsq, suc = leastsq(residuals, p0, args=(tsgb["conns"], np.array(tsgb.index)))
    return plsq


classification weekly

import pandas as pd

def weekend_ratio(tsdf) :
    # label each sample as weekday/weekend and by time of day in seconds
    tsdf['weekday'] = pd.Series(tsdf.index.weekday < 5, index=tsdf.index)
    tsdf['timeofday'] = (tsdf.index.second + tsdf.index.minute * 60 +
                         tsdf.index.hour * 3600)

    wdayplsq = fit(tsdf[tsdf.weekday == 1])
    wendplsq = fit(tsdf[tsdf.weekday == 0])

    return wendplsq[0] / wdayplsq[0]

The ratio is compared against a cutoff: values between the cutoff and 1 / cutoff (i.e. near 1) mean no weekly variation, while values outside that range indicate a weekly pattern.
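A small helper along those lines (a sketch; the 0.8 cutoff is an assumed value, not from the talk):

def weekly_pattern(tsdf, cutoff=0.8) :
    # ratios far from 1 indicate different weekend and weekday levels
    ratio = weekend_ratio(tsdf)
    return ratio < cutoff or ratio > 1.0 / cutoff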


classification weekly

Example plots: one metric showing a weekly pattern, one showing no weekly pattern.


classification daily

Take a Fourier transform of the time series, and inspect the bins associated with a frequency of a day. Use the ratio of those bins to the first (constant or DC component) in order to classify the time series.


classification daily

Time series on weekdays, showing a strong daily pattern. Fourier transform with the bins around the day frequency highlighted.


classification daily

Time series on weekends, showing no daily pattern. Fourier transform with the bins around the day frequency highlighted.


classification daily

Find the bin associated with the frequency of a day using daybin = (1 / 86400) / Δf, where the bin width is Δf = 1 / (nbins · Δt):

def daily_ratio(tsdf) :
    nbins = len(tsdf)
    deltat = (tsdf.index[1] - tsdf.index[0]).seconds
    deltaf = 1.0 / (nbins * deltat)
    daybin = int((1.0 / (24 * 3600)) / deltaf)

    # ratio of the power near the day frequency to the DC component
    rfft = np.abs(np.fft.rfft(tsdf["conns"]))
    daily_ratio = np.sum(rfft[daybin - 1:daybin + 2]) / rfft[0]
    return daily_ratio


ewma

Exponentially weighted moving average: the prediction at each point is a weighted average of the previous values, with weights that fall off geometrically by a factor of (1 − α) per step back in time.

The decay parameter is specified as a span, s, in pandas, related to α by:

α = 2 / (s + 1)
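A quick check of that relationship (a hedged sketch on a made-up series; it uses pd.ewma, the pandas API of the time, which has since been replaced by Series.ewm):

import numpy as np
import pandas as pd

span = 15
alpha = 2.0 / (span + 1)          # 0.125

x = pd.Series(np.random.randn(200))

# explicit exponentially decaying weights, newest sample weighted highest
weights = (1 - alpha) ** np.arange(len(x))[::-1]
manual = np.sum(weights * x.values) / weights.sum()

# should agree with the last value pandas produces
# (in current pandas: x.ewm(span=span).mean().iloc[-1])
pandas_val = pd.ewma(x, span=span).iloc[-1]
print(manual, pandas_val)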

A normal EWMA analysis is done when the metric shows no daily pattern. A stacked EWMA analysis is done when there is a daily pattern.


ewma normal

def ewma_outlier(tsdf, stdlimit=5, span=15) :
    # exponentially weighted mean and standard deviation, shifted by one
    # step so each point is predicted only from earlier data
    tsdf['conns_binpred'] = pd.ewma(tsdf['conns'], span=span).shift(1)
    tsdf['conns_binstd'] = pd.ewmstd(tsdf['conns'], span=span).shift(1)

    # flag points more than stdlimit standard deviations from the prediction
    tsdf['conns_stds'] = ((tsdf['conns'] - tsdf['conns_binpred']) /
                          tsdf['conns_binstd'])
    tsdf['conns_outlier'] = (tsdf['conns_stds'].abs() > stdlimit)

    return tsdf


ewma normal

blue: actual number of connections; green: prediction window; red: actual value exceeded standard deviation limit


ewma normal

blue: actual response size; green: prediction window; red: actual value exceeded standard deviation limit


ewma stacked

[Plots illustrating the stacked EWMA analysis.]

ewma stacked

Shift the EWMA results by a day and overlay them on the original DataFrame.

def stacked_outlier(tsdf, stdlimit=4, span=10) :
    # track a separate EWMA for each time-of-day bin across days
    gbdf = tsdf.groupby('timeofday')['conns']
    gbdf = pd.DataFrame({'conns_binpred' : gbdf.apply(pd.ewma, span=span),
                         'conns_binstd' : gbdf.apply(pd.ewmstd, span=span)})

    # shift by one day's worth of bins so each point is predicted from
    # previous days, then overlay on the original DataFrame
    interval = tsdf.timeofday[1] - tsdf.timeofday[0]
    nshift = int(86400.0 / interval)
    gbdf = gbdf.shift(nshift)
    tsdf = gbdf.combine_first(tsdf)

    tsdf['conns_stds'] = ((tsdf['conns'] - tsdf['conns_binpred']) /
                          tsdf['conns_binstd'])
    tsdf['conns_outlier'] = (tsdf['conns_stds'].abs() > stdlimit)

    return tsdf


ewma stacked

blue: actual number of connections; green: prediction window; red: actual value exceeded standard deviation limit


arima

I am currently investigating using ARIMA (autoregressive integrated moving average) models to make better predictions.

I’m not convinced that this level of detail is necessary for the analysis I’m doing, but I wanted to highlight another cool scientific computing library that’s available.


arima

from statsmodels.tsa.arima_model import ARIMA

def arima_model_forecast(tsdf, p, d, q) :
    # fit on everything but the last point, then forecast that point
    arima_model = ARIMA(tsdf["conns"][:-1], (p, d, q)).fit()
    forecast, stderr, conf_int = arima_model.forecast(1)

    tsdf["conns_binpred"][-1] = forecast[0]
    tsdf["conns_binstd"][-1] = stderr[0]

    return tsdf
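Hypothetical usage, forecasting one step ahead with the p = d = q = 1 parameters used for the plot below; presumably this is re-run as each new point arrives:

tsdf = arima_model_forecast(tsdf, 1, 1, 1)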


arima

blue: actual number of connections; green: prediction window; red: actual value exceeded standard deviation limit

p = d = q = 1


takeaways

Python provides simple and usable interfaces to most data handling projects.

Combined, these interfaces can create a full data analysis pipeline from collection to analysis.

© 2014 Endgame
