Time Series Analysis for Network Security


DESCRIPTION

How Endgame is using the scientific computing stack in Python to find anomalies in network flow data.

TRANSCRIPT


Time Series Analysis for Network Security

Phil Roth, Data Scientist @ Endgame

mrphilroth.com


First, an introduction. My history of Python scientific computing, in function calls:


os.path.walk

Physics Undergraduate @ PSU

AMANDA Neutrino Telescope


pylab.plot

Physics Graduate Student @ UMD

IceCube Neutrino Telescope


numpy.fft.fft

Radar Scientist @ User Systems, Inc.

Various Radar Simulations


pandas.io.parsers.read_csv

Side Projects

Scraping data from the web


sklearn.linear_model.LogisticRegression

Side Projects

Machine learning competitions


(the rest of this talk…)

Data Scientist @ Endgame

Time Series Anomaly Detection


Problem: Highlight when recorded metrics deviate from normal patterns.

For example: a high number of connections might be an indication of a brute force attack.

For example: a large volume of outgoing data might be an indication of an exfiltration event.


Solution: Build a system that can track and store historical records of any metric. Develop an algorithm that will detect irregular behavior with minimal false positives.


Gathering Data: kairos, kafka-python, pyspark

Building Models: classification, ewma, arima


Data flow: data sources feed Kafka Topics (distributed message passing). The real-time stream is written to Redis (an in-memory key-value data store), while batch/historical data lands in HDFS (a large-scale distributed data store).


kairos

A Python interface to backend storage databases (redis in my case, others available) tailored for time series storage.

Takes care of expiring data and different types of time series (series, histogram, count, gauge, set).

Open sourced by Agora Games.

https://github.com/agoragames/kairos


kairos

Example code:

from redis import Redis
from kairos import Timeseries

intervals = {"days"   : {"step" : 60,   "steps" : 2880},
             "months" : {"step" : 1800, "steps" : 4032}}

rclient = Redis("localhost", 6379)
ktseries = Timeseries(rclient, type="histogram", intervals=intervals)

ktseries.insert(metric_name, metric_value, timestamp)


kafka-python

A Python interface to Apache Kafka, where Kafka is publish-subscribe messaging rethought as a distributed commit log.

Allows me to subscribe to the events as they come in real time.

https://github.com/mumrah/kafka-python


kafka-python

Example code:

from kafka.client import KafkaClient
from kafka.consumer import SimpleConsumer

kclient = KafkaClient("localhost:9092")
kconsumer = SimpleConsumer(kclient, "timevault", "rawmsgs")

for message in kconsumer :
    insert_to_kairos(message)


pyspark

A Python interface to Apache Spark, where Spark is a fast and general engine for large-scale data processing.

Allows me to backfill the time series with historical data when I add or modify metrics.

http://spark.apache.org/


pyspark

Example code:

from pyspark import SparkContext, SparkConf

spark_conf = (SparkConf()
              .setMaster("localhost")
              .setAppName("timevault-update"))
sc = SparkContext(conf=spark_conf)

rdd = (sc.textFile(hdfs_files)
         .map(insert_to_kairos)
         .count())


pyspark

Example code:

from json import loads
import timevault as tv
from functools import partial
from pyspark import SparkContext, SparkConf

spark_conf = (SparkConf()
              .setMaster("localhost")
              .setAppName("timevault-update"))
sc = SparkContext(conf=spark_conf)

rdd = (sc.textFile(tv.conf.hdfs_files)
         .map(loads)
         .flatMap(tv.flatten_message)
         .flatMap(partial(tv.emit_metrics, metrics=tv.metrics_to_emit))
         .filter(lambda tup : tup[2] < float(tv.conf.limit_time))
         .mapPartitions(partial(tv.insert_to_kairos, conf=tv.conf))
         .count())


the end result

from pandas import DataFrame, to_datetime

series = ktseries.series(metric_name, "months", transform=transform)
ts, fields = zip(*series.items())
df = DataFrame({"data" : fields}, index=to_datetime(ts, unit="s"))
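From there the usual pandas tooling applies directly. A minimal sketch (hypothetical; it assumes the transform above reduces each histogram bucket to a single number, and it uses the 2014-era resample API):

# continuing from the df built above: resample the buckets to hourly totals
hourly = df["data"].resample("H", how="sum")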


building models

The first naïve model is simply the mean and standard deviation across all time.

blue: actual number of connections; green: prediction window; red: actual value exceeded standard deviation limit
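The slide shows this as a plot; as a rough sketch (not the talk's code; the column names and threshold are assumed to match the later examples), the first model amounts to:

import numpy as np

def mean_std_outliers(tsdf, stdlimit=5) :
    # single mean and standard deviation computed across all time
    mean = tsdf["conns"].mean()
    std = tsdf["conns"].std()

    # the prediction window is a constant band; flag anything outside it
    tsdf["conns_binpred"] = mean
    tsdf["conns_binstd"] = std
    tsdf["conns_stds"] = (tsdf["conns"] - mean) / std
    tsdf["conns_outlier"] = tsdf["conns_stds"].abs() > stdlimit
    return tsdf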


building models

The second, slightly less naïve model fits a sine curve to the whole series.

blue: actual number of connections; green: prediction window; red: actual value exceeded standard deviation limit
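A comparable sketch of the second model (again not the talk's code; it borrows the daily sine form that appears later in the classification section, with assumed column names and threshold):

import numpy as np
from scipy.optimize import leastsq

def sine_fit_outliers(tsdf, stdlimit=5) :
    # seconds into the day for each sample
    t = (np.asarray(tsdf.index.hour) * 3600 +
         np.asarray(tsdf.index.minute) * 60 +
         np.asarray(tsdf.index.second))
    y = tsdf["conns"].values.astype(float)

    # constant level modulated by a sine with a one-day period
    def fitfunc(p, x) :
        return p[0] * (1 - p[1] * np.sin(2 * np.pi / (24 * 3600) * (x + p[2])))

    def residuals(p, y, x) :
        return y - fitfunc(p, x)

    p0 = np.array([y.mean(), 1.0, 0.0])
    plsq, _ = leastsq(residuals, p0, args=(y, t))

    # use the fitted curve as the prediction and the residual spread as the band
    pred = fitfunc(plsq, t)
    resid_std = (y - pred).std()
    tsdf["conns_binpred"] = pred
    tsdf["conns_stds"] = (y - pred) / resid_std
    tsdf["conns_outlier"] = np.abs(tsdf["conns_stds"]) > stdlimit
    return tsdf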


classification

Both naïve models left a lot to be desired. Two simple classifications would help us treat different types of time series appropriately:

Does this metric show a weekly pattern (i.e. different behavior on weekends versus weekdays)?

Does this metric show a daily pattern?


classification weekly

Fit a sine curve to the weekday and weekend periods. Use the ratio of the levels of those fits to determine whether weekdays will be divided from weekends.


classification weekly

import numpy as np
from scipy.optimize import leastsq

def fitfunc(p, x) :
    return (p[0] * (1 - p[1] * np.sin(2 * np.pi / (24 * 3600) * (x + p[2]))))

def residuals(p, y, x) :
    return y - fitfunc(p, x)

def fit(tsdf) :
    # average the metric by time of day, then fit the daily sine to it
    tsgb = tsdf.groupby(tsdf.timeofday).mean()
    p0 = np.array([tsgb["conns"].mean(), 1.0, 0.0])
    plsq, suc = leastsq(residuals, p0, args=(tsgb["conns"], np.array(tsgb.index)))
    return plsq


classification weekly

import pandas as pd

def weekend_ratio(tsdf) :
    # label each sample as weekday/weekend and by time of day in seconds
    tsdf['weekday'] = pd.Series(tsdf.index.weekday < 5, index=tsdf.index)
    tsdf['timeofday'] = (tsdf.index.second + tsdf.index.minute * 60 +
                         tsdf.index.hour * 3600)

    wdayplsq = fit(tsdf[tsdf.weekday == 1])
    wendplsq = fit(tsdf[tsdf.weekday == 0])

    return wendplsq[0] / wdayplsq[0]

The ratio is compared against a cutoff: values between the cutoff and 1 / cutoff (i.e. near 1) mean no weekly variation, while values outside that range indicate a weekly pattern.
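A small helper along those lines (a sketch; the 0.8 cutoff is an assumed value, not from the talk):

def weekly_pattern(tsdf, cutoff=0.8) :
    # ratios far from 1 indicate different weekend and weekday levels
    ratio = weekend_ratio(tsdf)
    return ratio < cutoff or ratio > 1.0 / cutoff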


classification weekly

Example plots: one metric showing a weekly pattern, one showing no weekly pattern.


classification daily

Take a Fourier transform of the time series, and inspect the bins associated with a frequency of a day. Use the ratio of those bins to the first (constant or DC component) in order to classify the time series.


classification daily

Time series on weekdays, showing a strong daily pattern. Fourier transform with the bins around the day frequency highlighted.


classification daily

Time series on weekends, showing no daily pattern. Fourier transform with the bins around the day frequency highlighted.


classification daily

Find the bin associated with the frequency of a day using daybin = (1 / 86400) / Δf, where the bin width is Δf = 1 / (nbins · Δt):

def daily_ratio(tsdf) :
    nbins = len(tsdf)
    deltat = (tsdf.index[1] - tsdf.index[0]).seconds
    deltaf = 1.0 / (nbins * deltat)
    daybin = int((1.0 / (24 * 3600)) / deltaf)

    # ratio of the power near the day frequency to the DC component
    rfft = np.abs(np.fft.rfft(tsdf["conns"]))
    daily_ratio = np.sum(rfft[daybin - 1:daybin + 2]) / rfft[0]
    return daily_ratio


ewma

Exponentially weighted moving average: the prediction at each point is a weighted average of the previous values, with weights that fall off geometrically by a factor of (1 − α) per step back in time.

The decay parameter is specified as a span, s, in pandas, related to α by:

α = 2 / (s + 1)
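A quick check of that relationship (a hedged sketch on a made-up series; it uses pd.ewma, the pandas API of the time, which has since been replaced by Series.ewm):

import numpy as np
import pandas as pd

span = 15
alpha = 2.0 / (span + 1)          # 0.125

x = pd.Series(np.random.randn(200))

# explicit exponentially decaying weights, newest sample weighted highest
weights = (1 - alpha) ** np.arange(len(x))[::-1]
manual = np.sum(weights * x.values) / weights.sum()

# should agree with the last value pandas produces
# (in current pandas: x.ewm(span=span).mean().iloc[-1])
pandas_val = pd.ewma(x, span=span).iloc[-1]
print(manual, pandas_val)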

A normal EWMA analysis is done when the metric shows no daily pattern. A stacked EWMA analysis is done when there is a daily pattern.


ewma normal

def ewma_outlier(tsdf, stdlimit=5, span=15) :
    # exponentially weighted mean and standard deviation, shifted by one
    # step so each point is predicted only from earlier data
    tsdf['conns_binpred'] = pd.ewma(tsdf['conns'], span=span).shift(1)
    tsdf['conns_binstd'] = pd.ewmstd(tsdf['conns'], span=span).shift(1)

    # flag points more than stdlimit standard deviations from the prediction
    tsdf['conns_stds'] = ((tsdf['conns'] - tsdf['conns_binpred']) /
                          tsdf['conns_binstd'])
    tsdf['conns_outlier'] = (tsdf['conns_stds'].abs() > stdlimit)

    return tsdf


ewma normal

blue: actual number of connections; green: prediction window; red: actual value exceeded standard deviation limit


ewma normal

blue: actual response size; green: prediction window; red: actual value exceeded standard deviation limit


ewma stacked

[Plots illustrating the stacked EWMA analysis.]

ewma stacked

Shift the EWMA results by a day and overlay them on the original DataFrame.

def stacked_outlier(tsdf, stdlimit=4, span=10) :
    # track a separate EWMA for each time-of-day bin across days
    gbdf = tsdf.groupby('timeofday')['conns']
    gbdf = pd.DataFrame({'conns_binpred' : gbdf.apply(pd.ewma, span=span),
                         'conns_binstd' : gbdf.apply(pd.ewmstd, span=span)})

    # shift by one day's worth of bins so each point is predicted from
    # previous days, then overlay on the original DataFrame
    interval = tsdf.timeofday[1] - tsdf.timeofday[0]
    nshift = int(86400.0 / interval)
    gbdf = gbdf.shift(nshift)
    tsdf = gbdf.combine_first(tsdf)

    tsdf['conns_stds'] = ((tsdf['conns'] - tsdf['conns_binpred']) /
                          tsdf['conns_binstd'])
    tsdf['conns_outlier'] = (tsdf['conns_stds'].abs() > stdlimit)

    return tsdf


ewma stacked

blue: actual number of connections; green: prediction window; red: actual value exceeded standard deviation limit


arima

I am currently investigating using ARIMA (autoregressive integrated moving average) models to make better predictions.

I’m not convinced that this level of detail is necessary for the analysis I’m doing, but I wanted to highlight another cool scientific computing library that’s available.


arima

from statsmodels.tsa.arima_model import ARIMA

def arima_model_forecast(tsdf, p, d, q) :
    # fit on everything but the last point, then forecast that point
    arima_model = ARIMA(tsdf["conns"][:-1], (p, d, q)).fit()
    forecast, stderr, conf_int = arima_model.forecast(1)

    tsdf["conns_binpred"][-1] = forecast[0]
    tsdf["conns_binstd"][-1] = stderr[0]

    return tsdf
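Hypothetical usage, forecasting one step ahead with the p = d = q = 1 parameters used for the plot below; presumably this is re-run as each new point arrives:

tsdf = arima_model_forecast(tsdf, 1, 1, 1)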


arima

blue: actual number of connections; green: prediction window; red: actual value exceeded standard deviation limit

p = d = q = 1


takeaways

Python provides simple and usable interfaces to most data handling projects.

Combined, these interfaces can create a full data analysis pipeline from collection to analysis.

© 2014 Endgame
