

Kitsune: An Ensemble of Autoencoders for Online Network Intrusion Detection

Yisroel Mirsky, Tomer Doitshman, Yuval Elovici and Asaf Shabtai
Ben-Gurion University of the Negev

{yisroel, tomerdoi}@post.bgu.ac.il, {elovici, shabtaia}@bgu.ac.il

Abstract—Neural networks have become an increasingly popular solution for network intrusion detection systems (NIDS). Their capability of learning complex patterns and behaviors makes them a suitable solution for differentiating between normal traffic and network attacks. However, a drawback of neural networks is the amount of resources needed to train them. Many network gateways and routers, which could potentially host an NIDS, simply do not have the memory or processing power to train and sometimes even execute such models. More importantly, existing neural network solutions are trained in a supervised manner, meaning that an expert must label the network traffic and update the model manually from time to time.

In this paper, we present Kitsune: a plug-and-play NIDS which can learn to detect attacks on the local network, without supervision, and in an efficient online manner. Kitsune's core algorithm (KitNET) uses an ensemble of neural networks called autoencoders to collectively differentiate between normal and abnormal traffic patterns. KitNET is supported by a feature extraction framework which efficiently tracks the patterns of every network channel. Our evaluations show that Kitsune can detect various attacks with a performance comparable to offline anomaly detectors, even on a Raspberry PI. This demonstrates that Kitsune can be a practical and economical NIDS.

Keywords—Anomaly detection, network intrusion detection, online algorithms, autoencoders, ensemble learning.

I. INTRODUCTION

The number of attacks on computer networks has been increasing over the years [1]. A common security system used to secure networks is a network intrusion detection system (NIDS). An NIDS is a device or software which monitors all traffic passing a strategic point for malicious activities. When such an activity is detected, an alert is generated and sent to the administrator. Conventionally, an NIDS is deployed at a single point, for example, at the Internet gateway. This point deployment strategy can detect malicious traffic entering and leaving the network, but not malicious traffic traversing the network itself. To resolve this issue, a distributed deployment strategy can be used, where a number of NIDSs are connected to a set of strategic routers and gateways within the network.

Over the last decade many machine learning techniques have been proposed to improve detection performance [2], [3], [4]. One popular approach is to use an artificial neural network (ANN) to perform the network traffic inspection. The benefit of using an ANN is that ANNs are good at learning complex non-linear concepts in the input data. This gives ANNs a great advantage in detection performance with respect to other machine learning algorithms [5], [2].

The prevalent approach to using an ANN as an NIDS is to train it to classify network traffic as being either normal or some class of attack [6], [7], [8]. The following shows the typical approach to using an ANN-based classifier in a point deployment strategy:

1) Have an expert collect a dataset containing both normal traffic and network attacks.

2) Train the ANN to classify the difference between normal and attack traffic, using a strong CPU or GPU.

3) Transfer a copy of the trained model to the network/organization's NIDS.

4) Have the NIDS execute the trained model on the observed network traffic.

In general, a distributed deployment strategy is only practical if the number of NIDSs can economically scale according to the size of the network. One approach to achieve this goal is to embed the NIDSs directly into inexpensive routers (i.e., with simple hardware). We argue that it is impractical to use ANN-based classifiers with this approach for several reasons:

Offline Processing. In order to train a supervised model, all labeled instances must be available locally. This is infeasible on a simple network gateway since a single hour of traffic may contain millions of packets. Some works propose offloading the data to a remote server for model training [9], [3]. However, this solution may incur significant network overhead, and does not scale.

Supervised Learning. The labeling process takes time and is expensive. More importantly, what is considered to be normal depends on the local traffic observed by the NIDS. Furthermore, attacks change over time and new ones are constantly being discovered [10], so continuous maintenance of a malicious attack traffic repository may be impractical. Finally, classification is a closed-world approach to identifying concepts. In other words, a classifier is trained to identify the classes provided in the training set. However, it is unreasonable to assume that all possible classes of malicious traffic can be collected and placed in the training data.

High Complexity. The computational complexity of an ANN

Permission to freely reproduce all or part of this paper for noncommercial purposes is granted provided that copies bear this notice and the full citation on the first page. Reproduction for commercial purposes is strictly prohibited without the prior written consent of the Internet Society, the first-named author (for reproduction of an entire paper only), and the author's employer if the paper was prepared within the scope of employment.
NDSS '18, 18-21 February 2018, San Diego, CA, USA
Copyright 2018 Internet Society, ISBN 1-1891562-49-5
http://dx.doi.org/10.14722/ndss.2018.23204

arXiv:1802.09089v2 [cs.CR] 27 May 2018


[Figure 1: the features of an instance are mapped to an Ensemble Layer of autoencoders; each autoencoder outputs an RMSE, and the RMSEs feed an Output Layer autoencoder which produces the final anomaly score.]

Fig. 1: An illustration of Kitsune's anomaly detection algorithm KitNET.

grows exponentially with the number of neurons [11]. This means that an ANN which is deployed on a simple network gateway is restricted in terms of its architecture and the number of input features it can use. This is especially problematic on gateways which handle high-velocity traffic.

In light of the challenges listed above, we suggest that the development of an ANN-based network intrusion detector, which is to be deployed and trained on routers in a distributed manner, should adhere to the following restrictions:

Online Processing. After training on or executing the model with an instance, the instance is immediately discarded. In practice, a small number of instances can be stored at any given time, as done in stream clustering [12].

Unsupervised Learning. Labels, which indicate explicitly whether a packet is malicious or benign, are not used in the training process. Other meta information can be used so long as acquiring the information does not delay the process.

Low Complexity. The packet processing rate must exceed the expected maximum packet arrival rate. In other words, we must ensure that there is no queue of packets awaiting to be processed by the model.

In this paper, we present Kitsune: a novel ANN-based NIDS which is online, unsupervised, and efficient. A Kitsune, in Japanese folklore, is a mythical fox-like creature that has a number of tails, can mimic different forms, and whose strength increases with experience. Similarly, Kitsune has an ensemble of small neural networks (autoencoders), which are trained to mimic (reconstruct) network traffic patterns, and whose performance incrementally improves over time.

The architecture of Kitsune's anomaly detection algorithm (KitNET) is illustrated in Fig. 1. First, the features of an instance are mapped to the visible neurons of the ensemble. Next, each autoencoder attempts to reconstruct the instance's features, and computes the reconstruction error in terms of root mean squared error (RMSE). Finally, the RMSEs are forwarded to an output autoencoder, which acts as a non-linear voting mechanism for the ensemble. We note that while training Kitsune, no more than one instance is stored in memory at a time. KitNET has one main parameter, which is the maximum number of inputs for any given autoencoder in the ensemble. This parameter is used to increase the algorithm's speed with a modest trade-off in detection performance.
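The two-layer scoring scheme can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature-to-autoencoder mapping is fixed by hand here (Kitsune learns it online), training is omitted, and all class and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyAutoencoder:
    """A 3-layer autoencoder with tied weights; reports reconstruction RMSE."""
    def __init__(self, n_in, n_hidden, rng):
        r = 1.0 / n_in
        self.W = rng.uniform(-r, r, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.b2 = np.zeros(n_in)

    def rmse(self, x):
        h = sigmoid(self.W @ x + self.b1)        # encode
        x_hat = sigmoid(self.W.T @ h + self.b2)  # decode (W(2) = [W(1)]^T)
        return np.sqrt(np.mean((x - x_hat) ** 2))

class KitNETSketch:
    """Ensemble Layer of small AEs; an Output Layer AE scores their RMSEs."""
    def __init__(self, feature_groups, rng):
        self.groups = feature_groups             # index lists, <= m features each
        self.ensemble = [TinyAutoencoder(len(g), max(1, len(g) // 2), rng)
                         for g in feature_groups]
        k = len(feature_groups)
        self.output = TinyAutoencoder(k, max(1, k // 2), rng)

    def score(self, x):
        # each small AE reconstructs its slice of x and reports an RMSE
        rmses = np.array([ae.rmse(np.asarray(x)[g])
                          for ae, g in zip(self.ensemble, self.groups)])
        # the output AE "votes" non-linearly over the vector of RMSEs
        return self.output.rmse(rmses)

rng = np.random.default_rng(0)
net = KitNETSketch([[0, 1, 2], [3, 4], [5, 6, 7]], rng)
anomaly_score = net.score(rng.random(8))
```

Note that only one instance is held in memory at a time: `score` consumes `x` and returns a scalar, matching the online constraint above.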

The reason we use autoencoders is that (1) they can be trained in an unsupervised manner, and (2) a poor reconstruction can be used to signal an anomaly. The reason we propose using an ensemble of small autoencoders is that they are more efficient and can be less noisy than a single autoencoder over the same feature space. From our experiments, we found that Kitsune can increase the packet processing rate by a factor of five, and provide a detection performance which rivals offline (batch) anomaly detectors.

In summary, the contributions of this paper are as follows:

• A novel autoencoder-based NIDS for simple network devices (Kitsune), which is lightweight and plug-and-play. To the best of our knowledge, we are the first to propose the use of autoencoders, with or without ensembles, for online anomaly detection in computer networks. We also present the core algorithm (KitNET) as a generic online unsupervised anomaly detection algorithm, and provide the source code for download.1

• A feature extraction framework for dynamically maintaining and extracting implicit contextual features from network traffic. The framework has a small memory footprint since the statistics are updated incrementally over damped windows.

• An online technique for automatically constructing the ensemble of autoencoders (i.e., mapping features to ANN inputs) in an unsupervised manner. The method involves the incremental hierarchical clustering of the feature-space (transpose of the unbounded dataset), and bounding of cluster sizes.

• Experimental results on an operational IP camera video surveillance network, an IoT network, and a wide variety of attacks. We also demonstrate the algorithm's efficiency, and ability to run on a simple router, by performing benchmarks on a Raspberry PI.

The rest of the paper is organized as follows: Section II discusses related work in the domain of online anomaly detection. Section III provides a background on autoencoders and how they work. Section IV presents Kitsune's framework and its entire machine learning pipeline. Section V presents experimental results in terms of detection performance and run-time performance. Finally, in Section VII we present our conclusion.

II. RELATED WORK

The domain of using machine learning (specifically anomaly detection) for implementing NIDSs was extensively researched in the past [13], [14], [15], [16], [17]. However, these solutions usually make no assumptions about the resources of the machine training or executing the model, and are therefore either too expensive to train and execute on simple gateways, or require a labeled dataset for the training process.

Several previous works have proposed online anomaly detection mechanisms using different lightweight algorithms. For example, the PAYL IDS models simple histograms

1 The source code for KitNET is available for download at: https://github.com/ymirsky/KitNET-py



of packet content [18], or the kNN algorithm [19]. These methods are either very simple, and therefore produce very poor results, or require accumulating data for the training or detection.

A popular algorithm for network intrusion detection is the ANN. This is because of its ability to learn complex concepts, as well as the concepts from the domain of network communication [17]. In [20], the authors evaluated the ANN, among other classification algorithms, in the task of network intrusion detection, and proposed a solution based on an ensemble of classifiers using connection-based features. In [8], the authors presented a modification to the back-propagation algorithm to increase the speed of an ANN's training process. In [7], the authors used multiple ANN-based classifiers, where each one was trained to detect a specific type of attack. In [9], the authors proposed a hierarchical method where each packet first passes through an anomaly detection model; then, if an anomaly is raised, the packet is evaluated by a set of ANN classifiers, where each classifier is trained to detect a specific attack type.

All of the aforementioned papers which use ANNs are either supervised, or are not suitable for a simple network gateway. In addition, some of the works assume that the training data can be stored and accumulated, which is not the case for simple network gateways. Our solution enables a plug-and-play deployment which can operate at much faster speeds than the aforementioned models.

With regards to the use of autoencoders: In [21], the authors used an ensemble of deep neural networks to address object tracking in the online setting. Their proposed method uses a stacked denoising autoencoder (SDAE). Each layer of the SDAE serves as a different feature space for the raw image data. The scheme transforms each layer of the SDAE into a deep neural network which is used as a discriminative binary classifier. Although the authors apply autoencoders in an online setting, they did not perform anomaly detection, nor address the challenge of real-time processing (which is a great challenge with deep neural networks). Furthermore, training a deep neural network is complex and cannot be practically performed on a simple network device. In [22] and [23], the authors propose the use of autoencoders to extract features from datasets in order to improve the detection of cyber threats. However, the autoencoders themselves were not used for anomaly detection. Ultimately, the authors use classifiers to detect the cyber threats. Therefore, their solution requires an expert to label instances, whereas our solution is unsupervised, and plug-and-play.

In [24], the authors proposed the generic use of an autoencoder for detecting anomalies. In [19], the authors use autoencoders to detect anomalies in power grids. These works differ from ours because (1) they are not online, (2) the architecture used by the authors is not lightweight and scalable like an ensemble, and (3) they have not been applied to network intrusion detection. We note that part of this paper's contribution is an appropriate feature extraction framework, which enables the use of autoencoders in the online network setting.

III. BACKGROUND: AUTOENCODERS

Autoencoders are the foundational building blocks of Kitsune. In this section we provide a brief introduction to autoencoders: what they are, and how they work. To describe the training and execution of an autoencoder, we will refer to the example in Fig. 2.

[Figure 2: an example autoencoder with input ~x ∈ R3, layers l(1), l(2), l(3), and learned parameters W = {W(1), W(2)} and b = {b(1), b(2)}; the output is x̂ = hW,b(~x), where W(2) = [W(1)]T.]

Fig. 2: An example autoencoder with one compression layer, which reconstructs instances with three features.

A. Artificial Neural Networks

ANNs are made up of layers of neurons, where each layer is connected sequentially via synapses. The synapses have associated weights which collectively define the concepts learned by the model. Concretely, let l(i) denote the i-th layer in the ANN, and let ‖l(i)‖ denote the number of neurons in l(i). Finally, let the total number of layers in the ANN be denoted as L. The weights which connect l(i) to l(i+1) are denoted as the ‖l(i)‖-by-‖l(i+1)‖ matrix W(i) and the ‖l(i+1)‖-dimensional bias vector ~b(i). Finally, we denote the collection of all parameters θ as the tuple θ ≡ (W, b), where W and b are the weights and biases of each layer respectively. Fig. 2 illustrates how the weights form the synapses of each layer in an ANN.

There are two kinds of layers in an ANN: visible layers and hidden layers. The visible layer receives the input instance ~x with an additional bias variable (a constant value of 1). ~x is a vector of numerical features which describes the instance, and is typically normalized to fall approximately within the range [−1, +1] (e.g., using 0-1 normalization or z-score normalization) [25]. The difference between the visible layer and the hidden layers is that the visible layer is considered to be precomputed, and ready to be passed to the second layer.

B. Executing an ANN

To execute an ANN, l(2) is activated with the output of l(1) (i.e., ~x) weighted with W(1); then the output of l(2) weighted with W(2) is used to activate l(3), and so on until the final layer has been activated. This process is known as forward-propagation. Let ~a(i) be the ‖l(i)‖-dimensional vector of outputs from the neurons in l(i). To obtain ~a(i+1), we pass ~a(i) through l(i+1) by computing

~a(i+1) = f(W(i) · ~a(i) + ~b(i))    (1)

where f is the neuron's activation function. A common activation function, and the one we use in Kitsune, is the sigmoid function, defined as

f(~x) = 1 / (1 + e^(−~x))    (2)
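Equations (1) and (2) can be applied layer by layer in a few lines of NumPy. This is an illustrative sketch (the function names and the tiny 3-2-3 network are assumptions, not the paper's code):

```python
import numpy as np

def sigmoid(x):
    # Eq. (2): f(x) = 1 / (1 + e^(-x)), applied element-wise
    return 1.0 / (1.0 + np.exp(-x))

def forward_propagate(theta, x):
    """Eq. (1): a(i+1) = f(W(i) . a(i) + b(i)), for each layer in turn.

    theta is a list of (W, b) pairs; returns all activations and y' = a(L).
    """
    activations = [x]
    a = x
    for W, b in theta:
        a = sigmoid(W @ a + b)
        activations.append(a)
    return activations, a

# A hypothetical 3-2-3 network shaped like the autoencoder of Fig. 2
rng = np.random.default_rng(0)
theta = [(rng.uniform(-1, 1, (2, 3)), np.zeros(2)),
         (rng.uniform(-1, 1, (3, 2)), np.zeros(3))]
acts, y_out = forward_propagate(theta, np.array([0.2, 0.5, 0.8]))
```

Keeping every layer's activations around (`acts`) is what makes the backward-propagation step of training possible later.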



Algorithm 1: The back-propagation algorithm for performing batch-training of an ANN.

procedure trainGD(θ, X, Y, max_iter):
1   θ ← U(−1/‖l(1)‖, 1/‖l(1)‖)    ▷ random initialization
2   cur_iter ← 0
3   while cur_iter ≤ max_iter do
4       A, Y′ ← hθ(X)             ▷ forward propagation
5       deltas ← bθ(Y, Y′)        ▷ backward propagation
6       θ ← GDℓ(A, deltas)        ▷ weight update
7       cur_iter++
8   end
9   return θ

Finally, we define the output of the last layer to be denoted as ~y′ = ~a(L). Let the function h be the full layer-wise forward-propagation from the input ~x until the final output ~y′, denoted

hθ(~x) = ~y′ (3)

C. Training an ANN

The output of an ANN depends on the training set, and the training algorithm. A training dataset is composed of instances ~x ∈ X and the respective expected outputs ~y ∈ Y (e.g., the labels in classification). During training, the weights of the ANN are tuned so that hθ(~x) = ~y. A common algorithm for training an ANN (i.e., finding the optimum W and ~b given X, Y) is known as the back-propagation algorithm [6].

The back-propagation algorithm involves a step where the error between the predicted ~y′ and the expected ~y is propagated from the output to each neuron. During this process, the errors between a neuron's actual and expected activations are stored. We denote this backward-propagation step as the function bθ(Y, Y′). Given some execution of hθ, let A be every neuron's activation and let Delta be the activation errors of all neurons.

Given the current A and Delta, and a set learning rate ℓ ∈ (0, 1], we can incrementally optimize W and b by performing the Gradient Descent (GD) algorithm [26]. All together, the back-propagation algorithm is given in Algorithm 1.

The above process is referred to as batch-training. This is because for each iteration, GD updates the weights according to the collective errors of all instances in X. Another approach is stochastic-training, where Stochastic Gradient Descent (SGD) is used instead of GD. In SGD, the weights are updated according to the errors of each instance individually. With SGD, the back-propagation algorithm becomes Algorithm 2.

The difference between GD and SGD is that GD converges on an optimum better than SGD, but SGD initially converges faster [27]. For a more elaborate explanation of the back-propagation algorithm, we refer the reader to [26].

In Kitsune, we use SGD with a max_iter of 1. In other words, while we are in the training phase, if a new instance arrives we perform a single iteration of the inner loop in Algorithm 2, discard the instance, and then wait for the next. This way we learn only once from each observed instance, and remain an online algorithm.

Algorithm 2: The back-propagation algorithm for performing stochastic-training of an ANN.

procedure trainSGD(θ, X, Y, max_iter):
1   θ ← U(−1/‖l(1)‖, 1/‖l(1)‖)    ▷ random initialization
2   cur_iter ← 0
3   while cur_iter ≤ max_iter do
4       for xi in X do
5           A, ~y′ ← hθ(xi)       ▷ forward propagation
6           deltas ← bθ(~y, ~y′)  ▷ backward propagation
7           θ ← GDℓ(A, deltas)    ▷ weight update
8       end
9       cur_iter++
10  end
11  return θ

D. Autoencoders

An autoencoder is an artificial neural network which is trained to reconstruct its inputs (i.e., X = Y). Concretely, during training, an autoencoder tries to learn the function

hθ(~x) ≈ ~x (4)

It can be seen that an autoencoder is essentially trying to learn the identity function of the original data distribution. Therefore, constraints are placed on the network, forcing it to learn more meaningful concepts and relationships between the features in ~x. The most common constraint is to limit the number of neurons in the inner layers of the network. The narrow passage causes the network to learn compact encodings and decodings of the input instances.

As an example, Fig. 2 illustrates an autoencoder which (1) receives an instance ~x ∈ R3 at layer l(1), (2) encodes (compresses) ~x at layer l(2), and (3) decodes the compressed representation of ~x at layer l(3). If an autoencoder is symmetric in layer sizes, the same (mirrored) weights can be used for encoding and decoding [28]. This trick reduces the number of calculations needed during training. For example, in Fig. 2, W(2) = [W(1)]T.

E. Anomaly Detection with Autoencoders

Autoencoders have been used for many different machine learning tasks, for example, generating new content [29] and filtering noise out of images [30]. In this paper, we are interested in using autoencoders for anomaly detection.

In general, an autoencoder trained on X gains the capability to reconstruct unseen instances from the same data distribution as X. If an instance does not belong to the concepts learned from X, then we expect the reconstruction to have a high error. The reconstruction error of the instance ~x for a given autoencoder can be computed by taking the root mean squared error (RMSE) between ~x and the reconstructed output ~y′. The RMSE between two vectors is defined as

RMSE(~x, ~y) = √( Σ_{i=1}^{n} (x_i − y_i)² / n )    (5)

where n is the dimensionality of the input vectors.



Let φ be the anomaly threshold, with an initial value of −1, and let β ∈ [1, ∞) be some given sensitivity parameter. One can apply an autoencoder to the task of anomaly detection by performing the following steps:

1) Training Phase: Train an autoencoder on clean (normal) data. For each instance ~xi in the training set X:

   a) Execute: s = RMSE(~x, hθ(~x))
   b) Update: if (s ≥ φ) then φ ← s
   c) Train: update θ by learning from ~xi

2) Execution Phase: When an unseen instance ~x arrives:

   a) Execute: s = RMSE(~x, hθ(~x))
   b) Verdict: if (s ≥ φβ) then alert

The process by which Kitsune performs anomaly detection over an ensemble of autoencoders will be detailed later in Section IV-D.
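The two phases above (track the largest normal RMSE as φ, then alert when a new RMSE exceeds φβ) can be sketched independently of any particular autoencoder. The wrapper class, the `execute`/`learn` interface, and the mean-predicting stand-in model below are all hypothetical names for illustration:

```python
import numpy as np

def rmse(x, y):
    # Eq. (5): root mean squared reconstruction error
    return float(np.sqrt(np.mean((np.asarray(x) - np.asarray(y)) ** 2)))

class ThresholdedAE:
    """Wraps any model exposing execute(x) -> x_hat and learn(x).

    phi tracks the largest RMSE seen on clean training data (initially -1);
    at execution time, an alert is raised when RMSE >= phi * beta.
    """
    def __init__(self, ae, beta=1.5):
        self.ae, self.phi, self.beta = ae, -1.0, beta

    def train(self, x):
        s = rmse(x, self.ae.execute(x))     # 1a) execute
        if s >= self.phi:                   # 1b) update threshold
            self.phi = s
        self.ae.learn(x)                    # 1c) learn from x, then discard it

    def execute(self, x):
        s = rmse(x, self.ae.execute(x))     # 2a) execute
        return s >= self.phi * self.beta    # 2b) verdict: True -> alert

# Demo with a stand-in "autoencoder" that reconstructs the running mean
class MeanModel:
    def __init__(self):
        self.mu, self.n = None, 0
    def execute(self, x):
        return x if self.mu is None else self.mu
    def learn(self, x):
        self.n += 1
        self.mu = x.copy() if self.mu is None else self.mu + (x - self.mu) / self.n

rng = np.random.default_rng(0)
det = ThresholdedAE(MeanModel(), beta=2.0)
for _ in range(20):                         # training phase on "normal" data
    det.train(np.array([0.5, 0.5]) + rng.normal(0.0, 0.01, 2))
```

After training, an instance far from the normal distribution (e.g., `[5.0, 5.0]`) yields an RMSE well above φβ and triggers an alert, while instances near the training data do not.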

F. Complexity

In order to activate the layer l(i+1), one must perform the matrix multiplication W(i) · ~a(i) as described in (1). Therefore, the complexity of activating layer l(i+1) is O(‖l(i)‖ · ‖l(i+1)‖).2

Therefore, the total complexity of executing an ANN depends on the number of layers, and the number of neurons in each layer. The complexity of training an ANN on a single instance using SGD (Algorithm 2) is roughly double the complexity of execution. This is because of the backward-propagation step.

We note that autoencoders can be deep (have many hidden layers). In general, deeper and wider networks can learn concepts which are more complex. However, as shown above, deep networks can be computationally expensive to train and execute. This is why in KitNET we ensure that each autoencoder is limited to three layers with at most seven visible neurons.
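To make the cost argument concrete, the sketch below counts the multiplications in one forward pass of a monolithic autoencoder versus an ensemble of small ones. The hidden-layer size (half the input size) and the specific n and m values are illustrative assumptions, not figures from the paper:

```python
def ae_mults(n_in, n_hid):
    # multiplications for one forward pass of a 3-layer autoencoder:
    # encode (n_hid x n_in matrix) + decode (n_in x n_hid matrix)
    return 2 * n_in * n_hid

def monolithic_vs_ensemble(n, m):
    """Compare one n-input AE against n/m AEs of m inputs plus an output AE."""
    single = ae_mults(n, n // 2)                  # one big autoencoder
    k = n // m                                    # ensemble size
    ensemble = k * ae_mults(m, m // 2) + ae_mults(k, k // 2)
    return single, ensemble

# e.g., n = 100 features split into autoencoders of at most m = 10 inputs
single, ensemble = monolithic_vs_ensemble(100, 10)  # 10000 vs 1100 multiplications
```

Because the monolithic cost grows quadratically in n while each ensemble member's cost is bounded by m, capping m keeps the per-packet cost roughly linear in the number of features.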

IV. THE KITSUNE NIDS

In this section we present the Kitsune NIDS: the packet preprocessing framework, the feature extractor, and the core anomaly detection algorithm. We also discuss the complexity of the anomaly detection algorithm and provide a bound on its runtime performance.

A. Overview

Kitsune is a plug-and-play NIDS, based on neural networks, and designed for the efficient detection of abnormal patterns in network traffic. It operates by (1) monitoring the statistical patterns of recent network traffic, and (2) detecting anomalous patterns via an ensemble of autoencoders. Each autoencoder in the ensemble is responsible for detecting anomalies relating to a specific aspect of the network's behavior. Since Kitsune is designed to run on simple network routers, and in real time, it has been designed with a small memory footprint and a low computational complexity.

2 The modern approach is to accelerate these operations using a GPU. However, in this paper, we assume that no GPU is available. This is the case for a simple network router.

Kitsune's framework is composed of the following components:

• Packet Capturer: The external library responsible for acquiring the raw packet. Example libraries: NFQueue [31], afpacket [32], and tshark [33] (Wireshark's API).

• Packet Parser: The external library responsible for parsing raw packets to obtain the meta information required by the Feature Extractor. Example libraries: Packet++3, and tshark.

• Feature Extractor (FE): The component responsible for extracting n features from the arriving packets to create the instance ~x ∈ Rn. The features of ~x describe a packet, and the network channel from which it came.

• Feature Mapper (FM): The component responsible for creating a set of smaller instances (denoted v) from ~x, and passing v to the Anomaly Detector (AD). This component is also responsible for learning the mapping from ~x to v.

• Anomaly Detector (AD): The component responsible for detecting abnormal packets, given a packet's representation v.

Since the Packet Capturer and Packet Parser are not contributions of this paper, we will focus on the FE, FM, and AD components. We note that the FM and AD components are task generic (i.e., they solely depend on the input features), and can therefore be reapplied as a generic online anomaly detection algorithm. Moreover, we refer to the generic algorithm in the AD component as KitNET.

KitNET has one main input parameter, m: the maximum number of inputs for each autoencoder in KitNET's ensemble. This parameter affects the complexity of the ensemble in KitNET. Since m involves a trade-off between detection and runtime performance, the user of Kitsune must decide what is more important (detection rate vs. packet processing rate). This trade-off is further discussed in section V.

The FM and AD have two modes of operation: train-mode and exec-mode. For both components, train-mode transitions into exec-mode after some user-defined time limit. A component in train-mode updates its internal variables with the given inputs, but does not generate outputs. Conversely, a component in exec-mode does not update its variables, but does produce outputs.
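The two-mode contract described above can be sketched as a small base class. This is our own illustration, not code from Kitsune; the class and method names are hypothetical:

```python
import time

class TwoModeComponent:
    """Sketch of the train-mode/exec-mode contract: in train-mode the
    component learns from inputs and emits nothing; after the time limit
    it switches to exec-mode, where it emits outputs without learning."""

    def __init__(self, train_seconds):
        # Train-mode ends after a user-defined time limit.
        self.train_until = time.time() + train_seconds

    def in_train_mode(self):
        return time.time() < self.train_until

    def process(self, x):
        if self.in_train_mode():
            self._update(x)      # update internal variables, no output
            return None
        return self._execute(x)  # produce output, variables frozen

    def _update(self, x):
        raise NotImplementedError

    def _execute(self, x):
        raise NotImplementedError
```

A concrete component subclasses this and fills in `_update` and `_execute`.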

In order to better understand how Kitsune works, we will now describe the process which occurs when a packet is acquired, as depicted in Fig. 3:

1) The Packet Capturer acquires a new packet and passes the raw binary to the Packet Parser.

2) The Packet Parser receives the raw binary, parses the packet, and sends the meta information of the packet to the FE. For example, the packet's arrival time, size, and network addresses.

3) The FE receives this information, and uses it to retrieve over 100 statistics which are used to implicitly describe the current state of the channel from which the packet came. These statistics form the instance ~x ∈ Rn, which is passed to the FM.

3 The Packet++ project can be found on GitHub: https://github.com/seladb/PcapPlusPlus



Fig. 3: An illustration of Kitsune's Architecture.

4) The FM receives ~x...

• Train-mode: ...and uses ~x to learn a feature map. The map groups the features of ~x into sets with a maximum size of m each. Nothing is passed to the AD until the map is complete. At the end of train-mode, the map is passed to the AD, and the AD uses the map to build the ensemble architecture (each set forms the inputs to an autoencoder in the ensemble).

• Exec-mode: ...and the learned mapping is used to create a collection of small instances v from ~x, which is then passed to the respective autoencoders in the ensemble layer of the AD.

5) The AD receives v...

• Train-mode: ...and uses v to train the ensemble layer. The RMSE of the forward-propagation is then used to train the output layer. The largest RMSE of the output layer is set as φ and stored for later use.

• Exec-mode: ...and executes v across all layers. If the RMSE of the output layer exceeds φβ, then an alert is logged with the packet's details.

6) The original packet, ~x, and v are discarded.
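Steps 1-6 above can be sketched as a single processing function. The component callables and their names here are our own stand-ins, not Kitsune's API:

```python
def process_packet(raw, parser, fe, fm, ad, threshold, alerts):
    """Sketch of Kitsune's per-packet pipeline (hypothetical interfaces).

    parser, fe, fm, ad are callables standing in for the Packet Parser,
    Feature Extractor, Feature Mapper, and Anomaly Detector."""
    meta = parser(raw)     # 2) parse the raw binary into meta information
    x = fe(meta)           # 3) ~x: statistics describing the packet's channel
    v = fm(x)              # 4) map ~x into a collection of sub-instances
    if v is None:          # FM still building its map: nothing to score yet
        return None
    score = ad(v)          # 5) KitNET's RMSE anomaly score
    if score > threshold:
        alerts.append((meta, score))  # log an alert with packet details
    return score           # 6) raw, x, and v then go out of scope
```

In the real system the threshold corresponds to φβ from step 5.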

We will now discuss how the FE, FM, and AD components work in greater detail.

B. Feature Extractor (FE)

Feature extraction is the process of obtaining or engineering a vector of values which describe a real world observation. In network anomaly detection, it is important to extract features which capture the context and purpose of each packet traversing the network. For example, consider a single TCP SYN packet. The packet may be a benign attempt to establish a connection with a server, or it may be one of millions of similar packets sent in an attempt to cause a denial of service (DoS) attack. As another example, consider a video stream sent from an IP surveillance camera. Although the contents of the packets are legitimate, there may suddenly appear a consistently significant rise in jitter. This may indicate that the traffic is being sniffed in a man-in-the-middle attack.

These are just some examples of attacks where temporal-statistical features could help detect anomalies. The challenge with extracting these kinds of features from network traffic is that (1) packets from different channels (conversations) are interleaved, (2) there can be many channels at any given moment, and (3) the packet arrival rate can be very high. The naive approach is to maintain a window of packets from each channel, and to continuously compute statistics over those windows. However, it is clear how this can become impractical in terms of memory, and this approach does not scale well.

For this reason, we designed a framework for high speed feature extraction of temporal statistics, over a dynamic number of data streams (network channels). The framework has a small memory footprint since it uses incremental statistics maintained over a damped window. Using a damped window means that the extracted features are temporal (capture the recent behavior of the packet's channel), and that an incremental statistic can be deleted when its dampening weight becomes zero (saving additional memory). The framework has O(1) complexity because the collection of incremental statistics is maintained in a hash table. The framework also maintains useful 2D statistics which capture the relationship between the rx and tx traffic of a connection.

We will now briefly describe how damped incremental statistics work. Afterwards, we will enumerate the statistical features extracted by the FE to produce the instance ~x.

1) Damped Incremental Statistics: Let S = {x1, x2, . . .} be an unbounded data stream where xi ∈ R. For example, S can be a sequence of observed packet sizes. The mean, variance, and standard deviation of S can be updated incrementally by maintaining the tuple IS := (N, LS, SS), where N, LS, and SS are the number, linear sum, and squared sum of the instances seen so far. Concretely, the update procedure for inserting xi into IS is IS ← (N+1, LS+xi, SS+xi²), and the statistics at any given time are µS = LS/N, σ²S = |SS/N − (LS/N)²|, and σS = √(σ²S).

In order to extract the current behavior of a data stream, we must forget older instances. The naive approach is to maintain a sliding window of values. However, this approach has a memory and runtime complexity of O(n), in contrast to O(1) for an incremental statistic. Furthermore, the sliding window approach does not consider the amount of time spanned by the window. For example, the last 100 instances could have arrived over the last hour or in the last few seconds.

The solution to this is to use damped incremental statistics. In a damped window model, the weight of older values is exponentially decreased over time. Let d be the decay function defined as

dλ(t) = 2^(−λt)    (6)

where λ > 0 is the decay factor, and t is the time elapsed since the last observation from stream Si. The tuple of a damped incremental statistic is defined as ISi,λ := (w, LS, SS, SRij, Tlast), where w is the current weight, Tlast



TABLE I: Summary of the incremental statistics which can be computed from Si and Sj.

Type  Statistic                Notation   Calculation
1D    Weight                   w          w
1D    Mean                     µSi        LS/w
1D    Std.                     σSi        √|SS/w − (LS/w)²|
2D    Magnitude                ‖Si,Sj‖    √(µSi² + µSj²)
2D    Radius                   RSi,Sj     √((σ²Si)² + (σ²Sj)²)
2D    Approx. Covariance       CovSi,Sj   SRij / (wi + wj)
2D    Correlation Coefficient  PSi,Sj     CovSi,Sj / (σSi · σSj)

is the timestamp of the last update of ISi,λ, and SRij is the sum of residual products between streams i and j (used for computing 2D statistics). To update ISi,λ with xcur at time tcur, Algorithm 3 is performed.

Table I provides a list of the statistics which can be computed from the incremental statistic ISi,λ. We refer to statistics whose computation involves one and two incremental statistics as 1D and 2D statistics, respectively.

Algorithm 3: The algorithm for inserting a new value into a damped incremental statistic.

procedure update(ISi,λ, xcur, tcur, rj):
1: γ ← dλ(tcur − Tlast)                                ▷ Compute decay factor
2: ISi,λ ← (γw, γLS, γSS, γSR, tcur)                   ▷ Process decay
3: ISi,λ ← (w+1, LS+xcur, SS+xcur², SRij+ri·rj, tcur)  ▷ Insert value
4: return ISi,λ
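Algorithm 3, restricted to the 1D terms (the SR residual-product term for 2D statistics is omitted for brevity), can be sketched as follows. The class name is ours:

```python
import math

class DampedIncStat:
    """1D damped incremental statistic over (w, LS, SS, T_last),
    with decay d(t) = 2^(-lambda * t) applied before each insert."""

    def __init__(self, lam):
        self.lam = lam          # decay factor lambda > 0
        self.w = 0.0            # current (decayed) weight
        self.LS = 0.0           # decayed linear sum
        self.SS = 0.0           # decayed squared sum
        self.t_last = None      # timestamp of the last update

    def update(self, x, t):
        if self.t_last is not None:
            # Process decay: scale the tuple by gamma = 2^(-lambda * dt)
            g = 2.0 ** (-self.lam * (t - self.t_last))
            self.w *= g
            self.LS *= g
            self.SS *= g
        self.t_last = t
        # Insert value: (w+1, LS+x, SS+x^2)
        self.w += 1.0
        self.LS += x
        self.SS += x * x

    def mean(self):
        return self.LS / self.w

    def std(self):
        return math.sqrt(abs(self.SS / self.w - (self.LS / self.w) ** 2))
```

Note that the weight w replaces the count N of the undamped version, so the same mean and standard deviation formulas apply.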

2) Features Extracted for Kitsune: Whenever a packet arrives, we extract a behavioral snapshot of the hosts and protocols which communicated the given packet. The snapshot consists of 115 traffic statistics capturing a small temporal window into: (1) the packet's sender in general, and (2) the traffic between the packet's sender and receiver.

Specifically, the statistics summarize all of the traffic...

• ...originating from this packet's source MAC and IP address (denoted SrcMAC-IP).

• ...originating from this packet's source IP (denoted SrcIP).

• ...sent between this packet's source and destination IPs (denoted Channel).

• ...sent between this packet's source and destination TCP/UDP Socket (denoted Socket).

A total of 23 features (capturing the above) can be extracted from a single time window λ (see Table II). The FE extracts the same set of features from a total of five time windows: 100ms, 500ms, 1.5sec, 10sec, and 1min into the past (λ = 5, 3, 1, 0.1, 0.01), thus totaling 115 features.

We note that not every packet applies to every channel type (e.g., there is no socket if the packet does not contain a TCP or UDP datagram). In these cases, these features are zeroed. Thus, the final feature vector ~x, which the FE passes to the FM, is always a member of Rn, where n = 115.

C. Feature Mapper (FM)

The purpose of the FM is to map ~x's n features (dimensions) into k smaller sub-instances, one sub-instance for each autoencoder in the Ensemble Layer of the AD. Let v denote the ordered set of k sub-instances, where

v = {~v1, ~v2, · · · , ~vk} (7)

We note that the sub-instances of v can be viewed as subspaces of ~x's domain X.

In order to ensure that the ensemble in the AD operates effectively and with a low complexity, we require that the selected mapping f(~x) = v:

1) Guarantee that each ~vi has no more than m features, where m is a user defined parameter of the system. The parameter m affects the collective complexity of the ensemble (see section IV-E).

2) Map each of the n features in ~x exactly once to the features in v. This is to ensure that the ensemble is not too wide.

3) Contain subspaces of X which capture the normal behavior well enough to detect anomalous events occurring in the respective subspaces.

4) Be discovered in a process which is online, so that no more than one instance at a time is stored in memory.

To respect the above requirements, we find the mapping f by incrementally clustering the features (dimensions) of X into k groups which are no larger than m. We accomplish this by performing agglomerative hierarchical clustering on incrementally updated summary data.

More precisely, the feature mapping algorithm of the FM performs the following steps:

1) While in train-mode, incrementally update the summary statistics with the features of each instance ~x.

2) When train-mode ends, perform hierarchical clustering on the statistics to form f.

3) While in execute-mode, perform f(~xt) = v, and pass v to the AD.

In order to ensure that the grouped features capture normal behavior, in the clustering process, we use correlation as the distance measure between two dimensions. In general, the correlation distance dcor between two vectors u and v is defined as

dcor(u, v) = 1 − ((u − ū) · (v − v̄)) / (‖u − ū‖2 ‖v − v̄‖2)    (8)

where ū is the mean of the elements in vector u, and u · v is the dot product.
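Eq. (8) is simply one minus the cosine similarity of the mean-centered vectors. A direct sketch:

```python
import math

def dcor(u, v):
    """Correlation distance of Eq. (8): 1 minus the cosine similarity
    of the mean-centered vectors u and v."""
    mu_u = sum(u) / len(u)
    mu_v = sum(v) / len(v)
    du = [x - mu_u for x in u]   # u - u_bar
    dv = [x - mu_v for x in v]   # v - v_bar
    num = sum(a * b for a, b in zip(du, dv))
    den = (math.sqrt(sum(a * a for a in du))
           * math.sqrt(sum(b * b for b in dv)))
    return 1.0 - num / den
```

Perfectly correlated features have distance 0, uncorrelated features 1, and anti-correlated features 2.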

We will now explain how the correlation distances between features can be summarized incrementally for the purpose of clustering. Let nt be the number of instances seen so far. Let



TABLE II: The statistics (features) extracted from each time window λ when a packet arrives.

Channel Type  λ (windows)         Extracted Statistics                            Stats from all windows  Description: All traffic…                          Example
MI            5, 3, 1, 0.1, 0.01  wi, µi, σi                                      15   …originating from the same MAC and IP address     (00:1C:B3:09:85:15, 192.168.0.2)→*
H             5, 3, 1, 0.1, 0.01  wi, µi, σi                                      15   …originating from an IP address                   192.168.0.2→*
HHjit         5, 3, 1, 0.1, 0.01  wi, µi, σi                                      15   …from one IP to another (delta in arrival times)  192.168.0.2→192.168.0.5
HH            5, 3, 1, 0.1, 0.01  wi, µi, σi, ‖Si,Sj‖, RSi,Sj, CovSi,Sj, PSi,Sj   35   …shared between two IP addresses                  192.168.0.2↔192.168.0.5
HpHp          5, 3, 1, 0.1, 0.01  wi, µi, σi, ‖Si,Sj‖, RSi,Sj, CovSi,Sj, PSi,Sj   35   …shared between two network sockets               192.168.0.2:23523↔192.168.0.5:80

The packet's…  Statistics                         Aggregated by                      # Features  Description of the Statistics
…size          µi, σi                             SrcMAC-IP, SrcIP, Channel, Socket  8           Bandwidth of the outbound traffic
…size          ‖Si,Sj‖, RSi,Sj, CovSi,Sj, PSi,Sj  Channel, Socket                    8           Bandwidth of the outbound and inbound traffic together
…count         wi                                 SrcMAC-IP, SrcIP, Channel, Socket  4           Packet rate of the outbound traffic
…jitter        wi, µi, σi                         Channel                            3           Inter-packet delays of the outbound traffic

Fig. 4: An example dendrogram of the 115 features clustered together, from one million network packets.

~c be an n dimensional vector containing the linear sum of each feature's values, such that element c(i) = Σ_{t=0}^{nt} x(i)_t for feature i at the time index t. Similarly, let ~cr denote a vector containing the summed residuals of each feature, such that c(i)r = Σ_{t=0}^{nt} (x(i)_t − c(i)/nt). Similarly, let ~crs denote a vector containing the summed squared residuals of each feature, such that c(i)rs = Σ_{t=0}^{nt} (x(i)_t − c(i)/nt)². Let C be the n-by-n partial correlation matrix, where

[Ci,j] = Σ_{t=0}^{nt} ((x(i)_t − c(i)/nt)(x(j)_t − c(j)/nt))    (9)

is the sum of products between the residuals of features i and j. Let D be the correlation distance matrix between the features of X.

Using C and ~crs, the correlation distance matrix D can be computed at any time by

D = [Di,j] = 1 − Ci,j / (√c(i)rs · √c(j)rs)    (10)
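Eqs. (9) and (10) can be sketched in batch form for clarity; in Kitsune the same C and ~crs quantities would be accumulated one instance at a time during train-mode rather than from a stored matrix:

```python
import numpy as np

def distance_matrix(X):
    """Batch-form sketch of Eqs. (9)-(10): X is an (instances x features)
    array, and every feature is assumed to be non-constant (otherwise the
    denominator of Eq. (10) is zero)."""
    R = X - X.mean(axis=0)       # residuals of each feature
    C = R.T @ R                  # Eq. (9): sums of residual products
    crs = np.diag(C)             # summed squared residuals per feature
    D = 1.0 - C / np.sqrt(np.outer(crs, crs))  # Eq. (10)
    return D
```

The diagonal of D is zero, and Di,j equals the pairwise correlation distance dcor of Eq. (8).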

Now that we know how to obtain the distance matrix D incrementally, we can perform agglomerative hierarchical clustering on D to find f. Briefly, the algorithm starts with n clusters, one cluster for each point represented by D. It then searches for the two closest points and joins their associated clusters. This search and join procedure repeats until there is one large cluster containing all n points. The tree which represents the discovered links is called a dendrogram (pictured in Fig. 4). For further information on the clustering algorithm, we refer the reader to [34]. Typically, hierarchical clustering cannot be performed on large datasets due to its complexity. However, our distance matrix is small (n being on the order of a few hundred) and is therefore practical to compute on-site.

Finally, with the dendrogram of D, we can easily find k clusters (groups of features) where no cluster is larger than m. The procedure is to break the dendrogram's largest link (i.e., the top most node) and then check whether all of the found clusters have a size of at most m. If there is at least one cluster with a size greater than m, then we repeat the process on the exceeding clusters. At the end of the procedure, we will have k groups of features with a strong inter-correlation, where no single group is larger than m. These groupings are saved, and used to perform the mapping f.
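The split-until-small-enough procedure above can be sketched with SciPy's hierarchical clustering. The paper does not fix a linkage criterion, so average linkage is an assumption here, as is the function name:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import squareform

def feature_map(D, m):
    """Sketch: cluster features via the dendrogram of distance matrix D,
    then break top links until every group has at most m features."""
    # Build the dendrogram from the condensed form of D.
    Z = linkage(squareform(D, checks=False), method="average")
    groups, stack = [], [to_tree(Z)]
    while stack:
        node = stack.pop()
        if node.get_count() <= m:
            groups.append(node.pre_order())  # leaf (feature) ids
        else:
            # Break this node's link and recurse on its two children.
            stack += [node.get_left(), node.get_right()]
    return groups
```

The returned groups define f: during execute-mode, f(~x) slices ~x into one sub-instance per group.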

The algorithm described in this section is suitable for online and on-site processing because (1) it never stores more than one instance in memory, (2) it uses very little memory (O(n²) during train-mode), and (3) it is very fast, since the update procedure only requires updating a small n-by-n distance matrix.

D. Anomaly Detector (AD)

As depicted in Fig. 3, the AD component contains a special neural network which we refer to as KitNET (Kitsune NETwork). KitNET is an unsupervised ANN designed for the task of online anomaly detection. KitNET is composed of two layers of autoencoders: the Ensemble Layer and the Output Layer.

Ensemble Layer: An ordered set of k three-layer autoencoders, mapped to the respective instances in v. This layer is responsible for measuring the independent abnormality of each subspace (instance) in v. During train-mode, the autoencoders learn the normal behavior of their respective subspaces. During both train-mode and execute-mode, each autoencoder reports its RMSE reconstruction error to the Output Layer.

Output Layer: A three-layer autoencoder which learns the normal (i.e., train-mode) RMSEs of the Ensemble Layer. This layer is responsible for producing a final anomaly score, considering (1) the relationship between subspace abnormalities, and (2) naturally occurring noise in the network traffic.

We will now detail how KitNET operates the Ensemble and Output Layers.

1) Initialization: When the AD receives the first set of mapped instances v from the FM, the AD initializes KitNET's architecture using v as a blueprint. Concretely, let θ denote an entire autoencoder, and let L(1) and L(2) denote the Ensemble and Output Layers respectively. L(1) is defined as the ordered set

L(1) = {θ1, θ2, . . . , θk} (11)

such that autoencoder θi ∈ L(1) has three layers of neurons: dim(~vi) neurons in the input and output layers, and ⌈β · dim(~vi)⌉



Algorithm 4: The back-propagation training algorithm for KitNET.

procedure train(L(1), L(2), v):
    // Train Ensemble Layer
1:  ~z ← zeros(k)                  ▷ init input for L(2)
2:  for θi in L(1) do
3:      ~v′i ← norm0−1(~vi)
4:      Ai, ~yi ← hθi(~v′i)        ▷ forward propagation
5:      deltasi ← bθi(~v′i, ~yi)   ▷ backward propagation
6:      θi ← GDℓ(Ai, deltasi)      ▷ weight update
7:      ~z[i] ← RMSE(~v′i, ~yi)    ▷ set error signal
8:  end for
    // Train Output Layer
9:  ~z′ ← norm0−1(~z)
10: A0, ~y0 ← hθ0(~z′)             ▷ forward propagation
11: deltas0 ← bθ0(~z′, ~y0)        ▷ backward propagation
12: θ0 ← GDℓ(A0, deltas0)          ▷ weight update
13: return L(1), L(2)

neurons in the inner layer, where β ∈ (0, 1] (in our experimentswe take β = 3

4 ). Fig. 5 illustrates the described mappingbetween ~vi ∈ v and ~θi ∈ L(1).

L(2) is defined as the single autoencoder θ0, which has k input and output neurons, and ⌈β · k⌉ inner neurons.

Layers L(1) and L(2) are not connected via the weighted synapses found in the common ANN. Rather, the inputs to L(2) are the 0-1 normalized RMSE error signals from each respective autoencoder in L(1). Signaling the aggregated errors (RMSEs) of each autoencoder in L(1), as opposed to signaling from each individual neuron of L(1), reduces the complexity of the network. Finally, the weights of autoencoder θi in KitNET are initialized with random values drawn from the uniform distribution U(−1/dim(~vi), 1/dim(~vi)).
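The initialization rules above can be sketched with NumPy. The sigmoid activation is an assumption (the section does not restate the activation function here), and the function names are ours:

```python
import numpy as np

def init_autoencoder(d, beta=0.75, seed=0):
    """One ensemble autoencoder: d visible neurons, ceil(beta*d) hidden
    neurons, weights drawn from U(-1/d, 1/d) as described above."""
    rng = np.random.default_rng(seed)
    h = int(np.ceil(beta * d))
    W1 = rng.uniform(-1.0 / d, 1.0 / d, size=(h, d))  # encoder weights
    W2 = rng.uniform(-1.0 / d, 1.0 / d, size=(d, h))  # decoder weights
    b1, b2 = np.zeros(h), np.zeros(d)
    return W1, b1, W2, b2

def reconstruct(params, x):
    """Forward pass; returns the reconstruction and its RMSE."""
    W1, b1, W2, b2 = params
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))  # assumed sigmoid activation
    y = sig(W2 @ sig(W1 @ x + b1) + b2)
    return y, float(np.sqrt(np.mean((x - y) ** 2)))
```

The RMSE returned here is the error signal each θi sends up to the Output Layer.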

2) Train-mode: Training KitNET is slightly different from training a common ANN, as described in section III. This is because KitNET signals RMSE reconstruction errors between the two main layers of the network. Furthermore, KitNET is trained using SGD on each observed instance v

Fig. 5: An illustration of the mapping process between the FM and the Ensemble Layer of KitNET: a sub-instance ~vi is mapped from ~x, which is then sent to the autoencoder θi.

Algorithm 5: The execution algorithm for KitNET.

procedure execute(L(1), L(2), v):
   // Execute Ensemble Layer
1: ~z ← zeros(k)                ▷ init input for L(2)
2: for θi in L(1) do
3:     ~v′i ← norm0−1(~vi)
4:     Ai, ~yi ← hθi(~v′i)      ▷ forward propagation
5:     ~z[i] ← RMSE(~v′i, ~yi)  ▷ set error signal
6: end for
   // Execute Output Layer
7: ~z′ ← norm0−1(~z)
8: A0, ~y0 ← hθ0(~z′)           ▷ forward propagation
9: return RMSE(~z′, ~y0)

exactly once. The algorithm for training KitNET on a single instance is presented in Algorithm 4.

We note that in order to perform the 0-1 normalization on line 3, each autoencoder must maintain a record of the largest and smallest value seen for each input feature. Furthermore, these maximum and minimum records are only updated during train-mode.
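The normalization record described above can be sketched as follows (the class name and the convention of mapping a zero range to 0 are our own choices):

```python
class MinMaxNorm:
    """0-1 normalization with per-feature min/max records that are
    updated only while in train-mode, as described above."""

    def __init__(self, n):
        self.lo = [float("inf")] * n   # smallest value seen per feature
        self.hi = [float("-inf")] * n  # largest value seen per feature

    def __call__(self, x, train):
        if train:  # records are frozen outside of train-mode
            self.lo = [min(a, b) for a, b in zip(self.lo, x)]
            self.hi = [max(a, b) for a, b in zip(self.hi, x)]
        # Scale each feature into [0, 1]; a zero range maps to 0.
        return [(v - a) / (b - a) if b > a else 0.0
                for v, a, b in zip(x, self.lo, self.hi)]
```

In exec-mode, values outside the recorded range simply scale beyond [0, 1], which is itself a signal of abnormality.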

Similar to the discussion in section III-E, KitNET must be trained on normal data (without the presence of attacks). This is a common assumption [35], [36] and is practical in many types of computer networks, such as IP camera surveillance systems. Furthermore, there are methods for filtering the training data in order to reduce the impact of possible preexisting attacks in the network [37], [38].

3) Execute-mode: In execute-mode, KitNET does not update any of its internal parameters. Instead, KitNET performs forward propagation through the entire network, and returns L(2)'s RMSE reconstruction error. The execution procedure of KitNET is presented in Algorithm 5.

L(2)'s reconstruction error measures the instance's abnormality with respect to the relationships between the subspaces in v. For example, consider two autoencoders from the Ensemble Layer, θi, θj ∈ L(1). If the RMSEs of θi and θj correlate during train-mode, then a lack of correlation in execute-mode may be considered a more significant anomaly than, say, the simple sum of their independent RMSEs (and vice versa). Since the Output Layer L(2) learns these relationships (and other complex relationships) during train-mode, L(2)'s reconstruction error of L(1)'s RMSEs will reflect these anomalies.

4) Anomaly Scoring: The output of KitNET is the RMSE anomaly score s ∈ [0, ∞), as described in section III-E. The larger the score s, the greater the anomaly. To use s, one must determine an anomaly score cutoff threshold φ. The naive approach is to set φ to the largest score seen during train-mode, where we assume that all instances represent normal traffic. Another approach is to select φ probabilistically. Concretely, one may (1) fit the outputted RMSE scores to a log-normal or other non-standard distribution, and then (2) raise an alert if s has a very low probability of occurring. A user of KitNET should decide the best method of selecting φ according to his/her application of the algorithm. In section V, we evaluate Kitsune's detection capabilities based on its raw RMSE scores.
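The probabilistic option can be sketched by fitting a log-normal in log-space and alerting beyond a tail quantile. This is one possible realization, not the paper's prescribed method; the k-sigma sensitivity parameter is an assumption:

```python
import math

def lognormal_threshold(scores, k=3.0):
    """Fit a log-normal to the train-mode RMSE scores and return a
    threshold phi at k standard deviations in log-space (assumed
    sensitivity parameter; scores of zero are skipped)."""
    logs = [math.log(s) for s in scores if s > 0]
    mu = sum(logs) / len(logs)
    var = sum((x - mu) ** 2 for x in logs) / len(logs)
    return math.exp(mu + k * math.sqrt(var))
```

Scores exceeding the returned φ then lie in the far tail of the fitted distribution and would trigger an alert.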



E. Complexity

As a baseline, we shall compare the complexity of KitNET to that of a single three-layer autoencoder over the same feature space ~x ∈ Rn, with the compression layer ratio β ∈ (0, 1].

The complexity of executing the single autoencoder is as follows: In section III-F, we found that the complexity of activating layer l(i+1) in an ANN is O(l(i) · l(i+1)). Therefore, the complexity of executing the single autoencoder is

O(n · βn + βn · n) = O(n²)    (12)

The complexity of executing KitNET is as follows: We remind the reader that k denotes the number of subspaces (autoencoders) selected by the FM, and that m is an input parameter of the system which defines the maximum number of inputs for any given autoencoder in L(1). The complexities of executing L(1) and L(2) are O(km²) and O(k²) respectively. Since m ∈ {1, 2, . . . , 10} is a constant parameter of the system, the total complexity of KitNET is

O(km² + k²) = O(k²)    (13)

The result of (13) tells us that the complexity of the Ensemble Layer scales linearly with n, but the complexity of the Output Layer depends on how many autoencoders (subspaces) are designated by the FM. Concretely, the best case scenario is where k = n/m, in which case performance is increased by a factor of m. The worst case scenario is where the FM designates nearly every single feature its own autoencoder in L(1). In this case, k = n, and KitNET operates as a single wide autoencoder, meaning that there is no performance gain. This will also occur if the user sets m = 1 or m = n.
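A quick numeric check of Eqs. (12) and (13) illustrates the gap. The multiply-accumulate counts below follow the layer-product formula of the text; taking n = 115 and m = 10, we assume a best-case k of roughly 12 groups:

```python
def ops_single(n, beta=0.75):
    """Multiply-accumulate count for one wide three-layer autoencoder,
    following Eq. (12): n*floor(beta*n) for each of the two layers."""
    h = int(beta * n)
    return n * h + h * n

def ops_kitnet(k, m, beta=0.75):
    """Per Eq. (13): k ensemble autoencoders of width at most m, plus
    the k-input output autoencoder."""
    return k * ops_single(m, beta) + ops_single(k, beta)
```

With these assumed values, `ops_single(115)` is roughly an order of magnitude larger than `ops_kitnet(12, 10)`, matching the claim that the ensemble avoids the n² cost of one wide autoencoder.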

However, it is very rare for the latter case to occur on a natural dataset. This is because it would mean that the dendrogram from the FM's clustering process in section IV-C is completely imbalanced. For example,

dcor(x(1), x(2)) < dcor(x(2), x(3)) < dcor(x(3), x(4)) < . . .    (14)

where x(i) is the i-th dimension of Rn. Therefore, it can be expected that in the presence of many features, KitNET will have a runtime which is faster than that of a single autoencoder or stacked autoencoder. Finally, the complexity of training KitNET is also O(k²), since we learn from each instance only

Fig. 6: Two of the cameras used in the IP camera video surveillance network. Left: SNC-EM602RC. Right: SNC-EB602R.

once (see Algorithm 4). This can be contrasted with ANN-based classifiers, which typically make multiple passes (epochs) over the training set.

V. EVALUATION

In this section, we provide an evaluation of Kitsune in terms of its detection and runtime performance. We open by describing the datasets, followed by the experiment setup, and finally, close by presenting our results.

A. Datasets

The goal of Kitsune is to provide a lightweight IDS which can handle many packets per second on a simple router. Given this goal, we evaluated Kitsune's capabilities in detecting attacks in a real IP camera video surveillance network. The network (pictured in Fig. 7) consists of two deployments of four HD surveillance cameras each. The cameras in the deployments are powered via PoE, and are connected to the DVR via a site-to-site VPN tunnel. The DVR at the remote site provides users with global accessibility to the video streams via a client-to-site VPN connection. The cameras used in the network, and their configurations, are described in Table IV. Fig. 6 pictures two of the eight cameras used in our setup.

There are a number of attacks which can be performed on the video surveillance network. However, the most critical attacks affect the availability and integrity of the video uplinks. For example, a SYN flood on a target camera, or a man in the middle attack involving video injection into a live video stream. Therefore, in our evaluation, we focused on these types of attacks. Table III summarizes the attack datasets used in our experiments, and Fig. 7 illustrates the location (vector) of the attacker for each respective attack. The Violation column in Table III indicates the attacker's security violation on the network's confidentiality (C), integrity (I), and availability (A). All datasets were recorded from the packet capture point as indicated in Fig. 7.

To set up the active wiretap, we used a Raspberry PI 3B as a physical network bridge. The PI was given a USB-to-Ethernet adapter to provide the second Ethernet port, and then placed physically in the middle of the cable.

We note that for some of the attacks, the malicious packets did not explicitly traverse the router to which Kitsune is connected. In these cases, the FE component implicitly captures these attacks as a result of statistical changes in the network's behavior. For example, the man in the middle attacks affect the timing of the packets, but not necessarily the contents of the packets themselves.

In order to evaluate Kitsune on a noisier network, we used an additional network. The additional network was a Wi-Fi network populated with 9 IoT devices and three PCs. The IoT devices were a thermostat, a baby monitor, a webcam, two different doorbells, and four different cheap security cameras. On this particular network, we infected one of the security cameras with a real sample of the Mirai botnet malware.

B. Experiment Setup

Offline algorithms typically perform better than online algorithms. This is because offline algorithms have access to



TABLE III: The datasets used to evaluate Kitsune.

Attack Type        Attack Name        Tool             Description: The attacker…                                                                                   Violation  Vector  # Packets  Train [min.]  Execute [min.]
Recon.             OS Scan            Nmap             …scans the network for hosts, and their operating systems, to reveal possible vulnerabilities.              C          1       1,697,851  33.3          18.9
Recon.             Fuzzing            SFuzz            …searches for vulnerabilities in the cameras' web servers by sending random commands to their cgis.          C          3       2,244,139  33.3          52.2
Man in the Middle  Video Injection    Video Jack       …injects a recorded video clip into a live video stream.                                                     C, I       1       2,472,401  14.2          19.2
Man in the Middle  ARP MitM           Ettercap         …intercepts all LAN traffic via an ARP poisoning attack.                                                     C          1       2,504,267  8.05          20.1
Man in the Middle  Active Wiretap     Raspberry PI 3B  …intercepts all LAN traffic via an active wiretap (network bridge) covertly installed on an exposed cable.   C          2       4,554,925  20.8          74.8
Denial of Service  SSDP Flood         Saddam           …overloads the DVR by causing cameras to spam the server with UPnP advertisements.                           A          1       4,077,266  14.4          26.4
Denial of Service  SYN DoS            Hping3           …disables a camera's video stream by overloading its web server.                                             A          1       2,771,276  18.7          34.1
Denial of Service  SSL Renegotiation  THC              …disables a camera's video stream by sending many SSL renegotiation packets to the camera.                   A          1       6,084,492  10.7          54.9
Botnet Malware     Mirai              Telnet           …infects IoT devices with the Mirai malware by exploiting default credentials, and then scans for new vulnerable victims on the network.  C, I  X  764,137  52.0  66.9

TABLE IV: The specifications and statistics of the cameras used in the experiments.

Model        Resolution  Codec        FPS  Avg. Packets/Sec  Avg. Bandwidth  Protocol
SNC-EM602RC  1280x720    H.264/MPEG4  15   195               1.8 Mbit/s      RTP
SNC-EM600    1280x720    H.264/MPEG4  15   350               1.4 Mbit/s      RTP
SNC-EB600    1280x720    H.264/MPEG4  15   290               1.8 Mbit/s      Https (TLSv1)
SNC-EB602R   1280x720    H.264/MPEG4  15   320               1.8 Mbit/s      Http/TCP

the entire dataset during training and can perform multiple passes over the data. However, online algorithms are useful when resources, such as computational power and memory, are limited. In our evaluations, we compare Kitsune to both online and offline algorithms. The online algorithms provide a baseline for Kitsune as an online anomaly detector, and the offline algorithms provide a perspective on Kitsune's performance as a sort of upper bound. As an additional baseline, we evaluate Suricata [39], a signature-based NIDS. Suricata is an open source NIDS which is similar to the Snort NIDS, but is parallelized over multiple threads. Signature-based NIDS have a much lower false positive rate than anomaly-based NIDS. However, they are incapable of detecting unknown threats or abnormal/abstract behaviors. We configured Suricata to use the 13,465 rules from the Emerging Threats repository [40].

For the offline algorithms, we used Isolation Forests (IF) [41] and Gaussian Mixture Models (GMM) [42]. IF is an ensemble-based method of outlier detection, and GMM is a statistical method based on the expectation maximization algorithm. For the online algorithms, we used an incremental GMM from [43], and pcStream2 [44]. pcStream2 is a stream clustering algorithm which detects outliers by measuring the Mahalanobis distance of new instances to known clusters.
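For intuition, the distance measure that this family of methods relies on can be sketched as follows. This is an illustrative implementation of the Mahalanobis distance, not pcStream2's actual code:

```python
import numpy as np

# Illustrative sketch: the Mahalanobis distance of a new instance x to a
# cluster with mean mu and covariance matrix S. An instance is flagged as an
# outlier when this distance to every known cluster exceeds a threshold.
def mahalanobis(x, mu, S):
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.sqrt(d @ np.linalg.inv(S) @ d))

# With an identity covariance this reduces to the Euclidean distance:
print(mahalanobis([3.0, 4.0], [0.0, 0.0], np.eye(2)))  # 5.0
```

Unlike the Euclidean distance, this measure accounts for the spread of each cluster: distances along high-variance directions are discounted.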

Fig. 7: The network topologies used in the experiments: the surveillance network (top) and the IoT network (bottom).

For each experiment (dataset), every algorithm was trained on the first million packets, and then executed on the remainder. The duration of the first million packets depends on the packet rate of the network at the time of capture. For example, with the OS Scan dataset, each algorithm was trained on the first one million packets, and then executed on the remaining 697,851 packets. Table III lists the relative train and execute periods for each dataset. Each algorithm received the exact same features from Table II.

Kitsune has one main parameter, m ∈ {1, 2, . . . , n}, which is the maximum number of inputs for any one autoencoder of KitNET's ensemble. For our detection performance evaluations, we set m = 1 and m = 10. For all other algorithms, we used the default settings.
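To see how m shapes the ensemble, consider this simplified sketch that splits n features into groups of at most m, with one small autoencoder per group. KitNET's actual feature map is learned from the data rather than assigned sequentially; this is only an illustration of the parameter:

```python
# Simplified sketch of KitNET's ensemble sizing: n input features are split
# into groups of at most m inputs, one autoencoder per group. (KitNET learns
# its feature-to-autoencoder mapping; the sequential split here is only to
# illustrate what m controls.)
def partition_features(n, m):
    return [list(range(i, min(i + m, n))) for i in range(0, n, m)]

groups = partition_features(198, 10)
print(len(groups))                   # 20 autoencoders in the ensemble layer
print(max(len(g) for g in groups))   # no autoencoder exceeds m = 10 inputs
```

With m = 1 the ensemble degenerates to one autoencoder per feature, while m = n collapses it to a single large autoencoder.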


C. Evaluation Metrics

The output of an anomaly detector (s) is a value on the range [0, ∞), where larger values indicate greater anomalies (e.g., the RMSE of an autoencoder). This output is typically normalized such that scores with a value less than 1 are considered normal, and scores greater than 1 are considered anomalies. The score s is normalized by dividing it by a cutoff threshold φ. The choice of φ has a great effect on an algorithm's performance.

The detection performance of an algorithm, on a particular dataset, can be measured in terms of its true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). In our evaluations we measure an algorithm's true positive rate (TPR = TP/(TP + FN)) and false negative rate (FNR = FN/(FN + TP)) when φ is selected so that the false positive rate (FPR = FP/(FP + TN)) is very low (i.e., 0.001). We also count the number of true positives where the FPR is zero. These measures capture the number of malicious packets which were detected correctly, or accidentally missed, with few or no false alarms. In network intrusion detection, it is important that there be a minimal number of false alarms, since it is time consuming and expensive for an analyst to investigate each alarm.
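This procedure can be sketched in code: pick φ from the benign score distribution so that the FPR is at most a target value, then report the rates at that cutoff. The function and variable names are illustrative, not from Kitsune's implementation:

```python
import numpy as np

# Sketch of the evaluation procedure: choose the cutoff threshold phi from the
# benign score distribution so that at most a target fraction of benign scores
# exceed it, then compute TPR, FNR, and FPR at that phi.
def rates_at_fpr(scores, labels, target_fpr=0.001):
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)          # True = malicious
    benign = np.sort(scores[~labels])
    k = int(np.ceil(len(benign) * (1.0 - target_fpr)))
    phi = benign[min(k - 1, len(benign) - 1)]        # cutoff threshold
    tp = np.sum((scores > phi) & labels)
    fn = np.sum((scores <= phi) & labels)
    fp = np.sum((scores > phi) & ~labels)
    tn = np.sum((scores <= phi) & ~labels)
    return tp / (tp + fn), fn / (fn + tp), fp / (fp + tn)
```

Setting target_fpr=0 selects the lowest threshold that produces no false positives on the given scores.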

Fig. 8 plots KitNET's anomaly scores before and during a fuzzing attack. The lower blue line is the lowest threshold we might select during the training phase which produces no FPs. The upper blue line is the lowest threshold possible during the attack which produces no FPs (akin to a global optimum). Another way of looking at these two thresholds is as the best-case and worst-case scenarios for threshold selection. Therefore, by measuring the performance at these two thresholds, we can get a better idea of an algorithm's potential as an anomaly detector.

To measure the general performance (i.e., with every possible φ), we used the area under the receiver operating characteristic curve (AUC) and the equal error rate (EER). In our context, the AUC is the probability that a classifier will rank a randomly chosen anomalous instance higher than a randomly chosen normal instance. In other words, an algorithm with an AUC of 1 is a perfect anomaly detector on the given dataset, whereas an algorithm with an AUC of 0.5 is randomly guessing labels. The EER is a measure which captures an algorithm's trade-off between its FNR and FPR. It is computed as the value of the FNR and FPR when they are minimal and equal to one another.
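Both metrics follow directly from their definitions. The naive O(n²) sketch below is for illustration only and is not the evaluation code used in the paper:

```python
import numpy as np

# AUC from its probabilistic definition: the chance a random anomaly score
# outranks a random benign score (ties count as half).
def auc(scores, labels):
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    a, b = scores[labels], scores[~labels]
    wins = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    return float(wins / (len(a) * len(b)))

# EER approximated by sweeping phi over the observed scores and taking the
# point where FPR and FNR are closest to equal.
def eer(scores, labels):
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best = (np.inf, 1.0)
    for phi in np.unique(scores):
        fpr = np.mean(scores[~labels] >= phi)
        fnr = np.mean(scores[labels] < phi)
        best = min(best, (abs(fpr - fnr), (fpr + fnr) / 2))
    return float(best[1])
```

A perfectly separated dataset yields auc = 1.0 and eer = 0.0, while fully overlapping score distributions yield 0.5 for both.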

D. Detection Performance

Figure 9 presents the TPR and FNR of each algorithm over each dataset when the threshold is selected so that the FPR is 0 and 0.001. The figure also presents the AUC and EER. We remind the reader that the GMM and Isolation Forest are batch (offline) algorithms, which have full access to the entire dataset and perform many iterations over it. Therefore, these algorithms serve as an optimum goal for us to achieve. In Fig. 9 we see that Kitsune performed very well relative to these algorithms. In particular, Kitsune performed even better than the GMM in detecting the active wiretap. Moreover, our algorithm achieved a better EER than the GMM on the ARP, Fuzzing, Mirai, SSL R., SYN and active wiretap datasets.

Fig. 8: KitNET's RMSE anomaly scores before and during a fuzzing attack. The red lines indicate when the attacker connects to the network (left) and initiates the attack (right).

As evident from Fig. 9, there is a trade-off between the detection performance and m (runtime performance). Users who prefer better detection performance over speed should use an m which is close to 1 or n, whereas users who prefer speed should use a moderately sized m. Kitsune gives the user the ability to adjust this parameter according to the requirements of the user's system. The effect of m on the runtime performance is presented in Section V-E.

As a baseline comparison to the performance of online algorithms, we compared Kitsune to the incremental GMM and pcStream2. Overall, it is clear that Kitsune outperforms both algorithms in terms of AUC and EER.

The top row of Figure 9 presents the maximum number of true positives each algorithm was able to obtain when the threshold was set so that there were no false positives (e.g., the blue dashed bar in Fig. 8). In other words, these figures show how well each anomaly detector is able to raise the anomaly score of malicious packets above the noise floor. The figures show that Kitsune detects attacks across the datasets better than the other algorithms, and more so than the GMM in most cases. We note that Kitsune with m = 10 sometimes performs better than with m = 1. This is because the Output Layer autoencoder lowers the noise floor of the ensemble. This effect can amplify the scores of some outliers.

E. Runtime Performance

One of Kitsune's greatest advantages is its runtime performance. As discussed in Section IV-E, KitNET's ensemble of small autoencoders is more efficient than using a single autoencoder. This is because the ensemble reduces the overall number of operations required to process each instance.
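The saving can be illustrated with a back-of-the-envelope count of weight multiplications. The three-layer shape and the hidden-layer ratio β = 0.75 below are assumptions for illustration, not the paper's exact architecture:

```python
import math

# Back-of-envelope sketch: the dominant per-instance cost of a three-layer
# autoencoder is its two matrix-vector products, roughly
# 2 * inputs * hidden multiplications (encode + decode).
def autoencoder_ops(inputs, beta=0.75):
    hidden = max(1, int(inputs * beta))  # beta is an assumed compression ratio
    return 2 * inputs * hidden

# An ensemble of ceil(n/m) small autoencoders, plus one output-layer
# autoencoder over their ceil(n/m) RMSE scores (KitNET-style layout).
def ensemble_ops(n, m, beta=0.75):
    k = math.ceil(n / m)
    sizes = [m] * (n // m) + ([n % m] if n % m else [])
    return sum(autoencoder_ops(s, beta) for s in sizes) + autoencoder_ops(k, beta)

print(autoencoder_ops(198))     # single large autoencoder over all n features
print(ensemble_ops(198, 10))    # ensemble with m = 10: over an order of magnitude fewer ops
```

Because the cost of each autoencoder grows quadratically with its input width, many narrow autoencoders are far cheaper than one wide one, even after adding the output layer.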

To demonstrate this effect, we performed benchmarks on a Raspberry PI 3B and an Ubuntu Linux VM running on a Windows 10 PC (full details are available in Table V). The experiments were coded in C++, involved n = 198 statistical packet features, and were executed on a single core (a physical core on the PI and a logical core on the Ubuntu VM).

Fig. 9: The experimental results for all algorithms on each of the datasets: the TPR when the FPR is equal to 0.001 (top-left), the FNR when the FPR is equal to 0.001 (top-right), the AUC (bottom-left), and the EER (bottom-right).

Fig. 10 plots the effect which KitNET's ensemble size has on the packet processing rate. With a single autoencoder in L(1), the PI and PC can handle approx. 1,000 and 7,500 packets per second respectively. However, with 35 autoencoders in L(1), the performance of both environments improves by a factor of 5, to approx. 5,400 and 37,300 respectively. Fig. 11 provides a closer look at the PI's packet processing times with k = 1 and k = 35. This figure shows that using an ensemble can also reduce the variance in the processing times. This may be beneficial in applications where jitter in network traffic is undesirable.

The results of the Raspberry PI's benchmark show that a simple network router, with limited resources, can support Kitsune as an NIDS. This means that Kitsune is an inexpensive and reliable distributed NIDS solution. We note that since the experiments were run on a single core, there is potential to increase the packet processing rates further. To achieve this, we plan on parallelizing KitNET over multiple cores.

VI. ADVERSARIAL ATTACKS & COUNTERMEASURES

When using Kitsune, there are several aspects one should consider. First of all, an advanced adversary may attempt to perform adversarial machine learning [45]. When first installed, Kitsune assumes that all traffic is benign while in train-mode. Therefore, a preexisting adversary may be able to evade Kitsune's detection. However, during execute-mode, Kitsune will detect new attacks and threats as they present themselves. Regardless, a user should be aware of this risk when installing Kitsune on a potentially compromised network. As future work, it would be interesting to find a mechanism which can safely filter out potentially contaminated instances during the training process. For example, we can first execute and see if there is a high anomaly score. If there is, then we will not train on the instance (since we only want to learn from benign instances). By doing so, we can potentially remain in train-mode indefinitely.

Fig. 10: The effect KitNET's ensemble size (k) has on the average packet processing rate, while running on a single core of a Raspberry PI and an Ubuntu VM (PC), using n = 198 features.

TABLE V: The environments used to perform the benchmarks.

          Environment 1      Environment 2
          Raspberry PI 3B    Ubuntu VM
CPU Type  Broadcom BCM2837   Intel i7-4790
Clock     1.2GHz             3.60GHz
Cores     4                  4 (8 logical)
RAM       1 GB               4 GB

If there is a significant concern that the target network has been contaminated, then one may prefer to use a signature-based NIDS, such as Snort. The trade-off is that a signature-based NIDS cannot automatically detect new or abstract threats (as demonstrated in the evaluation results). A good compromise would be to install an efficient NIDS (such as Snort 3.0 or Suricata) alongside Kitsune.

Fig. 11: Density plots of the packet processing times in the PI, with k = 1 (top), and k = 35 (bottom).

Another threat to Kitsune is a DoS attack launched against the FE. In this scenario, an attacker sends many packets with random IP addresses. By doing so, the FE creates many incremental statistics which eventually consume the device's memory. Although this attack causes a large anomaly, the system may become unstable. Therefore, it is highly recommended that the user limit the number of incremental statistics which can be stored in memory. One may note that with a C++ implementation of Kitsune, roughly 1MB of RAM can contain approximately 1,000 network links (assuming five damped windows per link). A good solution to maintaining a small memory footprint is to periodically search for and delete incremental statistics with wi ≈ 0. In practice, the majority of incremental statistics remain in this state since we use relatively large λs (quick decay).
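The periodic cleanup of decayed statistics can be sketched as follows, using the damped-window decay 2^(−λ·Δt). The class and field names are illustrative, not Kitsune's actual implementation:

```python
# Sketch of the memory safeguard: each incremental statistic keeps a damped
# weight w that decays as 2^(-lambda * dt) between observations; statistics
# whose decayed weight is effectively zero are periodically deleted.
class IncStat:
    def __init__(self, lam):
        self.lam, self.w, self.last_t = lam, 0.0, 0.0

    def insert(self, t):
        # decay the old weight, then count the new observation
        self.w = self.w * 2.0 ** (-self.lam * (t - self.last_t)) + 1.0
        self.last_t = t

def prune(stats, now, eps=1e-6):
    """Keep only statistics whose decayed weight is still above eps."""
    alive = lambda s: s.w * 2.0 ** (-s.lam * (now - s.last_t)) > eps
    return {key: s for key, s in stats.items() if alive(s)}
```

With a large λ, a statistic that stops receiving packets (e.g., one created by a spoofed source IP) decays to ~0 within seconds and is reclaimed on the next sweep, bounding the memory an attacker can consume.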

VII. CONCLUSION

Kitsune is a neural network based NIDS which has been designed to be efficient and plug-and-play. It accomplishes this task by efficiently tracking the behavior of all network channels, and by employing an ensemble of autoencoders (KitNET) for anomaly detection. In this paper, we discussed the framework's online machine learning process in detail, and evaluated it in terms of detection and runtime performance. KitNET, an online algorithm, performed nearly as well as the other batch/offline algorithms, and in some cases better. Moreover, the algorithm is efficient enough to run on a single core of a Raspberry PI, and has an even greater potential on stronger CPUs.

In summation, there is a great benefit in being able to deploy an intelligent NIDS on simple network devices, especially when the entire deployment process is plug-and-play. We hope that Kitsune is beneficial for professionals and researchers alike, and that the KitNET algorithm sparks an interest in further developing the domain of online neural network based anomaly detection.

ACKNOWLEDGMENT

The authors would like to thank Masayuki Nakae, NEC Corporation of America, for his feedback and assistance in building the surveillance camera deployment. The authors would also like to thank Yael Mathov, Michael Bohadana, and Yishai Wiesner for their help in creating the Mirai dataset.

REFERENCES

[1] Marshall A Kuypers, Thomas Maillart, and Elisabeth Pate-Cornell. An empirical analysis of cyber security incidents at a large organization. Department of Management Science and Engineering, Stanford University, School of Information, UC Berkeley, 30, 2016.

[2] Dimitrios Damopoulos, Sofia A Menesidou, Georgios Kambourakis, Maria Papadaki, Nathan Clarke, and Stefanos Gritzalis. Evaluation of anomaly-based IDS for mobile devices using machine learning classifiers. Security and Communication Networks, 5(1):3–14, 2012.

[3] Yinhui Li, Jingbo Xia, Silan Zhang, Jiakai Yan, Xiaochuan Ai, and Kuobin Dai. An efficient intrusion detection system based on support vector machines and gradually feature removal method. Expert Systems with Applications, 39(1):424–430, 2012.


[4] Supranamaya Ranjan. Machine learning based botnet detection using real-time extracted traffic features, March 25 2014. US Patent 8,682,812.

[5] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.

[6] Howard B Demuth, Mark H Beale, Orlando De Jess, and Martin T Hagan. Neural network design. Martin Hagan, 2014.

[7] Nidhi Srivastav and Rama Krishna Challa. Novel intrusion detection system integrating layered framework with neural network. In Advance Computing Conference (IACC), 2013 IEEE 3rd International, pages 682–689. IEEE, 2013.

[8] Reyadh Shaker Naoum, Namh Abdula Abid, and Zainab Namh Al-Sultani. An enhanced resilient backpropagation artificial neural network for intrusion detection system. International Journal of Computer Science and Network Security (IJCSNS), 12(3):11, 2012.

[9] Chunlin Zhang, Ju Jiang, and Mohamed Kamel. Intrusion detection using hierarchical neural networks. Pattern Recognition Letters, 26(6):779–791, 2005.

[10] Rebecca Petersen. Data mining for network intrusion detection: A comparison of data mining algorithms and an analysis of relevant features for detecting cyber-attacks, 2015.

[11] Rupesh K Srivastava, Klaus Greff, and Jurgen Schmidhuber. Training very deep networks. In Advances in Neural Information Processing Systems, pages 2377–2385, 2015.

[12] Jonathan A Silva, Elaine R Faria, Rodrigo C Barros, Eduardo R Hruschka, Andre CPLF de Carvalho, and Joao Gama. Data stream clustering: A survey. ACM Computing Surveys (CSUR), 46(1):13, 2013.

[13] Garcia-Teodoro et al. Anomaly-based network intrusion detection: Techniques, systems and challenges. Computers & Security, 28(1):18–28, 2009.

[14] Harjinder Kaur, Gurpreet Singh, and Jaspreet Minhas. A review of machine learning based anomaly detection techniques. arXiv preprint arXiv:1307.7286, 2013.

[15] Taeshik Shon and Jongsub Moon. A hybrid machine learning approach to network anomaly detection. Information Sciences, 177(18):3799–3821, 2007.

[16] Taeshik Shon, Yongdae Kim, Cheolwon Lee, and Jongsub Moon. A machine learning framework for network anomaly detection using SVM and GA. In Information Assurance Workshop, 2005. IAW'05. Proceedings from the Sixth Annual IEEE SMC, pages 176–183. IEEE, 2005.

[17] Anna L Buczak and Erhan Guven. A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Communications Surveys & Tutorials, 18(2):1153–1176, 2016.

[18] Ke Wang and Salvatore J Stolfo. Anomalous payload-based network intrusion detection. In RAID, volume 4, pages 203–222. Springer, 2004.

[19] Miao Xie, Jiankun Hu, Song Han, and Hsiao-Hwa Chen. Scalable hypergrid k-NN-based online anomaly detection in wireless sensor networks. IEEE Transactions on Parallel and Distributed Systems, 24(8):1661–1670, 2013.

[20] Srinivas Mukkamala, Andrew H Sung, and Ajith Abraham. Intrusion detection using an ensemble of intelligent paradigms. Journal of Network and Computer Applications, 28(2):167–182, 2005.

[21] Xiangzeng Zhou, Lei Xie, Peng Zhang, and Yanning Zhang. An ensemble of deep neural networks for object tracking. In Image Processing (ICIP), 2014 IEEE International Conference on, pages 843–847. IEEE, 2014.

[22] Mahmood Yousefi-Azar, Vijay Varadharajan, Len Hamey, and Uday Tupakula. Autoencoder-based feature learning for cyber security applications. In Neural Networks (IJCNN), 2017 International Joint Conference on, pages 3854–3861. IEEE, 2017.

[23] Ahmad Javaid, Quamar Niyaz, Weiqing Sun, and Mansoor Alam. A deep learning approach for network intrusion detection system. In Proceedings of the 9th EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS), pages 21–26. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2016.

[24] Mayu Sakurada and Takehisa Yairi. Anomaly detection using autoencoders with nonlinear dimensionality reduction. In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, page 4. ACM, 2014.

[25] T Jayalakshmi and A Santhakumaran. Statistical normalization and back propagation for classification. International Journal of Computer Theory and Engineering, 3(1):89, 2011.

[26] Suvrit Sra, Sebastian Nowozin, and Stephen J Wright. Optimization for machine learning. MIT Press, 2012.

[27] Leon Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421–436. Springer, 2012.

[28] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: a new learning scheme of feedforward neural networks. In Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on, volume 2, pages 985–990. IEEE, 2004.

[29] Jacob Walker, Carl Doersch, Abhinav Gupta, and Martial Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In European Conference on Computer Vision, pages 835–851. Springer, 2016.

[30] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.

[31] Yusuke Sugiyama and Kunio Goto. Design and implementation of a network emulator using virtual network stack. In 7th International Symposium on Operations Research and Its Applications (ISORA08), pages 351–358, 2008.

[32] Eric Leblond and Giuseppe Longo. Suricata IDPS and its interaction with the Linux kernel.

[33] Borja Merino. Instant Traffic Analysis with Tshark How-to. Packt Publishing Ltd, 2013.

[34] Fionn Murtagh and Pedro Contreras. Methods of hierarchical clustering. arXiv preprint arXiv:1105.0121, 2011.

[35] Wenke Lee, Salvatore J Stolfo, et al. Data mining approaches for intrusion detection. In USENIX Security Symposium, pages 79–93. San Antonio, TX, 1998.

[36] Mingrui Wu and Jieping Ye. A small sphere and large margin approach for novelty detection using training data with outliers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11):2088–2092, 2009.

[37] Niladri Sett, Subhrendu Chattopadhyay, Sanasam Ranbir Singh, and Sukumar Nandi. A time aware method for predicting dull nodes and links in evolving networks for data cleaning. In Web Intelligence (WI), 2016 IEEE/WIC/ACM International Conference on, pages 304–310. IEEE, 2016.

[38] Deepthy K Denatious and Anita John. Survey on data mining techniques to enhance intrusion detection. In Computer Communication and Informatics (ICCCI), 2012 International Conference on, pages 1–5. IEEE, 2012.

[39] Suricata — open source IDS / IPS / NSM engine. https://suricata-ids.org/, 11 2017. (Accessed on 11/14/2017).

[40] Index of /open/suricata/rules. https://rules.emergingthreats.net/open/suricata/rules/, 11 2017. (Accessed on 11/14/2017).

[41] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on, pages 413–422. IEEE, 2008.

[42] Douglas Reynolds. Gaussian mixture models. Encyclopedia of Biometrics, pages 827–832, 2015.

[43] Sylvain Calinon and Aude Billard. Incremental learning of gestures by imitation in a humanoid robot. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, pages 255–262. ACM, 2007.

[44] Yisroel Mirsky, Tal Halpern, Rishabh Upadhyay, Sivan Toledo, and Yuval Elovici. Enhanced situation space mining for data streams. In Proceedings of the Symposium on Applied Computing, pages 842–849. ACM, 2017.

[45] Ling Huang, Anthony D Joseph, Blaine Nelson, Benjamin IP Rubinstein, and JD Tygar. Adversarial machine learning. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, pages 43–58. ACM, 2011.
