
Vrije Universiteit Amsterdam Universiteit van Amsterdam

    Master Thesis

Anomaly Detection with Autoencoders for Heterogeneous Datasets

    Author: Philip Roeleveld (2586787)

1st supervisor: Wan Fokkink
daily supervisor: Jan van der Vegt (Cubonacci)
2nd reader: Vincent François-Lavet

A thesis submitted in fulfillment of the requirements for the joint UvA-VU Master of Science degree in Computer Science

    November 14, 2020

Abstract

Autoencoders are a type of neural network trained to reconstruct their input data as well as possible after encoding it in a lower dimensionality. An important application of Autoencoder models is anomaly detection, where normal data is more easily reconstructed than anomalous data. In this thesis the standard Autoencoder is adapted in multiple ways to incorporate categorical data features. This leads to six different Autoencoder models that are compared to three other well-known anomaly detection methods. Research efforts in anomaly detection are, however, hampered by the lack of a standardized method to evaluate the performance of anomaly detection methods. In this thesis we therefore also construct a benchmark consisting of data from 19 real-world datasets and use it to compare the six Autoencoder models and the three other anomaly detection methods.


    Contents

1 Introduction
    1.1 Cubonacci
    1.2 Overview
2 Related Work
    2.1 Evaluation of Anomaly Detection
    2.2 Heterogeneous Data
3 Method
    3.1 Evaluation Metrics
4 Categorical features
5 Models
    5.1 Modifying Autoencoders for Heterogeneous Data
    5.2 Alternative Anomaly Detection Methods
    5.3 Anomaly Score & Thresholding
6 Experiments & Evaluation
    6.1 Parameter Search
    6.2 Dataset Distribution
    6.3 Evaluation on Unseen Datasets
7 Conclusion
8 Discussion & Future Work
References



    1 Introduction

Anomaly detection is the difficult problem of scanning a dataset for abnormal points. These abnormal points are referred to as anomalies, outliers, or novelties. What constitutes an anomaly in practice depends on the context and the distribution of the surrounding data. A prevailing general definition is due to Hawkins [1]: “[An outlier is] an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.”

Anomaly detection has many applications, ranging from the detection of credit card fraud to network intrusions. Data points that are detected to be anomalous usually warrant manual inspection by an operator. Anomaly detection is thus a tool that reduces the work of monitoring a vast dataset to the inspection of a few anomalous points of interest.

Many methods have been developed to address the task of anomaly detection. More recently these methods have also included machine learning approaches, employing neural networks in particular. Machine learning methods have been conceived for both supervised and unsupervised settings, meaning that it is known or unknown, respectively, which points are anomalous and which are not, for some subset of the data. Among these methods are Autoencoders [2]. An Autoencoder is a neural network consisting of an encoder and a decoder. The encoder reduces the dimensionality of the input data as it is propagated through, resulting in an encoded vector of much smaller size than the input. The decoder has the task of reconstructing the input from the relatively little information left in the encoding. The difference between this reconstruction and the original input, called the reconstruction error, is used as a loss function to train the network via back-propagation. See figure 1 for an example of what an Autoencoder might look like.

Originally, Autoencoders arose as a method for dimensionality reduction, and later to generate new data that resembles the input data. To apply such a model to anomaly detection, recognize that in theory an Autoencoder should have a much easier time reconstructing normal data than anomalous data, because anomalies are sparse and the model is trained on predominantly normal data. The reconstruction error can therefore be used as a measure of how anomalous a datapoint is. This approach goes at least as far back as the work by Hawkins et al. [3].

Of the multitude of methods that exist for anomaly detection, many are geared towards a specific type of data, or the structure and quirks of the data of interest are at least known beforehand and used to tweak the anomaly detection model for that dataset. In this thesis we focus instead on the case where very little is known about the data beforehand, having only the dimensionality of the data and readily available statistics to go on. This means that the problem is not only unsupervised, but there are also no heuristics based on the type of data to guide the configuration of any model. The reason to focus on this general anomaly detection case is the particular industrial application of anomaly detection by Cubonacci.

    1.1 Cubonacci

Cubonacci [4] is a machine learning lifecycle management platform. Users of Cubonacci define their machine learning models and specify how to load the data consumed by those models. A typical use of the platform has users adopt the programming interface so that their machine learning model can be detected by the platform, define a data loader to supply the model with data, and possibly configure various metrics to evaluate the performance of the model. Afterwards, the model can be trained using the console, and hyperparameters can be changed interactively. Finally, a trained model can be deployed via a web API. The platform manages everything below the surface, from the overall training pipeline down to the allocation of computing resources in a cloud environment.

Because Cubonacci manages the loading of users’ data, the platform has an opportunity to inspect the structure of that data, and even to train an anomaly detection model that can be deployed alongside any regular model deployment by a user. This model could check incoming data from the live deployment and potentially alert users to suspicious outliers. As the model would see the same data as the user-deployed model, consistent detection of anomalies could indicate to the user that the output of their own model might no longer be trustworthy.

For the typical use case, the datasets supplied to initially train models before deployment are often sterilized and free of anomalies, while the model that is deployed live will inevitably encounter anomalies. This observation plays a role in constructing benchmark datasets for the purpose of testing anomaly detection methods for this application: the training part of a benchmark dataset should similarly be free of anomalies, while the testing data should contain anomalies.

Because of the unknown nature of the data of any particular user, there is no prior knowledge for any anomaly detection method to take advantage of. There is thus a need for a general approach that works reasonably well for any dataset. This “any dataset” criterion does not only encompass numerical datasets of varying distributions. A dataset might also contain non-numerical dimensions. Examples are body-of-text data, time-series data such as timestamps, or categorical data, which could be as simple as a yes/no data column. (We refer to any such singular data attribute as a “feature”.) Since users are free to mix and match any of these data types in so-called heterogeneous datasets, any anomaly detection model must either have ways to incorporate features of each type, or ignore some types altogether, which might hide anomalies that would otherwise be easy to spot. To limit the scope of this research somewhat, we only consider datasets that have nothing but numerical and categorical features. Even with this restricted view, most anomaly detection models only work for numerical data.

    1.2 Overview

In this thesis we use various methods to adapt the Autoencoder model for heterogeneous datasets with numerical and categorical features, and in particular, define two new loss functions for such data.

The resulting six Autoencoder models are compared to three other well-known anomaly detection methods: Local Outlier Factor [5], Isolation Forest [6], and One-Class Support Vector Machine [7]. Part of the difficulty in answering the question of a general purpose model lies in the fact that different methods and configurations yield good results for different datasets. Moreover, the best approach is usually dependent on the type of the data. When determining the best approach in the general case, there is thus a need for a standardized approach to the evaluation of the different anomaly detection methods. The lack of such a standardized evaluation scheme has been a known issue for anomaly detection research, but efforts have been made to resolve it. The work by Emmott et al. [8] in particular introduces a systematic benchmark that we will utilize as a framework to evaluate the anomaly detection models.

Section 2 goes over related work, in particular the evaluation of anomaly detection methods and methods for dealing with heterogeneous data. In section 3 we describe the methods used to evaluate the various anomaly detection models and the construction of a systematic benchmark. Section 4 contains a discussion of how to incorporate categorical features in numerical-based models. Section 5 lists the models to be tested, and the new Autoencoder models in particular. Results are presented in section 6, followed by a conclusion in section 7 and a final discussion in section 8.


2 Related Work

Over the years many anomaly detection methods have been proposed. Accordingly, many surveys have been conducted to map out the disparate methods throughout the various fields where they are utilized. Chandola et al. [9] provide a broad overview of the methods used in the different areas of research as well as their advantages, disadvantages, and assumptions. More recently, Chalapathy and Chawla [10] look into the application of deep learning techniques for anomaly detection.

    2.1 Evaluation of Anomaly Detection

A notorious problem faced by anomaly detection researchers lies in the lack of a standardized framework for the evaluation of different methods. Researchers run into multiple problems when it comes to evaluating their work. Oftentimes only one dataset is used, or very few datasets, usually within a specific area of research. This means that it is unknown whether the method generalizes to other types of data, and it is impossible to tell whether a method is generally superior to the alternatives or just happens to be better suited to the data at hand [11]. Furthermore, some works rely exclusively on synthetic datasets to evaluate the proposed method, which is problematic because such data might not be representative of a real-world application. Finally, the usage of vastly different datasets commonly results in incomparable results being published.

To address these issues, Emmott et al. [8] propose a systematic benchmark for the evaluation of anomaly detection methods. They propose splitting real-world classification datasets into two categories by splitting the labels into two groups. One of these groups is declared normal and the other anomalous. The anomalous data can then be subsampled and added to the normal data to generate a realistic benchmark. This framework has also been adopted by other researchers [12].

Campos et al. [11] attempt to address the same issues. They use a similar approach to Emmott et al., using a multitude of datasets and downsampling the anomaly class, but they focus only on datasets that have either commonly been used in anomaly detection research or have a semantic interpretation of anomalies. Additionally, they introduce new measures for the performance of unsupervised anomaly detection methods to complement the commonly used AUROC metric.

    2.2 Heterogeneous Data

Since most anomaly detection methods are designed for numerical data, an important concern for a general purpose model is to facilitate non-numerical data, the most important kind of which is categorical data. An early approach to this problem of combining numerical and categorical data is due to Ghoting et al. [13]. Their model focuses on the categorical data, using frequent itemset mining to determine anomalies. Numerical data is added to the model by considering the matrix of Pearson correlation coefficients between the numerical dimensions for each itemset.

Zhang and Jin [14] propose to train a logistic regression classifier for each of the categorical features, using the numerical data as input. The idea is that these classifiers will perform well for normal data and poorly for anomalous data. To determine whether a point is anomalous, this is combined with a separate metric that captures how anomalous the numerical data is on its own, namely the distance to the kth nearest neighbor. Note that the underlying framework and the idea of training classifiers for the categorical data is the interesting contribution, more so than the specific classifiers and numerical metric. Unfortunately, the weakness of this model is that it is blind to possible correlations among the categorical features.

Because of the popularity of neural networks, anomaly detection models based on them, such as Autoencoders, have seen considerably more research into the adoption of categorical data. We refer to a survey by Hancock and Khoshgoftaar [15] for an overview. One particularly popular method, entity embeddings [16], has been used in multiple anomaly detection models [17, 18].


3 Method

We adopt the method proposed by Emmott et al. [8] for the systematic construction of a benchmark from a multitude of classification datasets. The reason to use their method to construct our own benchmark, rather than adopt an existing benchmark such as the one made available by Campos et al. [11], is the need for a multitude of heterogeneous datasets. Following their method, all of the datasets are obtained from the publicly available UCI machine learning repository [19], with the exception of the famous MNIST database of handwritten digits [20], which is also included because it consists of both many datapoints and many features. From the UCI repository, we select datasets based on the following criteria, resulting in a benchmark of 19 different datasets¹:

    • Classification task

    • At least 6000 instances

    • No time-series, sequential, or body-of-text features

    • At most 1000 features

The ‘Classification task’ criterion means that the dataset is labeled with classes, as opposed to a ‘Regression task’ dataset that is labeled with numerical values. Emmott et al. do include regression datasets, but their method to convert these datasets for the benchmark does not fall in line with the usual definition of anomalies as having been generated by a separate process from the normal data. As such, we do not include regression datasets in the benchmark.

Since we are not interested in the question of which models perform best when there is little data to work with, it is important to ensure that datasets contain at least enough data to train the models properly. The repository contains many datasets of no more than a few hundred instances, so the criterion of ‘At least 6000 instances’ is necessary, although the exact number is somewhat arbitrary.

The exact number in the ‘At most 1000 features’ criterion is similarly arbitrary. This criterion is used for two reasons. First, as the number of features has a significant impact on computation time, limiting it ensures that no individual dataset takes too long to process. Second, datasets with that many features would in practice be preprocessed with some form of dimensionality reduction before being supplied to the main model, which is beyond the scope of this thesis.

¹Due to time constraints, some datasets in the repository that meet the criteria were nonetheless not included. The datasets that were selected were chosen blindly so as not to introduce any bias. See table 1 for the selected datasets.


To construct the benchmark we take each of the selected datasets and split the points into two classes, normal and anomalous. For binary classification datasets, the larger class is taken as normal and the smaller class as anomalous. For datasets with multiple classes, a Random Forest classifier is fitted to the dataset to obtain a confusion matrix between the classes. The entries of this matrix are then seen as the edge weights of a fully connected graph with the classes as nodes. Finally, two-coloring the maximum weight spanning tree of this graph results in two subsets of classes that are maximally “confusable” with each other. The color that contains the most datapoints is taken as normal and the other as anomalous.
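As an illustration of this multi-class splitting step, the sketch below follows the description above using scikit-learn and networkx. The function name, the use of in-sample Random Forest predictions, and the symmetrization of the confusion matrix are our own simplifications, not the thesis implementation.

    import networkx as nx
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix

    def normal_class_mask(X, y):
        # Fit a Random Forest, weight a complete graph over the classes by the
        # confusion-matrix entries (summing the two directions to make the
        # weights symmetric, our own choice), two-color its maximum-weight
        # spanning tree, and declare the larger side "normal". In-sample
        # predictions are used here for brevity; the thesis setup may differ.
        classes = np.unique(y)
        pred = RandomForestClassifier(n_estimators=100).fit(X, y).predict(X)
        cm = confusion_matrix(y, pred, labels=classes)

        G = nx.Graph()
        for i in range(len(classes)):
            for j in range(i + 1, len(classes)):
                G.add_edge(classes[i], classes[j], weight=cm[i, j] + cm[j, i])

        tree = nx.maximum_spanning_tree(G)
        color = {classes[0]: 0}
        for u, v in nx.bfs_edges(tree, classes[0]):
            color[v] = 1 - color[u]                   # two-color the tree

        group = [[c for c in classes if color[c] == b] for b in (0, 1)]
        sizes = [np.isin(y, g).sum() for g in group]
        normal = group[int(np.argmax(sizes))]
        return np.isin(y, normal)                     # True marks a "normal" point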

The anomalous points of each dataset are subdivided into four categories of increasing difficulty, using kernel logistic regression to estimate the difficulty of each point. Rather than computing the full kernel for each dataset, the Nyström method [21] is used to approximate the kernels, which reduces computation time. The points from these difficulty levels can then be sampled to generate dataset instances of that difficulty, by combining the sampled points with the normal data. In addition to a point difficulty, each instance has a configurable frequency of anomalies relative to normal points. For the benchmark we generate an instance for each of the four difficulty levels, for each of the three relative frequencies 0.001, 0.01, and 0.1. When there are not enough anomalies of a particular difficulty to meet the higher relative frequencies, that instance is discarded. The normal datapoints are split into training and testing data at a ratio of four to one.

As explained in section 1.1, we only add the anomalous datapoints to the test sets, keeping the training data clean of anomalies. This is in contrast with the work of Emmott et al., where anomalies are also added to the training data. Another distinction is that Emmott et al. use two different sampling strategies for the anomalies. One is to choose a seed point and sample all of its nearest neighbors as an instance, for maximal clusteredness. The other is to use a facility location algorithm to achieve the opposite effect of sampling anomalies that are as far apart from each other as possible. We forego these sampling methods and simply sample points randomly. This is motivated by multiple reasons. Firstly, it is a difficult problem in its own right to determine a suitable distance metric for the above two sampling methods when it comes to heterogeneous data. Secondly, since we keep the training data clean regardless, the sampling method will not affect model performance, especially when the points are already distributed among four difficulty levels.

After constructing this benchmark with multiple instances of multiple datasets, some of the datasets were used in the research itself and to conduct a hyperparameter search for the best configuration of each of the models. The rest of the datasets were held out for evaluation purposes, to avoid unknowingly biasing the models towards the benchmark during development. The datasets that were held out until evaluation are highlighted in table 1.


Table 1: Datasets used for experiments. Those marked with * were held out until evaluation and not used for parameter search. The Features column contains the total number of features, as well as the number of categorical features in brackets where applicable.

Short name | Full name in UCI repository | Size | Features | Dataset Instances
Adult | Adult | 48842 | 14 (8) | 12
AnuranCalls | Anuran Calls (MFCCs) | 7195 | 22 | 10
APSFailure | APS Failure at Scania Trucks | 60000 | 170 | 2
Avila | Avila | 20867 | 10 | 12
BankMarketing | Bank Marketing | 45211 | 19 (10) | 11
ClaveDirection | Firm-Teacher_Clave-Direction_Classification | 10800 | 16 | 12
Covertype* | Covertype | 581012 | 12 (2) | 12
Diabetes | Diabetes 130-US hospitals for years 1999-2008 | 100000 | 48 (39) | 9
HTRU2* | HTRU2 | 17898 | 8 | 9
ISOLET* | ISOLET | 7797 | 617 | 12
MiniBooNE* | MiniBooNE particle identification | 130065 | 50 | 12
MNIST | - | 70000 | 784 | 12
Mushroom | Mushroom | 8124 | 22 (22) | 11
Musk* | Musk (Version 2) | 6598 | 166 | 12
Nursery* | Nursery | 12960 | 8 (8) | 9
OnlineShoppers* | Online Shoppers Purchasing Intention Dataset | 12330 | 18 (8) | 10
SensorlessDriveDiagnosis | Dataset for Sensorless Drive Diagnosis | 58509 | 48 | 12
Shuttle* | Statlog (Shuttle) | 58000 | 9 | 9
Thyroid* | Thyroid Disease | 7200 | 21 (15) | 9


    3.1 Evaluation Metrics

Equally as important as the benchmark itself are the metrics used to compare the anomaly detection models to each other. In the literature, the metric most commonly used for this task is the Area Under the Receiver Operating Characteristic curve, or AUROC, also abbreviated as ROC AUC or just AUC (Area Under Curve) in other works. This metric can be obtained as follows. Suppose we have a binary classifier that assigns a numerical value to each datapoint. An example could be an anomaly detector that assigns low values to normal points and high values to anomalous points. To turn this model into a true classifier we would need to specify a threshold below which points are considered normal and above which points are considered anomalous. We can think of this assigning of values as scoring the datapoints according to how anomalous they are perceived to be. Given such a scoring classifier, the Receiver Operating Characteristic (ROC) curve is obtained by plotting the true positive rate along the vertical axis against the false positive rate along the horizontal axis, for all possible threshold values. The AUROC is simply the area under this ROC curve. In the context of anomaly detection, the true positive rate is the proportion of true anomalies that are correctly identified as anomalies, and the false positive rate is the proportion of true normal points that are falsely identified as anomalies. A perfect classifier would rank all normal points below all anomalous points, resulting in an AUROC of 1, whereas a perfectly random classifier would result in an ROC curve straight from (0, 0) to (1, 1) and an AUROC of 0.5.
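As a small, self-contained illustration (with made-up scores and labels, not data from the benchmark), scikit-learn computes exactly this aggregation over all thresholds:

    from sklearn.metrics import roc_auc_score

    # made-up anomaly scores (higher = more anomalous) and labels (1 = anomaly)
    y_true = [0, 0, 0, 1, 0, 1]
    scores = [0.1, 0.3, 0.2, 0.9, 0.4, 0.35]
    print(roc_auc_score(y_true, scores))   # 0.875: one of the eight (anomaly, normal)
                                           # pairs is ranked in the wrong order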

The reason AUROC is so popular in the literature is two desirable properties. The first is its objective nature: since it aggregates over all possible threshold values, no subjectivity is required of the researcher to specify any single threshold. Second, by relying on the true positive and false positive rates, AUROC corrects for class imbalance, which is an important property for anomaly detection in particular. The popularity of AUROC also means it better facilitates comparison of results between different papers. For this reason, the AUROC metric will be included in the results. But the AUROC metric is not without an important weakness: it does not take into account the possibility that the two different types of misclassification might have different costs associated with them. In some settings, letting an anomaly slip by unnoticed might be of much greater cost than an additional manual inspection of a false positive, while in other settings it might not be so catastrophic to miss an anomaly and the cost of manual inspection is generally larger. To address this, although AUROC values are included, we will shift the focus towards various accuracy metrics instead.

Accuracy is perhaps the simplest evaluation metric: it is simply the proportion of data points that have been classified correctly. Let P and N denote positive (anomalous) and negative (normal) points respectively, and TP and TN denote true (correct) classifications. We write FN = P − TP and FP = N − TN for the false (incorrect) classifications. Then the accuracy is as follows:

\[
\text{Accuracy} = \frac{TP + TN}{P + N} = 1 - \frac{FN + FP}{P + N}.
\]

Like AUROC, the accuracy metric above weighs false positives and false negatives equally. Another way to phrase this is to say that the metric is representative when the costs of both types of misclassification are equal. To weigh the accuracy for different costs, let c1 be the cost of a false negative (missed anomaly) and c2 the cost of a false positive (wrongly detected anomaly); the c1:c2 weighted accuracy is then defined as:

\[
\text{Accuracy}_{c_1:c_2} = 1 - \frac{c_1 \, FN + c_2 \, FP}{c_1 \, P + c_2 \, N}.
\]

Note that to adjust for class imbalance, the weights must be applied in the denominator of the above fraction as well. In the results below we will use 15:1, 1:1, and 1:15 weighted accuracy to represent three different cost scenarios. These three metrics, combined with AUROC, are used to evaluate the various anomaly detection models.
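A minimal sketch of the weighted accuracy above; the helper name and array convention are ours, not part of the thesis code:

    import numpy as np

    def weighted_accuracy(y_true, y_pred, c1=1.0, c2=1.0):
        # c1 = cost of a false negative (missed anomaly),
        # c2 = cost of a false positive (wrongly detected anomaly).
        y_true = np.asarray(y_true, dtype=bool)   # True marks an actual anomaly
        y_pred = np.asarray(y_pred, dtype=bool)   # True marks a predicted anomaly
        P, N = y_true.sum(), (~y_true).sum()
        FN = (y_true & ~y_pred).sum()
        FP = (~y_true & y_pred).sum()
        return 1.0 - (c1 * FN + c2 * FP) / (c1 * P + c2 * N)

    # e.g. weighted_accuracy(y_true, y_pred, c1=15, c2=1) puts fifteen times the
    # cost on missed anomalies, following the c1:c2 definition above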

In their work, Campos et al. [11] use other metrics in addition to AUROC. However, all of the metrics they employ are precision-based, meaning that they do not explicitly measure the cost of false negatives, i.e. missed anomalies. As such, the trade-off between the two misclassification costs cannot be set directly.


    4 Categorical features

Most anomaly detection methods in the literature focus solely on either numerical or categorical data. However, a mix of both types is often available in practice, and simply ignoring data of either type would mean letting that data go to waste. Ideally a good model would be free to consider data of both types. To achieve this it is common to either transform numerical features into categorical features or the other way around, depending on the chosen model. Transforming numerical features is usually done by means of binning (e.g. Fayyad and Irani [22]), where numerical values are discretized by partitioning the range of possible values. Transforming categorical features into numerical features, on the other hand, is done by embedding the possible values in a vector space. This is often achieved simply with the one-hot encoding, where a categorical feature of cardinality n is mapped bijectively to the set of vectors in R^n that contain all zeroes except for a single 1. So for example a categorical feature of cardinality 2, consisting of e.g. ‘Yes’ and ‘No’, could map ‘Yes’ to (1, 0) and ‘No’ to (0, 1). But there are also more complex methods to transform categorical features to numerical ones, like word2vec [23], where a representation of each category is learned based on which other values it appears in conjunction with.
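For instance, the Yes/No example above can be one-hot encoded with scikit-learn as follows (on scikit-learn versions before 1.2 the argument is `sparse` rather than `sparse_output`):

    from sklearn.preprocessing import OneHotEncoder
    import numpy as np

    # One-hot encoding the Yes/No example above; the explicit `categories`
    # argument fixes the column order to match the text (Yes -> (1, 0)).
    enc = OneHotEncoder(categories=[["Yes", "No"]], sparse_output=False)
    X = np.array([["Yes"], ["No"], ["Yes"]])
    print(enc.fit_transform(X))
    # [[1. 0.]
    #  [0. 1.]
    #  [1. 0.]]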

For Autoencoders we are interested in transforming the categorical features to numerical vectors to fit the neural network structure. A simple method like one-hot encoding can be used, but the Autoencoder will then attempt to imitate the data as it is after the one-hot encoding step. This means that a loss function such as the mean squared error does not necessarily capture the reconstruction error for the original categorical features properly. This problem can be addressed by introducing a second loss that is more suitable for categorical features, such as binary cross-entropy. However, it is not obvious how to weigh the numerical and categorical terms of this new combined loss function against each other so as to maximally improve the learning process, and this weighing can be considered an additional hyperparameter of the model.

Alternatively, the above problem can be avoided altogether by introducing entity embeddings [16]. That is, if we first pass each of the one-hot encodings through a fully connected layer and concatenate the outputs of all of these layers together with the numerical features, the problem has successfully been translated to a numerical vector space. This concatenated numerical vector then serves as the input for the Autoencoder, where a loss function such as mean squared error is now suitable. The drawback of this approach is that it is very difficult to train such a model without the entity embeddings deteriorating to a trivial embedding. This can be alleviated somewhat by training twice, using only the numerical features as output during the first session, and freezing the embeddings for the second session to focus on the Autoencoder itself.

Another approach is to train shallow auxiliary neural networks for each of the categorical features, such that each network learns a mapping from the space of the one-hot encodings of its feature to the space containing the rest of the features. The output of the penultimate layer in such a network can then be used as an embedding with the nice property that a linear combination of its values matches the values of other features that correlate with the embedded category. Additionally, if there is some overlap between two categories in terms of how they correlate with the values of other features, their embeddings will be close together. In general there will be many more other features than the dimensionality of any single feature, so the reconstruction error of any auxiliary network will never be very low. But this is no problem, as the goal is not to find an embedding that actually fully encodes the other features (an impossible task, in general) but rather to find the embedding that encodes as many of the other features as possible. Unfortunately this approach still runs into the problem of needing a loss function that properly captures the reconstruction error for the remainder of the features, transferring this problem from the main Autoencoder to these auxiliary networks.


    5 Models

The benchmark from section 3 is used to measure the performance of multiple anomaly detection methods. These methods are listed in the following, starting with multiple variants of the Autoencoder model.

    5.1 Modifying Autoencoders for Heterogeneous Data

First, it is important to specify the exact shape of the Autoencoder. As this depends on the shape of the data, it needs to be configured separately for each dataset in the benchmark. To achieve this, we take the dimension of the input vector d and heuristically set the dimension of the encoding to the rounded square root of the input dimension, d_e = ⌊√d + 0.5⌋. This forces the model to learn a representation of the input in fewer dimensions, but not so few that it becomes impossible to distinguish between instances. The dense layers between the input and the encoding layer are calculated outwards, multiplying the encoding dimension by a constant factor k up to the largest multiple less than d. These layers are mirrored to form the decoder, with an additional final layer of dimension d. For example, if d = 100 and k = 3 then d_e = 10 and the dimensions of the Autoencoder layers are 90, 30, 10, 30, 90, then 100. Another example is given in figure 1. The value k will be called the layer factor, and it is a hyperparameter of the model.

Figure 1: Autoencoder with three numerical features, v1, v2, v3 ∈ R, and two categorical features, Blue ∈ {Red, Blue, Green} and Yes ∈ {Yes, No}. After one-hot encoding, d = 8 and d_e = 3. With k = 2 the resulting layer dimensions are 6, 3, 6, then 8.
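The layer-size heuristic can be sketched as follows; this is an illustrative reconstruction of the rule described above, not the thesis code:

    import math

    def autoencoder_layer_dims(d, k):
        # Encoding size is the rounded square root of the input dimension d; the
        # hidden sizes grow outwards by the layer factor k up to the largest
        # multiple below d, and the decoder mirrors the encoder with a final
        # layer of size d.
        d_e = int(math.floor(math.sqrt(d) + 0.5))
        hidden, size = [], d_e * k
        while size < d:
            hidden.append(size)
            size *= k
        return list(reversed(hidden)) + [d_e] + hidden + [d]

    # autoencoder_layer_dims(100, 3) == [90, 30, 10, 30, 90, 100]   (example above)
    # autoencoder_layer_dims(8, 2)   == [6, 3, 6, 8]                (figure 1)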


As discussed in section 4 there are multiple ways to modify the Autoencoder model for heterogeneous data. We will test six different variants, including the base Autoencoder using only the one-hot encoding for categorical features.

One of the modified models uses entity embeddings. Each one-hot encoded feature is fed into a single dense layer of dimension equal to the rounded square root of the number of classes of that feature (i.e. the length of the one-hot encoding). The outputs of each of these dense layers are concatenated and combined with the numerical features. The resulting vector serves as the input for the Autoencoder. See figure 2 for an example structure.

Figure 2: Autoencoder with entity embeddings, using the same example data as figure 1.
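A sketch of this entity-embedding front end in Keras (the thesis Autoencoders are implemented in TensorFlow, per section 6); the helper and its signature are hypothetical, and only the input side of the model is shown:

    from tensorflow.keras import layers

    def entity_embedding_inputs(cat_cardinalities, n_numeric):
        # One dense layer (no activation) of size round(sqrt(|C_i|)) per one-hot
        # encoded categorical feature; outputs are concatenated with the numerical
        # features to form the Autoencoder input.
        cat_inputs, embedded = [], []
        for size in cat_cardinalities:
            inp = layers.Input(shape=(size,))                 # one-hot encoding o_i
            embedded.append(layers.Dense(max(1, round(size ** 0.5)))(inp))
            cat_inputs.append(inp)
        num_input = layers.Input(shape=(n_numeric,))          # numerical features v
        merged = layers.Concatenate()(embedded + [num_input]) if embedded else num_input
        return cat_inputs + [num_input], merged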

Another modification is to use auxiliary neural networks to embed the one-hot encodings. A network is trained for each of the categorical features as an encoder of the remaining features. When the instances in the training set are fed into such a network, only the one-hot encoding of the feature at hand is propagated through the network. The dimension of the output is equal to that of the remaining numerical features plus the encodings of the other categorical features. The loss function used measures the reconstruction error compared to these remaining features. Each auxiliary network consists of three layers. The dimension of the last layer is already evident. The dimension of the middle layer is the rounded square root of the input dimension, and the dimension of the first layer is found by linear interpolation between the input dimension and the dimension of the middle layer. Figure 3 contains an example structure of such a model with an auxiliary network.

Figure 3: Autoencoder with an auxiliary network, with three numerical features, v1, v2, v3 ∈ R, and one categorical feature, C2 ∈ {C1, . . . , C5}. As such, the output of the auxiliary network is three-dimensional. Concatenating the embedding and the numerical features results in d = 5 and d_e = 2. With k = 2 the resulting layer dimensions of the Autoencoder are 4, 2, 4, then 5.

To introduce nonlinearity to each of these three Autoencoder variants, all layers except for the last layer in the network are followed by a ReLU activation function. The same is true for the auxiliary networks. The entity embedding layer has no activation.

Finally, we define three different loss functions to use with the Autoencoder model. All three functions are equivalent for a dataset consisting only of numerical features, but not when categorical features are introduced. The equivalent loss for numerical features is the mean squared error between the output of the model and the input. For categorical features we might use a different loss function and combine it with the mean squared error term. One candidate for this loss function is the binary cross-entropy between the one-hot encoding and the corresponding prediction in the output vector, taking the softmax of that prediction to ensure it is a probability distribution. A second approach is to take that softmax of the prediction and consider only the predicted probability of the true class, adding it to the mean squared error loss. This can perform well when the numerical features are normalized. We will need some notation to formalize these loss functions. Say we have a dataset D consisting of k categorical and n numerical features. For an instance I = (c_1, . . . , c_k, v) of D, we have v ∈ R^n representing the numerical features and c_i ∈ C_i, where C_i is the set of all classes of the i-th categorical feature. A one-hot encoding o_i of c_i is constructed as follows. If we order C_i as C_i = {C_i^1, . . . , C_i^{|C_i|}} and c_i = C_i^j, then o_i is a vector of all zeros except for a 1 in the j-th position. If we write o_i = (o_i^1, . . . , o_i^{|C_i|}), we can write the instance I′ after one-hot encoding I as follows:

\[
I' = \left( (o_1^1, \ldots, o_1^{|C_1|}), \ldots, (o_k^1, \ldots, o_k^{|C_k|}), v \right),
\]

or simply

\[
I' = \left( o_1^1, \ldots, o_1^{|C_1|}, \ldots, o_k^1, \ldots, o_k^{|C_k|}, v \right). \tag{1}
\]

    With the above notation we can define the three loss functions.

Definition 1. Let I′ be a one-hot encoded input vector as in equation (1) and O = (p_1^1, . . . , p_k^{|C_k|}, u) an output vector in the same space. Let σ(p_i) be the softmax function of p_i, that is,

\[
\sigma(p_i)_j = \frac{e^{p_i^j}}{\sum_{m=1}^{|C_i|} e^{p_i^m}} \quad \text{for } 1 \le j \le |C_i| .
\]

The Mean Squared Error (MSE) loss between O and I′ is:

\[
\frac{1}{n + |C_1| + \cdots + |C_k|} \left( \sum_{m=1}^{n} (v_m - u_m)^2 + \sum_{i=1}^{k} \sum_{j=1}^{|C_i|} \left( o_i^j - p_i^j \right)^2 \right). \tag{2}
\]

The MSE/Cross-entropy loss between O and I′ is:

\[
\frac{1}{n} \sum_{m=1}^{n} (v_m - u_m)^2 \;-\; w \sum_{i=1}^{k} \frac{1}{|C_i|} \sum_{j=1}^{|C_i|} \left[ o_i^j \log\!\big(\sigma(p_i)_j\big) + \big(1 - o_i^j\big) \log\!\big(1 - \sigma(p_i)_j\big) \right]. \tag{3}
\]

The MSE/Softmax loss between O and I′ is:

\[
\frac{1}{n + k} \left( \sum_{m=1}^{n} (v_m - u_m)^2 + w \sum_{i=1}^{k} \sum_{j=1}^{|C_i|} o_i^j \big(1 - \sigma(p_i)_j\big)^2 \right). \tag{4}
\]

Note that all three loss functions are equal when no categorical features are present, i.e. k = 0. The MSE/Cross-entropy loss function is simply the sum of the MSE of the numerical features and a binary cross-entropy term for each individual categorical feature. The MSE/Softmax loss function is similar, but note that all but one o_i^j are zero, so only one entry of the softmax σ(p_i) is used, namely the one corresponding to the true class. This softmax entry is treated as if it were a numerical feature in an MSE loss.

Both MSE/Cross-entropy and MSE/Softmax have a parameter w. The value of this parameter determines the weighing between the two functions used for the numerical and categorical features respectively. It is a hyperparameter for the overarching model, as discussed in section 4.
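As an illustration, the MSE/Softmax loss of equation (4) could be written in TensorFlow (the library used for the Autoencoders, per section 6) roughly as follows. The assumed input layout (one-hot blocks first, numerical features last, as in equation (1)) and all names are ours, not the thesis implementation:

    import tensorflow as tf

    def mse_softmax_loss(y_true, y_pred, cat_sizes, n_numeric, w=1.0):
        # Equation (4): cat_sizes lists |C_i| for each of the k categorical
        # features; the one-hot blocks come first, the n numerical features last.
        k, offset, cat_terms = len(cat_sizes), 0, []
        for size in cat_sizes:
            o = y_true[:, offset:offset + size]                     # one-hot o_i
            p = tf.nn.softmax(y_pred[:, offset:offset + size])      # softmax(p_i)
            # only the true class contributes: o_i^j (1 - softmax(p_i)_j)^2
            cat_terms.append(tf.reduce_sum(o * tf.square(1.0 - p), axis=1))
            offset += size
        v = y_true[:, offset:offset + n_numeric]                    # numerical v
        u = y_pred[:, offset:offset + n_numeric]                    # reconstruction u
        total = tf.reduce_sum(tf.square(v - u), axis=1)
        if cat_terms:
            total += w * tf.add_n(cat_terms)
        return total / (n_numeric + k)                              # per-instance loss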

MSE/Cross-entropy and MSE/Softmax are only useful when applied to the one-hot encodings themselves, so if the encodings have been embedded using either entity embeddings or auxiliary networks as embedders, MSE loss is used instead. The composite loss functions can, however, still be used to measure the reconstruction error of the auxiliary networks. The six Autoencoder variants we test on heterogeneous data are as follows (using auxiliary networks in tandem with the MSE/Softmax loss was not included due to time constraints):

    • Base Autoencoder (MSE loss)

    • MSE/Cross-entropy loss

    • MSE/Softmax loss

    • Using entity embeddings

    • Using auxiliary networks with MSE loss

    • Using auxiliary networks with MSE/Cross-entropy loss

    5.2 Alternative Anomaly Detection Methods

The Autoencoder models are also compared to three other anomaly detection methods. These have been selected as they are all well-cited, known to perform well in general, and used in many different applications of anomaly detection. The algorithms that are compared are as follows:

    • Local Outlier Factor

    • Isolation Forest

    • One-Class Support Vector Machine

Each of these methods works on numerical data. To handle categorical data, we employ one-hot encoding for each of the models.

Local Outlier Factor (LOF) [5] is perhaps the most cited method for anomaly detection. It is a distance-based method that works by considering local point neighborhoods. If a point is far away from all of its nearest neighbors, but those neighbors are comparatively not far apart from each other, then the first point can be considered an outlier. This model depends on a single parameter: the number of neighbors to consider.

Isolation Forest [6] is a popular approach to anomaly detection based on the observation that anomalies tend to be “isolated” from other points. The degree of isolation is measured directly as the average number of partitions needed by a random decision tree to completely isolate the data-point. Computing this average number of partitions over multiple decision trees gives us a forest. A parameter of this model is therefore the size of the forest: the number of decision trees to use.

In the original paper, Liu et al. specify a second parameter ψ that determines the subsample size of data used to train each decision tree. They note that increasing the subsample size beyond a certain value (depending on the dataset) does not significantly increase performance further, but they also show that performance does not deteriorate by increasing the subsample size. So to avoid having this second parameter we instead use the entire dataset, at the cost of some computation time.

One-Class Support Vector Machine (OC-SVM) [7] is an approach that has been studied extensively. It works by estimating the distribution of normal data to obtain a decision boundary around this distribution. Any point that lies beyond this boundary can then be considered anomalous. The method depends on a single parameter ν ∈ (0, 1), which is simultaneously an upper bound on the predicted fraction of anomalies in the dataset.
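For illustration, the three reference methods can be instantiated from Scikit-learn (which, per section 6, is also what the thesis uses for them); the parameter values here are placeholders rather than the tuned configurations:

    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor
    from sklearn.svm import OneClassSVM

    models = {
        # max_samples=1.0 trains every tree on the full dataset, as argued above
        "Isolation Forest": IsolationForest(n_estimators=100, max_samples=1.0),
        # novelty=True lets the fitted LOF model score data it has not seen
        "Local Outlier Factor": LocalOutlierFactor(n_neighbors=20, novelty=True),
        "One-Class SVM": OneClassSVM(nu=0.01),
    }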

In recent years, researchers have started using Autoencoders in conjunction with OC-SVM models [24]. The idea is to use the Autoencoder as an unsupervised feature extractor, reducing the dimensionality of the OC-SVM input by using the output of the encoding part of the Autoencoder. In contrast to this relatively new idea of combining OC-SVM with Autoencoder-based dimensionality reduction, Autoencoders themselves can of course also be used directly as an anomaly detection method [2, 25], which is how they are used in this thesis.

    5.3 Anomaly Score & Thresholding

Of all of the models we have discussed, only OC-SVM predicts a binary yes or no as to whether a particular datapoint is an anomaly. The other models all predict a so-called anomaly score. This is a real number that gives a measure of how anomalous the model thinks a particular point is. Say that a higher score equates to a more anomalous datapoint. To turn this score into a binary prediction, there is a need to find a threshold value such that any point with a score above the threshold is labeled as anomalous. This leads to the question of how this threshold is to be found. For Isolation Forest we can refer to the analysis in the original paper [6] for determining the threshold. For LOF and Autoencoder models we need another approach.

To determine a thresholding method, recall that all dataset instances were generated such that no anomalous points are added to the training data (which translates to an assumption that the training data is “clean” in a real-world setting). With this in mind, suppose we have a model that perfectly separates the scores of normal points from those of anomalous points. The best thresholding method to make use of this perfect model, with the knowledge (or under the assumption) that the training set is clean of anomalies, would simply be to use the maximum anomaly score from the training data as the threshold. This is the thresholding method we use for the experiments.

However, rather than using the maximum over the entire training data, part of the training data is held out for the purpose of finding the threshold. The reason for this is that no model is perfect in practice, and thus data that has never been seen by the model but is nonetheless normal can have higher anomaly scores than the training data itself. The maximum anomaly score of this holdout set will therefore be higher, and hopefully provide a buffer that reduces the number of false positives. We have set this holdout set somewhat arbitrarily to 10% of the training data. This also means that both LOF and Autoencoder models have 10% less data to use for training than Isolation Forest and OC-SVM models, because that data is reserved for thresholding.
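A sketch of this thresholding scheme; `model` is assumed to expose `fit()` and an `anomaly_score()` method, both hypothetical names rather than the thesis API:

    import numpy as np

    def fit_with_threshold(model, X_train, holdout_fraction=0.1, seed=0):
        # Train on 90% of the (assumed anomaly-free) training data and use the
        # maximum anomaly score on the held-out 10% as the decision threshold.
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X_train))
        n_holdout = max(1, int(holdout_fraction * len(X_train)))
        holdout, train = X_train[idx[:n_holdout]], X_train[idx[n_holdout:]]
        model.fit(train)
        threshold = model.anomaly_score(holdout).max()
        return model, threshold

    # a test point is then flagged as anomalous when its score exceeds the threshold:
    # is_anomaly = model.anomaly_score(X_test) > threshold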

The anomaly scores are also needed directly if we want to compute the AUROC metric. But OC-SVM functions without anomaly scores, using the decision boundary that is constructed as a sort of threshold instead. To imitate an anomaly score, we use the Euclidean distance from that decision boundary as a proxy to compute AUROC, with a negative sign whenever a point falls within the boundary. If we consider this as an anomaly score, the threshold is thus at 0.
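With scikit-learn's OneClassSVM, one way to approximate such a proxy score is to negate `decision_function`, which returns a signed distance to the learned boundary (positive inside, negative outside); this is a sketch of the idea, not necessarily the exact distance computation used in the thesis:

    from sklearn.metrics import roc_auc_score
    from sklearn.svm import OneClassSVM

    ocsvm = OneClassSVM(nu=0.01).fit(X_train)     # X_train, X_test, y_test assumed given
    scores = -ocsvm.decision_function(X_test)     # positive outside the boundary
    auroc = roc_auc_score(y_test, scores)         # y_test: 1 marks a true anomaly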


6 Experiments & Evaluation

The anomaly detection benchmark and models were implemented in Python. For the LOF, Isolation Forest, and OC-SVM models, as well as miscellaneous algorithms such as one-hot encoding and Nyström kernel logistic regression, we used implementations from the Scikit-learn library [26]. The various Autoencoder models were implemented using Tensorflow [27].

For every model except Isolation Forest, all numerical features of dataset instances were scaled to have zero mean and unit variance. This prevents individual features from dominating others that might be orders of magnitude smaller in the original data. The reason to exclude Isolation Forest from this preprocessing step is that, unlike the other models, it does not depend on a distance metric across multiple features/dimensions. More specifically, because the decision trees only split along one dimension at a time, there is no need to normalize the features.
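This preprocessing step corresponds to standard scaling, e.g. with scikit-learn (a sketch; `X_train_num` and `X_test_num` are hypothetical arrays holding only the numerical features):

    from sklearn.preprocessing import StandardScaler

    # Fit the scaler on the training split only, then apply it to both splits,
    # so that every numerical feature has zero mean and unit variance.
    scaler = StandardScaler().fit(X_train_num)
    X_train_num = scaler.transform(X_train_num)
    X_test_num = scaler.transform(X_test_num)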

    6.1 Parameter Search

To determine the best configuration for each of the models, a hyperparameter search was performed using all datasets except those held out for evaluation. We used the accuracy metric to determine the best configuration. The candidate values for the parameters of LOF and Isolation Forest are based on the values recommended in the literature. For every model, the performance of each configuration is measured by the average accuracy across all 103 dataset instances, and the best performing configuration based on this metric is selected. All of the tested hyperparameter settings, as well as the resulting best configurations, can be found in table 2.

The six Autoencoder variants are equivalent for datasets that consist exclusively of numerical features. Moreover, the hyperparameter w that determines the weighing between numerical and categorical loss terms is meaningless for such datasets. It follows that the differences between the best configurations of the Autoencoder models are caused only by the four datasets that do contain categorical features. When it comes to evaluating the models on the other datasets that have been held out during the parameter search, there is no need to test all six variants on the subset of those datasets that contain no categorical features. But a choice, representative of the parameter search results, still needs to be made for the layer factor parameter. As three of the six variants report that the best layer factor is k = 2 when heterogeneous datasets are taken into consideration, one of these variants importantly being the baseline Autoencoder with MSE loss, we settle on this value of k = 2 for the layer factor when we evaluate on the held-out datasets.


Table 2: Candidate values of hyperparameters for each of the models. Values in boldface are the best configurations tested. For those models with two parameters, each possible combination of the listed values was tested.

Model | Hyperparameters | Candidate values
Isolation Forest | #Decision trees | 50, 75, 100, 125, 150
One Class SVM | ν | 0.001, 0.01, 0.1, 0.5
Local Outlier Factor | #Neighbors | 10, 20, 30, 40, 50
Autoencoder, MSE | Layer factor | 2, 3, 4, 5
Autoencoder, MSE/Cross-entropy | Layer factor k, Loss weight w | k = (2, 3, 4), w = (1.0, 1.5)
Autoencoder, MSE/Softmax | Layer factor k, Loss weight w | k = (2, 3, 4), w = (1.0, 1.5)
Aux. Embed. Autoencoder, MSE | Layer factor | 2, 3, 4, 5
Aux. Embed. Autoencoder, MSE/Cross-entropy | Layer factor k, Loss weight w | k = (2, 3, 4, 5), w = (1.0)
Entity Embed. Autoencoder | Layer factor | 2, 3, 4, 5

    6.2 Dataset Distribution

The parameter search was conducted to find optimal hyperparameter settings for each of the models, but the results also allow us to inspect the benchmark itself. If we assume that all of the hyperparameter configurations are at least reasonable, we can aggregate the performance of those configurations on each individual dataset instance to investigate the benchmark.

Figures 4 and 5 show the effects of point difficulty and relative anomaly frequency on the AUROC achieved by the models. AUROC is used because it is a better metric for the difficulty of a dataset than accuracy; in addition, the accuracy will generally be higher when the relative anomaly frequency is lower, which is not the case for AUROC. Figure 4 shows an obvious downward trend in AUROC as anomalous points are taken from increasingly difficult subsets. This was also observed by Emmott et al. (cf. figure 1 in their paper [8]). The effect of relative frequency on AUROC, on the other hand, seems to be much less pronounced or altogether absent. Recall that there are no anomalies in the training data and the relative frequency only affects the test set. It is therefore expected that different values of relative anomaly frequency do not affect the AUROC meaningfully.


Figure 4: The effect of point difficulty on AUROC.

Figure 5: The effect of relative anomaly frequency on AUROC.

Figure 6 shows the average AUROC achieved by the models for each of the datasets used in the hyperparameter search. Most of the datasets have quite low average AUROC, but this is acceptable if we consider that not all hyperparameter configurations perform well. More importantly, there is a lot of diversity in terms of dataset difficulty.

The datasets that have categorical features are Adult, BankMarketing, Diabetes, and Mushroom. Considering that three of them are beaten in difficulty only by the Avila dataset, and BankMarketing is not far behind, we observe that the heterogeneous datasets are relatively difficult when compared to the datasets that only have numerical features. The Mushroom dataset, which is the only dataset consisting exclusively of categorical features, further stands out as the dataset with the most variance between models and instances, by a decent margin. During testing, some model configurations were able to achieve an AUROC of 1 for this dataset while some others even performed worse than random, dropping below 0.5.

    6.3 Evaluation on Unseen Datasets

After independently determining the best hyperparameter configurations on half of the benchmark datasets, we can evaluate the models using those parameters. The other half of the datasets were held out for exactly this purpose. As previously discussed, for the numerical-only datasets only four of the nine models need to be evaluated, as the six Autoencoder variants are equivalent in this case. The results for these four models are summarized in table 3. The results for all nine models evaluated over the heterogeneous datasets can be found in table 4.


Figure 6: Difficulty of each of the datasets used in the hyperparameter search. Error bars are drawn at one standard deviation above and below the mean.

Table 3: Model performance for numerical datasets HTRU2, ISOLET, MiniBooNE, Musk, and Shuttle. Results are averages over all instances of all five datasets. ‘Accuracy 15:1’ means that false positives are weighed fifteen times as heavy as false negatives and vice versa. (Cf. section 3.1.)

Model | AUROC | Accuracy 1:1 | Accuracy 15:1 | Accuracy 1:15
Isolation Forest | .7426 | .9578 | .8010 | .9803
One Class SVM | .5710 | .9516 | .8112 | .9715
Local Outlier Factor | .8087 | .9737 | .8076 | .9971
Autoencoder, MSE | .7489 | .9732 | .8060 | .9969

In terms of AUROC, accuracy, and 1:15 weighted accuracy, Local Outlier Factor appears to be the superior model when it comes to numerical data, but the Autoencoder model is a very close second, especially in terms of the two accuracy metrics. OC-SVM shows the best 15:1 weighted accuracy, but this is in stark contrast to its poor performance in terms of the other three metrics. We can surmise that compared to the other methods, OC-SVM is somewhat overly eager to label datapoints as anomalous, to the detriment of its overall performance. Isolation Forest does not perform very well, especially considering it is known as one of the best models in the literature. This poor performance is not surprising, because Isolation Forest is far more dependent on the presence of anomalies in the training data than the alternatives. So with a benchmark constructed such that the training data is free of anomalies, it is expected that Isolation Forest does not perform as well.

The results for the datasets that contain categorical features tell a similar story to the numerical datasets. Table 4 shows that while LOF still maintains the highest AUROC, the baseline Autoencoder model with MSE loss performs best in all three accuracy metrics. Unfortunately, none of the other Autoencoder variants perform better than the baseline. Of these five, the variant with entity embeddings comes closest in terms of AUROC, while the Autoencoder with MSE/Softmax loss is the next best variant when it comes to the accuracy metrics. As was the case for the numerical-only datasets, Local Outlier Factor is again quite close to the baseline Autoencoder, although now the other five Autoencoder variants are also competitive. On the lower end, neither Isolation Forest nor One Class SVM deals well with the categorical data, as both models perform worse than all others in all four metrics.

Table 4: Model performance for heterogeneous datasets Covertype, Nursery, OnlineShoppers, and Thyroid. Results are averages over all instances of all four datasets. ‘Accuracy 15:1’ means that false positives are weighed fifteen times as heavy as false negatives and vice versa. (Cf. section 3.1.) ‘AE’ means Autoencoder.

Model | AUROC | Accuracy 1:1 | Accuracy 15:1 | Accuracy 1:15
Isolation Forest | .6608 | .7524 | .6649 | .7646
One Class SVM | .6540 | .8918 | .7868 | .9064
Local Outlier Factor | .8018 | .9754 | .8192 | .9970
Autoencoder, MSE | .7877 | .9778 | .8330 | .9977
Autoencoder, MSE/Cross-entropy | .6478 | .9749 | .8108 | .9970
Autoencoder, MSE/Softmax | .6320 | .9752 | .8148 | .9975
Aux. Embed. AE, MSE | .6364 | .9749 | .8103 | .9976
Aux. Embed. AE, MSE/Cross-entropy | .6478 | .9749 | .8113 | .9974
Entity Embed. Autoencoder | .6702 | .9745 | .8107 | .9972

If we consider all of the datasets, it is clear that Local Outlier Factor and Autoencoder with MSE loss are the best tested models. Still unknown is the performance of each of the models on every individual dataset, which could lead to some heuristic for which model to choose given some properties of the dataset at hand. Figures 7 and 8 show the average AUROC and accuracy, respectively, over instances of each of the datasets containing only numerical features. Following these are figures 9 and 10, which contain the average AUROC and accuracy, respectively, over instances of each of the datasets that do contain categorical features.

The first results to stand out are the extremely low AUROC values of One Class SVM for Musk and Shuttle in figure 7, and of multiple Autoencoder models for Covertype, Nursery, and OnlineShoppers in figure 9. Despite poor AUROC values, some models still achieve accuracy relatively close to other models with higher AUROC. This is the case for all of the Autoencoder variants for datasets with categorical features in figures 9 and 10. The applications of Local Outlier Factor and Autoencoder models to the MiniBooNE instances are also curious, resulting in low AUROC but relatively high accuracy.

The reason why the Autoencoder models still obtain relatively high accuracy scores despite an AUROC below 0.5 in some cases is the chosen thresholding method. Since datapoints are considered anomalous by these models only if their score exceeds the score of the highest ranked normal point in the training data, the resulting behavior is that almost no points are classified as anomalous at all. And because anomalies are sparse, accuracy remains relatively high. This is particularly true of the four models that report worse-than-random AUROC for the Nursery dataset. This dataset happens to share the property of the Mushroom dataset that it consists exclusively of categorical features. Closer inspection of the results for Mushroom reveals that the poor results of the same four models indeed cause the large variance in AUROC, indicating a pattern that these four models as tested do not work well for such datasets.

    Figure 10 shows the primary reason why the baseline Autoencoder with MSE loss reports higher accuracy than the other models: despite near-identical accuracy on three of the datasets, it far outperforms all other models on instances of the Nursery dataset.

    Finally, going back to figure 8, the datasets for which Local Outlier Factor reports a meaningfully higher accuracy than the Autoencoder are ISOLET and Musk. These datasets are both very small (fewer than 8,000 points) yet have relatively many features (617 and 166, respectively), indicating that the Autoencoder model struggles with datasets of this kind in comparison to Local Outlier Factor.

    Everything considered, the two best tested models are the Autoencoder with MSE loss and Local Outlier Factor. This result somewhat echoes the similar finding by Škvára et al. [12]. Both models have their strengths and weaknesses. In general, LOF performs better for numerical-only datasets, while the Autoencoder is the superior choice when categorical features are present. A weakness of the Autoencoder, or conversely a strength of LOF, lies in the observation that the Autoencoder model works best when there is comparatively a lot of data available relative to the dimensionality of the dataset, whereas the Local Outlier Factor approach still works for smaller datasets with relatively many features.
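    Phrased as a rough selection heuristic, and only as a sketch implied by these results (the cut-off of 50 samples per feature below is a hypothetical placeholder, not a value derived in this thesis), the choice could look as follows.

```python
# Rough model-selection rule of thumb suggested by the results above.
# The 50-samples-per-feature cut-off is a hypothetical placeholder.
def choose_model(n_samples: int, n_features: int, has_categorical: bool) -> str:
    if has_categorical:
        # The MSE Autoencoder was the best tested model on heterogeneous data.
        return "Autoencoder (MSE loss)"
    if n_samples / n_features < 50:
        # Small, high-dimensional numerical data (cf. ISOLET, Musk): LOF wins clearly.
        return "Local Outlier Factor"
    # Otherwise both are competitive, with LOF holding a slight overall edge
    # on numerical-only data.
    return "Local Outlier Factor"

print(choose_model(n_samples=7800, n_features=617, has_categorical=False))
print(choose_model(n_samples=100000, n_features=12, has_categorical=True))
```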

    Figure 7: Average AUROC achieved by each model over dataset instances of each of the held-out numerical-only datasets.

    Figure 8: Average accuracy achieved by each model over dataset instances of each of the held-out numerical-only datasets.


    Figure 9: Average AUROC achieved by each model over dataset instances of each of the held-out datasets containing categorical features.

    Figure 10: Average accuracy achieved by each model over dataset instances of each of the held-out datasets containing categorical features. Note that for the Nursery dataset, the accuracy of Isolation Forest and One Class SVM is missing from the figure; these scores are .0241 and .6348, respectively, and are left out to better show the more competitive results.


    7 Conclusion

    We constructed a benchmark for the evaluation of anomaly detection methods. The benchmark consists of 197 instances of varying difficulty and anomaly frequency, derived from 19 real-world datasets. We used this benchmark to evaluate the performance of the Autoencoder model for anomaly detection, under the condition that the initial training data is free of anomalies (also known as novelty detection). The Autoencoder model was compared to three other well-known methods: Isolation Forest, One Class SVM, and Local Outlier Factor.

    Two new loss functions were defined to train the Autoencoder in the context of heterogeneous data, where at least some of the data consists of categorical features. Furthermore, two additional methods were used to embed the categorical data in a numerical space before feeding it into the Autoencoder: entity embeddings and auxiliary embedding neural networks. The various loss functions in combination with the two embedding methods result in a total of six variants of the Autoencoder model, which were compared to the other three models.
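    As an illustration of the embedding idea only (layer sizes, names, and the loss below are arbitrary choices, not the architecture evaluated in this thesis), the sketch builds a small Keras autoencoder in which one categorical feature passes through an entity-embedding layer and is concatenated with the numerical features before encoding; the reconstruction target would then be this concatenated representation.

```python
# Illustrative sketch: feeding one categorical feature into an autoencoder via
# an entity-embedding layer. Sizes, names and loss are arbitrary choices,
# not the exact architecture evaluated in this thesis.
import tensorflow as tf
from tensorflow.keras import layers, Model

n_numeric = 8        # number of numerical features (example value)
n_categories = 12    # cardinality of the categorical feature (example value)
embed_dim = 4        # embedding width (example value)

num_in = layers.Input(shape=(n_numeric,), name="numeric")
cat_in = layers.Input(shape=(1,), dtype="int32", name="categorical")

# Entity embedding: map each category index to a dense vector.
cat_emb = layers.Flatten()(layers.Embedding(n_categories, embed_dim)(cat_in))

x = layers.Concatenate()([num_in, cat_emb])          # heterogeneous input as one vector
encoded = layers.Dense(4, activation="relu")(x)      # bottleneck
decoded = layers.Dense(n_numeric + embed_dim)(encoded)

autoencoder = Model(inputs=[num_in, cat_in], outputs=decoded)
autoencoder.compile(optimizer="adam", loss="mse")    # reconstruct the embedded representation
autoencoder.summary()
```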

    Of the 19 datasets in the benchmark, 10 were used to perform a hyperparameter search for each of the 9 models, leaving us with the best-performing configuration for each model. The remaining 9 datasets were used to evaluate the models. Isolation Forest and One Class SVM are not able to perform on the same level as the Autoencoder and Local Outlier Factor models. Of the six Autoencoder variants, the baseline using a Mean Squared Error loss function with no additional embedding of categorical features performs best. All of the other variants, except the variant using entity embeddings, perform especially poorly for datasets consisting exclusively of categorical features. The Autoencoder with MSE loss and Local Outlier Factor have similar performance, with Local Outlier Factor having a slight edge for datasets with no categorical data, and the Autoencoder being the best tested model for datasets that do contain categorical features.


    8 Discussion & Future Work

    The conclusion that the Autoencoder with MSE loss is only slightly better than Local Outlier Factor, and moreover that none of the other variants were able to outperform the baseline Autoencoder, is somewhat disconcerting. There are, however, multiple possible reasons for this that do not involve the variants simply being inferior.

    First, AUROC below 0.5 occurs multiple times in the results for some of the Autoencoder variants. But this is not the case across the board: the variants are capable of learning the patterns in other datasets. This warrants an investigation into why these models are not able to learn the patterns in the particular datasets where they fall flat. Such an investigation would likely involve extending the benchmark with datasets of similar characteristics.

    Second, the shape of the underlying autoencoding neural network was determined heuristically. It could be the case that this shape works well for the baseline Autoencoder, but not so much for the other variants. This is supported by the fact that the parameter search resulted in differing values for the layer factor parameter. In future work, it could be interesting to test other Autoencoder shapes and their effect on the various loss functions and embedding methods.
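    For illustration only, one plausible reading of such a shape parameter is that each successive encoder layer shrinks the previous width by the layer factor until the bottleneck is reached; this is an assumption made for the sketch below, and the construction actually used in the thesis may differ.

```python
# Hypothetical reading of a single layer-factor parameter fixing the encoder
# widths; the construction used in the thesis may differ.
def encoder_widths(n_features: int, layer_factor: float, bottleneck: int):
    widths, width = [], int(round(n_features * layer_factor))
    while width > bottleneck:
        widths.append(width)
        width = int(round(width * layer_factor))
    return widths + [bottleneck]

print(encoder_widths(n_features=100, layer_factor=0.5, bottleneck=8))  # [50, 25, 12, 8]
print(encoder_widths(n_features=100, layer_factor=0.7, bottleneck=8))  # deeper, wider encoder
```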

    Further future work includes the addition of more datasets in general, in order to draw conclusions with higher confidence. Also, a seventh Autoencoder variant that was left out could be included, namely the Autoencoder model with auxiliary networks using MSE/Softmax loss. Although the results of the variants other than the baseline Autoencoder do not bode well for this seventh candidate, the comparison could still be carried out in full.

    Finally, there are still more data types than numerical and categorical. An interesting challenge is to further modify the Autoencoder to allow, for example, time-series and body-of-text components within the same dataset.


    References

    [1] Douglas M. Hawkins. Identification of Outliers. Springer, 1980.

    [2] Raghavendra Chalapathy, Aditya Krishna Menon, and Sanjay Chawla. Anomaly detection using one-class neural networks. arXiv preprint arXiv:1802.06360, 2018.

    [3] Simon Hawkins, Hongxing He, Graham Williams, and Rohan Baxter. Outlier detection using replicator neural networks. In International Conference on Data Warehousing and Knowledge Discovery, pages 170–180. Springer, 2002.

    [4] Cubonacci: Machine learning lifecycle management, 2020. https://cubonacci.com.

    [5] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104, 2000.

    [6] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, pages 413–422. IEEE, 2008.

    [7] Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Alex J. Smola, and Robert C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.

    [8] Andrew F. Emmott, Shubhomoy Das, Thomas Dietterich, Alan Fern, and Weng-Keen Wong. Systematic construction of anomaly detection benchmarks from real data. In Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description, pages 16–21, 2013.

    [9] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):1–58, 2009.

    [10] Raghavendra Chalapathy and Sanjay Chawla. Deep learning for anomaly detection: A survey. arXiv preprint arXiv:1901.03407, 2019.

    [11] Guilherme O. Campos, Arthur Zimek, Jörg Sander, Ricardo J. G. B. Campello, Barbora Micenková, Erich Schubert, Ira Assent, and Michael E. Houle. On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Mining and Knowledge Discovery, 30(4):891–927, 2016.

    [12] Vít Škvára, Tomáš Pevný, and Václav Šmídl. Are generative deep models for novelty detection truly better? In SIGKDD ODD v5.0 Workshop, 2018.

    [13] Amol Ghoting, Matthew Eric Otey, and Srinivasan Parthasarathy. LOADED: Link-based outlier and anomaly detection in evolving data sets. In Fourth IEEE International Conference on Data Mining (ICDM'04), pages 387–390. IEEE, 2004.

    [14] Ke Zhang and Huidong Jin. An effective pattern based outlier detection approach for mixed attribute data. In Australasian Joint Conference on Artificial Intelligence, pages 122–131. Springer, 2010.

    [15] John T. Hancock and Taghi M. Khoshgoftaar. Survey on categorical data for neural networks. Journal of Big Data, 7:1–41, 2020.

    [16] Cheng Guo and Felix Berkhahn. Entity embeddings of categorical variables. arXiv preprint arXiv:1604.06737, 2016.

    [17] Ting Chen, Lu-An Tang, Yizhou Sun, Zhengzhang Chen, and Kai Zhang. Entity embedding-based anomaly detection for heterogeneous categorical events. arXiv preprint arXiv:1608.07502, 2016.

    [18] Tung Kieu, Bin Yang, and Christian S. Jensen. Outlier detection for multidimensional time series using deep neural networks. In 2018 19th IEEE International Conference on Mobile Data Management (MDM), pages 125–134. IEEE, 2018.

    [19] Dheeru Dua and Casey Graff. UCI Machine Learning Repository, 2017. http://archive.ics.uci.edu/ml.

    [20] Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits, 2010. http://yann.lecun.com/exdb/mnist/.

    [21] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682–688, 2001.

    [22] Usama Fayyad and Keki Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In IJCAI, pages 1022–1027, 1993.

    [23] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations (ICLR), Workshop Track Proceedings, 2013.

    [24] Sarah M. Erfani, Sutharshan Rajasegarar, Shanika Karunasekera, and Christopher Leckie. High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recognition, 58:121–134, 2016.

    [25] Tsatsral Amarbayasgalan, Bilguun Jargalsaikhan, and Keun Ho Ryu. Unsupervised novelty detection using deep autoencoders with density based clustering. Applied Sciences, 8(9):1468, 2018.

    [26] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

    [27] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. Software available from https://www.tensorflow.org/.
