
Linköpings universitet, SE–581 83 Linköping, +46 13 28 10 00, www.liu.se

Linköping University | Department of Computer and Information Science

Bachelor’s thesis, 15 ECTS | statistics

LIU-IDA/STAT-G--19/001–SE

Anomaly Detection in Categorical Data with Interpretable Machine Learning – A random forest approach to classify imbalanced data

Ping Yan

Supervisor: Annika Tillander
Examiner: Linda Wänström
External supervisor: David Asgrimsson


Abstract

Metadata refers to "data about data", which contains information needed to understand the process of data collection. In this thesis, we investigate whether metadata features can be used to detect broken data and how a tree-based interpretable machine learning algorithm can be used for effective classification. The goal of this thesis is two-fold. Firstly, we apply a classification schema using metadata features for detecting broken data. Secondly, we generate the feature importance rate to understand the model's logic and reveal the key factors that lead to broken data.

The given task from the Swedish automotive company Veoneer is a typical problem of learning from an extremely imbalanced data set, with 97 percent of the data belonging to healthy data and only 3 percent to broken data. Furthermore, the whole data set contains only categorical variables on nominal scales, which brings challenges to the learning algorithm. Handling the imbalance problem for continuous data is relatively well studied, but for categorical data the solution is not straightforward.

In this thesis, we propose a combination of tree-based supervised learning and hyper-parameter tuning to identify the broken data in a large data set. Our method is composed of three phases: data cleaning, which eliminates ambiguous and redundant instances; supervised learning with a random forest; and lastly a random search for hyper-parameter optimization applied to the random forest model.

Our results show empirically that a tree-based ensemble method together with a random search for hyper-parameter optimization improves random forest performance in terms of the area under the ROC curve. The model achieved an acceptable classification result and showed that metadata features are capable of detecting broken data and of providing an interpretable result by identifying the key features of the classification model.

Keywords: machine learning, supervised learning, decision tree, imbalanced data,anomaly detection, categorical variable, ensemble method, random forest


Sammanfattning

Metadata refers to "data about data" and contains the information needed to understand the data collection process. This study investigates whether metadata variables can be used to detect broken data and how a tree-based interpretable machine learning algorithm can be used for effective classification. The purpose of this study is twofold: firstly, a classification scheme using metadata variables is applied to identify broken data; secondly, the model's logic is examined by investigating which variables play the largest roles in the classification.

The data set from the client Veoneer is a typical problem of learning from an extremely imbalanced data set, with 97 percent of the data belonging to healthy data and only 3 percent to broken data. Furthermore, the whole data set contains only categorical variables on nominal scales, which poses challenges for the learning algorithm. Handling the imbalance problem for continuous data is relatively well studied, but for categorical data the solution is not straightforward.

This study proposes a combination of tree-based supervised learning and hyper-parameter tuning to identify the broken data in a large data set. Our method consists of three phases: data cleaning, which eliminates ambiguous and redundant observations; supervised learning with random forest; and finally a random search for hyper-parameter optimization applied to the random forest model.

The results show empirically that a tree-based ensemble method together with a random search for hyper-parameter optimization improved the random forest model in terms of the area under the ROC curve. The model achieved an acceptable classification result and showed that metadata features can detect broken data and provide an interpretable result by identifying the key features of the classification model.

Keywords: machine learning, supervised learning, decision tree, imbalanced data, anomaly detection, categorical variable, ensemble methods, random forest


Machine Learning Terminology

Classifier A classification technique that builds classification models from a data set

Data set A collection of data objects

Feature, or Variable Characteristics that describe a data object. In statistics a feature is often named an attribute, characteristic, field, or dimension

Generalization error The number of misclassification errors committed on the test set

Inputs In statistical literature the inputs are often called the predictors, explanatory variables, or independent variables

Instance An instance is a single data object, also called a record or an observation

Iteration The repeated application of a learning algorithm on a data set

Outputs The outputs in statistics are called the responses, targets, or classically the dependent variables

Test set The part of the data set used to test the model, consisting of records whose class labels are unknown to the model

Training set The part of the data set used to train the model, consisting of records whose class labels are known


Acknowledgments

First and foremost, I would like to thank David Asgrimsson and team Se7en at Veoneer for giving me the opportunity and resources needed to investigate the subject of this thesis. This experience within RA simulation really contributed to shaping my vision of data engineering in the automotive industry. I would also like to extend sincere gratitude and great thanks to my supervisor Annika Tillander at Linköping University, for all the invaluable advice and support in both theoretical and practical questions throughout this work.

I also want to thank my examiner Linda Wänström, who has guided and helped me with the thesis, and my opponent Min Liu for the constructive comments and insights. My special thanks go to Isak Hietala, who offered me a place in this program three years ago. As the teacher of my first statistics course, he was the person who introduced me to the world of logical thinking; moreover, it was through his data mining course that I developed my interest in machine learning and decided to carry out a thesis in this field. I am grateful for all your help, and thank you all.


Contents

Abstract
Sammanfattning
Machine Learning Terminology
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
1.1 Background
1.2 Objectives
1.3 Problem formulation
1.4 Related work
1.5 Social and ethical aspects
1.6 Delimitation
1.7 Software and programming languages

2 Data
2.1 Data profiling
2.2 Data preparing
2.3 Descriptive statistics
2.4 Data processing

3 Method
3.1 Supervised learning
3.2 Learning with imbalanced data
3.3 Evaluating the performance of a classifier
3.4 Decision Tree
3.5 Ensemble methods
3.6 Hyper-parameter optimization
3.7 Feature importance rate
3.8 Modelling process

4 Results
4.1 AUC-optimized random forest
4.2 Accuracy-optimized random forest
4.3 Feature importance rate

5 Discussion
5.1 Data
5.2 Method
5.3 Limitations
5.4 The work in a wider context

6 Conclusion

Bibliography


List of Figures

2.1 Metadata in JSON format: an example
2.2 Data cleaning in steps
2.3 Levels of all variables
2.4 Proportion of missing values inside all variables
2.5 Class distribution of variable 1 to 8
2.6 Class distribution of variable 9 to 16
2.7 Class distribution of variable 17 to 24
2.8 Scatter-plot between class frequency and distribution of broken data
2.9 One-hot encoding
3.1 Classification task
3.2 Classification workflow
3.3 Class distribution in target variable y
3.4 Receiver operating characteristic (ROC) for binary classification
3.5 Five-fold cross validation
3.6 A decision tree for the example data
3.7 Random Forest workflow
3.8 Modelling workflow
4.1 ROC curve for AUC-optimized model
4.2 ROC curve for Recall-optimized model
4.3 Feature importance rate for AUC-optimized model
4.4 Feature importance rate for accuracy-optimized model
4.5 Grouped feature importance rates for accuracy-optimized model


List of Tables

2.1 Variable description
2.2 Distribution of broken and healthy data
2.3 Ambiguous data, an example
2.4 Class distribution after removing ambiguous instances
2.5 Variable description after removing ambiguous data
3.1 Confusion matrix for a binary classification problem
3.2 Example data
3.3 Splitting nominal attributes
3.4 Description of hyper-parameters for random forest classifier in Scikit-learn 0.20.3
3.5 Hyper-parameter space settings for random search
4.1 Hyper-parameters that maximize the AUC score
4.2 Confusion matrix from AUC-optimized model
4.3 Precision and recall from the model
4.4 Hyper-parameter space for the accuracy-optimized model
4.5 Confusion matrix from accuracy-optimized model
4.6 Precision and recall from the accuracy-optimized model
5.1 Distribution of broken and healthy data after resampling


1 Introduction

The amount of data available today is increasing exponentially due to technological advancements in data collection and storage. Data brings valuable information and provides insightful knowledge for analysts. For tech companies that rely on data to guide the product development process, data collection is often done at high cost and is of high importance to the company's success. The quality and usability of the collected data directly influence the development of the key product.

The focus of this thesis is the classification of broken data, data that does not meet certain quality requirements, as opposed to well-functioning healthy data. In recent years, machine learning has emerged as one of the most popular and effective approaches to solving classification problems. This thesis presents a study of a tree-based machine learning algorithm to detect broken data based on a large set of information collected about the data.

The approach of this study is not only to evaluate the model's performance, but also to interpret the results by assessing variable importance measures. The interpretation of the model will help the data collector understand which factors have a significant impact on data quality, and eventually reduce the cost of collecting low-quality data.

1.1 Background

This thesis was performed at Veoneer in Linköping, which also suggested the subject of the thesis. Veoneer is the world's largest pure-play company focused on automotive electronics safety technologies, especially Advanced Driving Assistance Systems (ADAS), a system to assist the driver during driving to increase safety for everyone on the road, and Automated Driving (AD), self-driving transport without human intervention (Inc, 2019).

In Veoneer Linköping, the key focus is developing a camera-based vision system. This active safety system uses object detection to identify objects in front of the vehicle through a camera, and helps the driver make decisions during emergencies to avoid accidents (Rosell, 2015).


Data is one of the most valuable resources for developing this vision system. Veoneer has been collecting data worldwide with test vehicles equipped with camera systems; those cameras scan the vehicle's surroundings for danger and store the information. The collected data is then archived in a database for training object detection algorithms, so that the camera-based active safety system can learn from the data to continuously identify and track potentially dangerous objects around the vehicle, warn the driver, and eventually take necessary actions when the car is in danger of a collision (Autoliv, 2018).

By 2018, Veoneer had already collected more than 15 petabytes of data from more than 100 test vehicles on the road (Veoneer, 2018). Within the huge volume of data that has been collected, issues around the quality of the collected data started to come up. Broken data that does not meet the quality requirements has been a problem and has resulted in delays to the product development process. Therefore, a genuine interest in detecting the broken data among the healthy data, and in understanding the reasons behind it, has arisen. This thesis was initiated upon this interest.

1.2 Objectives

The aim of the study is to build a tree-based classification model that predicts whether the collected data are broken or not using machine learning algorithms for classification. The study also seeks to use an explainable machine learning algorithm: a model that can explain the logic behind the classification and reveal which features have the most impact.

1.3 Problem formulation

This thesis is intended to answer the following questions:

1. Can a classification model automatically detect broken files from healthy files by learning only from metadata features?

2. Which features contribute most to the classification model? In other words, which input features are most important in determining the target variable?

1.4 Related work

The pursuit of interpretability in machine learning models has been an active research field in recent years (Vellido, Martin-Guerrero, and Lisboa, 2012). While digitization brings ever-increasing amounts of information, models using machine learning algorithms have been widely applied to handle data of different levels of complexity and diversity. The obtained models are not only meant to bring a correct answer to pattern recognition problems, but also to provide an interpretable description of how the answer is reached, such as identifying a subset of the variables with the maximal explanatory power (Goodacre, 2003).

Interpretability is a necessary requirement in numerous application fields, as experts require a clearly explainable basis from the models for their decision making tasks (Vellido, Martin-Guerrero, and Lisboa, 2012). The lack of explainability results in a lack of trust, leading to industry reluctance to adopt or deploy software analytics (Dam, Tran, and Ghose, 2018). Vellido et al. (2012) believe that interpretability is a paramount quality that machine learning methods should aim to achieve if they are to be applied in practice.

One effective way of achieving explainability is choosing a simpler model: according to Occam's Razor principle of parsimony, a model needs to be expressed in a simple way that is easy to interpret. Simple models such as decision trees, classification rules, and linear regression tend to be more explainable than complex models such as deep neural networks, SVMs, or ensemble methods (Dam, Tran, and Ghose, 2018). However, while using simple models improves explainability, it requires a sacrifice in accuracy. Dam et al. (2018) suggest that future research should explicitly address the trade-off between explainability and prediction accuracy.

Vellido et al. (2012) focus on the theme of dimensionality reduction (DR) as an efficient approach to model interpretation in machine learning, with two main approaches: feature selection, in which features are appraised individually in order to either retain or discard them, and feature extraction, in which new features are created on the basis of the original ones, such as Principal Component Analysis. By combining DR with a machine learning algorithm to create a compound method, such as a Single Layer Perceptron, researchers managed to achieve both high accuracy and interpretability in the model.

Another popular direction of research is making black-box models, such as artificial neural networks (ANNs), more explainable. Goodacre (2003) applied an evolutionary computational algorithm, genetic programming (GP), to make models from ANNs more comprehensible. GP takes the concepts of Darwinian selection and uses a tree structure to generate the desired output. Tsuchiya and Fujiyoshi (2006) describe a method for evaluating feature importance for object type classification by applying AdaBoost to choose a good subset of the features; they define a contribution ratio for each feature and measure the correlation between the contribution ratio and the classification performance in an ANN to find out which features should be chosen.

However, interpreting black-box models is beyond the scope of this thesis. The focus of this thesis is applying a tree-based ensemble method to solve a binary-valued classification problem. Tree-based methods in machine learning, also known as ensemble methods with a decision tree as the base algorithm, stand among the most effective and useful methods, capable of producing both reliable and understandable results on almost any kind of data. Random forests, a tree-based ensemble classification method, have become a major data analysis tool used with success in various scientific areas because of their capability to build accurate models and provide variable importance measures (Louppe, 2014).

Jiang et al. (2007) apply a Random Forest (RF) prediction model with combined features for classification between real and pseudo pre-miRNAs; their results show that the total prediction accuracy of the RF method was nearly 10 percent higher than that of the existing method, the Triplet-SVM classifier. Further analysis shows that the RF algorithm is one of the reasons behind this accuracy improvement.

There are various machine learning libraries implementing the RF algorithm in different programming languages. Examples of popular libraries include the Scikit-Learn package in Python and the randomForest package in R, both capable of providing variable importance measures (Louppe, 2014). Furthermore, the Boruta package in R introduces a novel RF-based feature selection method, which provides unbiased and stable selection of important and non-important attributes from a data set based on a statistical test (Kursa, Rudnicki, et al., 2010).

1.5 Social and ethical aspects

The original data set contains descriptive information about the data collection, including system and environment configuration, and one variable, driver, that indicates who was the driver of the test vehicle. This variable is renamed for the analysis in this thesis, so that no identifying values can link the information to any individual. A confidentiality agreement has been made between the author of this thesis and Veoneer, which includes that the author is not allowed to leave the data in any form accessible to an unauthorized party.

Furthermore, including the driver variable in the analysis is not meant to blame any individual for their mistakes, but rather to treat this variable as a whole, so that this thesis can present a general overview of whether human error is a possible cause of broken data, and also provide reflective input to the instructions for future data collectors.

1.6 Delimitation

As advised by the taskmaster, the study is delimited to the analysis of preselected features. Features that involve environment description (time and place) have been removed from the data set.

1.7 Software and programming languages

This thesis uses Python and R as the main programming languages. The data set was downloaded from an internal database and saved in a CSV file. Python was used for data importing, cleaning, and modelling. R was used for data visualization. The online diagram software draw.io was used to make flowcharts and process diagrams. Listed below are the main packages and libraries used in Python and R:

• dplyr is an R package for data manipulation, used in this thesis for grouping and summarizing categorical data by rows or columns, so that the summarized information can be visualized in graphs.

• ggplot2 is a data visualization package for R used to create and modify graphs. Used to create the scatter plots and bar charts in this thesis.

• matplotlib is a visualization library for Python, used to generate the ROC plot.

• Numpy is a fundamental package for scientific computing in Python. Used in this thesis for numerical calculations.

• Pandas is a Python library that provides various tools for data manipulation and data analysis. Used in this thesis for data cleaning and data transformation.

• Scikit-learn, or sklearn, is a machine learning library for Python; it provides solid implementations of a wide range of machine learning algorithms. The data modelling process, parameter tuning, and feature importance rate extraction were carried out with functions in sklearn version 0.20.3.


2 Data

This chapter describes the data processing in this thesis in detail. The data processing is divided into three stages:

1. Data profiling: the process of reviewing raw data and understanding the data structure, content, and types of variables.

2. Data preparing: the process of cleaning up raw data to remove irrelevant and duplicate features.

3. Data processing: the process of transforming data into a usable and desired form for machine learning algorithms.

2.1 Data profiling

This thesis works with a large data set that consists of approximately 1.5 million observations; all variables are nominal variables with at least one category. The data was collected between June 2017 and January 2019. The main data collection process was carried out in test vehicles equipped with specially designed data collection tools. However, the data collected from the test vehicles is not the study subject of this thesis; the information contained in this data set is referred to as metadata.

Metadata

Metadata is structured information that describes, explains, or locates an information resource. It is often called data about data or information about information. An important reason for creating metadata is to facilitate discovery and identification of relevant data. Metadata helps support archiving and preservation of the original data; users will be able to detail the data's physical characteristics and track its behavior for future reference (Riley, 2017).

The metadata in this thesis describes the data collection configuration, with a main focus on the hardware and software systems, so the user will be able to discover, for example, from which type of hardware the data was collected, which software system was used when collecting the data, and so on. This metadata is configured in an internal data collection tool for every data collection trip; the metadata is then labeled online by data collectors (drivers), and extra metadata is labeled offline by data markers.

Dat file

The data collection objective is to collect image data that captures the street scene in front of the vehicle. An internal data collection tool combines information from the cameras, radars, and other sensors mounted on the test vehicles to capture the actual road situation. The collected data arrives every 30 seconds and is stored in .dat format with a size of approximately 4 GB per file. Every .dat file has a corresponding file containing metadata about the environment and the system information.

Those .dat files can later be used to re-create the situation in a test environment or as input to tests for training the object detection algorithm. The camera system learns from the algorithm to read and discover different objects on the road, including pedestrians, children and cyclists, cars and trucks, lane markings, worldwide traffic signs, and road edges (Autoliv, 2018).

Broken data

Broken data is defined as a .dat file that failed the validation testing. Validation testing is an evaluation process that controls whether the object data actually meets certain quality characteristics. Testing is carried out through an internal validating software program.

JSON format

The metadata was configured using the JSON format. JSON stands for "JavaScript Object Notation"; it is a simple data interchange format based on JavaScript (Droettboom et al., 2015). JSON can represent all kinds of structured data in a language that is easily interchangeable with other programming languages, such as Python and R.

The basic object structure in JSON looks as follows (Droettboom et al., 2015):

Listing 2.1: Object structure in JSON

{ "key1": "value1", "key2": "value2" }

JSON supports data structures such as object, array, number, string, boolean, and null. All kinds of structured data can be represented by these simple data types (Droettboom et al., 2015).

When it comes to more complex objects, it is very useful to structure the JSON schema into parts that facilitate processing. It will also be easier to extract the necessary information from the whole file. An example of metadata in JSON format is presented in Listing 2.2; the presented data are fictitious.


Listing 2.2: Metadata in JSON format, an example

{
  "index": 1,
  "info": {
    "system": "XX 123456",
    "recording_time": "2018-12-31 12:00:00",
    "properties_list": {
      "hardware": "TypeA",
      "software": "TypeB",
      "driver": "A",
      "car": "123456"
    }
  }
}

Listing 2.2 shows a nested JSON format with parent keys and sub-keys. The search for the information under "car" in this example is carried out in the following steps (a minimal Python sketch follows the list):

1. go to the JSON file

2. find the value of the key "info"

3. within that object, find the value of the key "properties_list"

4. within that object, find the value of the key "car"
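As a minimal Python sketch of these lookup steps, using the fictitious metadata from Listing 2.2 (illustrative only, not the internal tool's implementation):

import json

# Fictitious metadata mirroring Listing 2.2.
raw = '''
{
  "index": 1,
  "info": {
    "system": "XX 123456",
    "recording_time": "2018-12-31 12:00:00",
    "properties_list": {
      "hardware": "TypeA",
      "software": "TypeB",
      "driver": "A",
      "car": "123456"
    }
  }
}
'''

metadata = json.loads(raw)            # step 1: read the JSON file
info = metadata["info"]               # step 2: value of the key "info"
props = info["properties_list"]       # step 3: value of "properties_list"
print(props["car"])                   # step 4: value of "car" -> 123456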

The internal data collection tool uses these steps to know what information to write in which part of the JSON file without modifying the whole file or affecting other parts. The keys in the JSON file were decided by the data collection manager and the values were entered by the data collector. The structure of this metadata is visualized in Figure 2.1.

[Figure 2.1: metadata in JSON format: an example. The keys form a tree: index and info at the root; info contains system, recording_time, and properties_list; properties_list contains hardware, software, driver, and car]

As presented in Figure 2.1, the metadata has a hierarchical structure in the form of a tree. For data analysis, the JSON file was converted to a flat structure: a two-dimensional data frame in Python for computational modelling.
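As a hedged sketch, such flattening can be done in pandas, whose json_normalize function expands nested keys into dot-separated column names (values fictitious):

import pandas as pd

# One parsed metadata record per .dat file (fictitious values, as in
# Listing 2.2); json_normalize flattens the nested keys into columns.
records = [
    {"index": 1,
     "info": {"system": "XX 123456",
              "properties_list": {"hardware": "TypeA", "car": "123456"}}},
]
df = pd.json_normalize(records)
print(df.columns.tolist())
# ['index', 'info.system', 'info.properties_list.hardware',
#  'info.properties_list.car']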

2.2 Data preparing

The raw data downloaded from the internal database contains approximately 1.5 million observations (rows) and 45 variables (columns). A data cleaning process has been implemented, as there are four major problems in the raw data:


• Large amounts of missing values in several columns, due to updates of the software and hardware systems; for example, data collected with a new software system does not contain configuration information for the old software system.

• Duplicate columns, due to different versions of the data collection tool; for example, the vehicle identification number is named "vin" in the old data collection tool and "VIN" in the new one, which leads to the raw data containing both a "vin" and a "VIN" column with exactly the same information.

• Duplicate information in each column, due to the manual entry of metadata by the data collector; for example, a date can be written in the form "19-01-01", "(190101)", or "19_01_01".

• Unnecessary columns which contain information that is not relevant for the purpose of the data analysis, or has no importance at all as suggested by the experts; for example, the columns "time of the day", "country code", and "comments".

Data cleaning has been done in the following steps:

1. Original data set: 45 columns.
2. Remove irrelevant columns: 6 columns removed.
3. Remove empty columns (columns with 100% missing values): 9 columns removed.
4. Remove constant columns (columns with one value for all rows and no missing values): 2 columns removed.
5. Remove duplicate columns (merge the values into another column): 3 columns removed.
6. 25 columns left; group the values inside each column to minimize duplicate information.

Figure 2.2: Data cleaning in steps

Removing the irrelevant columns was carried out under the expert guidance of the data collection director, whereas removing empty columns, removing columns with one constant value, and merging columns were performed through functions written in Python. Redundant and irrelevant features in a data set can reduce the classification accuracy of the learning algorithm. Although machine learning algorithms can be modified to handle missing values and redundant attributes, it is still considered good practice to select relevant data objects and attributes for the analysis. In this case, the goal is to improve the learning algorithm with respect to computational time, cost, and quality (Tan et al., 2014).
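A hedged pandas sketch of these cleaning steps (the file and column names are hypothetical, not Veoneer's actual identifiers):

import pandas as pd

df = pd.read_csv("metadata.csv")                      # hypothetical file name

# Step 1: drop irrelevant columns (hypothetical names).
df = df.drop(columns=["time_of_day", "country_code", "comments"],
             errors="ignore")

# Step 2: drop empty columns (100% missing values).
df = df.dropna(axis=1, how="all")

# Step 3: drop constant columns (a single value, no missing values).
constant = [c for c in df.columns if df[c].nunique(dropna=False) == 1]
df = df.drop(columns=constant)

# Step 4: merge duplicate columns, e.g. the "vin"/"VIN" pair.
if "VIN" in df.columns:
    df["vin"] = df["vin"].fillna(df.pop("VIN"))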

2.3 Descriptive statistics

Table 2.1 presents a brief description of the variables. Levels indicates the number of categories inside each variable; missing values count as a category where they exist. There are two main types of explanatory variables in this data set: "sys_" refers to system variables that concern the general environment settings, for example which data collection tool has been used, and "prop_" refers to the properties variables, the configuration information regarding the purpose of the data collection, for example vehicle, camera, or network configuration.

Table 2.1: Variable description

Variables    Levels   Proportion of missing values
sys_1        11       0
sys_2        19       0
prop_1       34       0.015706
prop_2       4        0.822966
prop_3       2        0.952635
prop_4       17       0.613958
prop_5       5        0.985946
prop_6       8        0.046545
prop_7       7        0.035582
prop_8       4        0.009983
prop_9       6        0.009983
prop_10      3        0.064832
prop_11      61       0.010001
prop_12      10       0.019562
prop_13      2        0.011659
prop_14      3        0.009983
prop_15      21       0.409603
prop_16      2        0.968566
prop_17      2        0.011659
prop_18      20       0.011659
prop_19      29       0.009982
prop_20      6        0.009982
prop_21      2        0.998283
prop_22      40       0.000084
validation   2        0

Several characteristics of the data can be read from Table 2.1. First, the number of categories inside each variable differs from one variable to another: variable prop_11 has the most categories, with 61 levels in total including missing values, while several variables have only one category excluding missing values. Second, the amount of missing values inside each variable differs greatly: four variables have more than 90 percent missing values, while the majority of variables have less than 5 percent. Furthermore, there is no connection between the number of categories and the proportion of missing values inside each variable, as shown in Figure 2.3 and Figure 2.4. The missing values shown in Figure 2.4 are the proportion of missing values inside each column.


[Figure 2.3: Levels of all variables. Bar chart of the number of levels per column, from sys_1 and sys_2 through prop_1 to prop_22 and validation]

[Figure 2.4: Proportion of missing values inside all variables. Bar chart of the missing-value proportion per column, on a 0 to 1 scale]

There are different approaches to handling missing values in machine learning. In this thesis, missing values are simply represented by a new category, "NA". Discarding missing values could lead to serious depletion of the data set; moreover, categorizing missing values might help discover whether observations with missing values behave differently than those with non-missing values (Friedman, Hastie, and Tibshirani, 2001).
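A one-line pandas sketch of this choice (the column name is illustrative):

import pandas as pd

# Every missing value becomes its own category "NA" instead of being dropped.
df = pd.DataFrame({"prop_2": ["Level 1", None, "Level 3"]})
df = df.fillna("NA")
print(df["prop_2"].tolist())   # ['Level 1', 'NA', 'Level 3']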

The class frequency, which is the number of observations falling in a particular class inside each categorical variable, also differs between variables. The following figures show the class distributions grouped by healthy data (colored blue) and broken data (colored yellow). The y axis represents the number of observations in thousands (meaning 400 000 is labeled as 400).


[Figure 2.5: Class distribution of variable 1 to 8, grouped by healthy (True) and broken (False) data]

[Figure 2.6: Class distribution of variable 9 to 16, grouped by healthy (True) and broken (False) data]

[Figure 2.7: Class distribution of variable 17 to 24, grouped by healthy (True) and broken (False) data]


Figure 2.5 to Figure 2.7 show that the number of classes inside each variable differs from variable to variable. They also show a characteristic of imbalanced class distributions within most of the variables: the majority of observations belong to a few classes, while the rest of the observations are spread out over the remaining classes.

An unequal distribution of classes within each variable usually reflects data imbalance, meaning that the class labels are not represented equally. The response variable validation has two classes: "True" indicates healthy data, while "False" means broken data. Table 2.2 presents the class distribution in the data set.

Table 2.2: Distribution of broken and healthy data

Data type   Count     Proportion
Broken      49361     0.03
Healthy     1463354   0.97

According to Table 2.2, broken data (3 percent) is relatively rare compared to healthy data (97 percent). The broken data in this data set can therefore also be seen as anomalous objects, or outliers, as they are different from the majority of objects (the healthy data).

The distribution of broken data inside each explanatory variable in most cases follows the class frequency; in other words, the most frequent class also has the most broken data. Figure 2.8 presents the relationship between the frequency of the most common class inside each variable and the proportion of broken data distributed in that class. It clearly shows that classes with higher frequency tend to have a larger proportion of broken data.

[Figure 2.8: Scatter-plot between class frequency and distribution of broken data. x axis: frequency of the most frequent class in each variable; y axis: proportion of broken data distributed in the same class]

2.4 Data processing

Data processing is the process of transforming data so that it can be analyzed by the machine learning algorithm. The idea is straightforward: transform data into a usable form. This thesis applies three preprocessing actions before building the model: first, diagnose and remove ambiguous instances from the data set; second, convert categorical variables to binary variables through one-hot encoding; lastly, divide the data set into a training set and a test set.

Removing redundant instances

Most classification schemes work well when the classes are separable. However, the data in many real-world problems may not be separable, which means there may exist regions in the feature space that are occupied by more than one class. This type of data is called ambiguous data (Trappenberg and Back, 2000). Table 2.3 shows an example of ambiguous data, where observations have exactly the same features but different class labels.

Table 2.3: Ambiguous data, an example

   x1      x2      x3       x4       y
1  typeA   typeB   modelA   modelB   True
2  typeA   typeB   modelA   modelB   True
3  typeA   typeB   modelA   modelB   True
4  typeA   typeB   modelA   modelB   False
5  typeA   typeB   modelA   modelB   False
6  typeA   typeB   modelA   modelB   True

Ambiguous data is dangerous to a classification model, as it lacks a well-defined class boundary, which may cause degradation of model performance (Gu, 2007). Excluding highly ambiguous data from the training of the classification algorithms can lead to a drastic reduction of false predictive classifications. It is therefore suggested to avoid using data which are highly ambiguous (Trappenberg and Back, 2000).
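A hedged pandas sketch of flagging such instances, reproducing the example from Table 2.3 (illustrative, not the thesis's exact code):

import pandas as pd

# The example from Table 2.3: identical features, conflicting labels.
df = pd.DataFrame({
    "x1": ["typeA"] * 6, "x2": ["typeB"] * 6,
    "x3": ["modelA"] * 6, "x4": ["modelB"] * 6,
    "validation": [True, True, True, False, False, True],
})

features = [c for c in df.columns if c != "validation"]

# Count the distinct labels seen for each feature combination; rows in
# combinations with more than one label are ambiguous.
labels_per_combo = df.groupby(features)["validation"].transform("nunique")
df_clean = df[labels_per_combo == 1]   # empty here: all six rows conflict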

By examining the data set, 20819 observations were found to be ambiguous, and they were removed from the data set. After removing those observations, the class distribution remains approximately the same, as shown in Table 2.4. However, it is also worth mentioning that approximately 8 percent of the broken data and 1 percent of the healthy data were eliminated in this process.

Table 2.4: Class distribution after removing ambiguous instances

Data type   Count     Proportion
Broken      45225     0.03
Healthy     1446671   0.97

After removing the ambiguous data, the number of classes and proportion of missing values inside each variable have changed slightly, as shown in Table 2.5.


Table 2.5: Variable description after removing ambiguous data

Variables    Levels   Proportion of missing values
sys_1        11       0
sys_2        19       0
prop_1       34       0.015925
prop_2       4        0.820496
prop_3       2        0.951975
prop_4       17       0.609256
prop_5       5        0.985750
prop_6       8        0.047194
prop_7       7        0.036078
prop_8       4        0.010122
prop_9       6        0.010122
prop_10      3        0.064288
prop_11      59       0.010140
prop_12      10       0.019834
prop_13      2        0.011822
prop_14      3        0.010122
prop_15      21       0.413871
prop_16      2        0.968128
prop_17      2        0.011822
prop_18      19       0.011822
prop_19      29       0.010121
prop_20      6        0.010121
prop_21      2        0.998259
prop_22      37       0.000085
validation   2        0

Comparing the data set before and after removing the ambiguous data shows that the removal leads to a dimension reduction in the data set. Regarding the categories inside each variable, a total of 6 feature classes have been removed. The proportion of missing values increased in 15 variables and decreased in 7 variables.

One hot encoding

This thesis uses the scikit-learn machine learning library for modelling. As scikit-learn does not currently support categorical variables, it is necessary to convert the categorical variables to numerical values that can be used with scikit-learn estimators.

This thesis encodes categorical variables using a one-hot encoding scheme. One-hot encoding creates a binary column for each category, thereby transforming a categorical variable into a group of binary variables. Taking variable sys_1 as an example, the encoding process is shown in Figure 2.9.


sys_1      sys_1_Class 1   sys_1_NA   sys_1_Class 2   sys_1_Class 3
Class 1    1               0          0               0
NA         0               1          0               0
Class 2    0               0          1               0
Class 3    0               0          0               1
Class 3    0               0          0               1

Figure 2.9: One-hot encoding
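A hedged pandas sketch of this encoding, reproducing the sys_1 example from Figure 2.9:

import pandas as pd

# The sys_1 column from Figure 2.9, missing values already recoded as "NA".
df = pd.DataFrame({"sys_1": ["Class 1", "NA", "Class 2", "Class 3", "Class 3"]})

# One binary column per category; dtype=int gives 0/1 columns.
encoded = pd.get_dummies(df, columns=["sys_1"], dtype=int)
print(encoded.columns.tolist())
# ['sys_1_Class 1', 'sys_1_Class 2', 'sys_1_Class 3', 'sys_1_NA']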

After one-hot encoding all the independent variables, the new data frame contains 312 binary independent variables, corresponding to the total number of classes (categories) inside all explanatory variables in the original data set, and 1 dependent variable (validation). The learning algorithm was built on this binary data frame.

Data partitioning

After one-hot encoding, the data set was divided into two parts, with 70 percent of the observations as the training set and 30 percent as the test set. The division was done on a random basis in order to ensure consistency between the characteristics of the data in the test set and the training set.
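A minimal scikit-learn sketch of this 70/30 split (the toy frame and random_state are illustrative; stratify=y is an assumption that keeps the 97/3 class ratio in both parts):

import pandas as pd
from sklearn.model_selection import train_test_split

# A tiny stand-in for the one-hot encoded frame (names illustrative).
data = pd.DataFrame({
    "sys_1_Class 1": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "sys_1_Class 2": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "validation":    [True] * 8 + [False] * 2,
})
X = data.drop(columns=["validation"])
y = data["validation"]

# 70 percent training, 30 percent test, drawn at random.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)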


3 Method

Choosing a proper classification method depends first on the main purpose of the problem to be solved, and second on the structure of the available data (Gibert, Sanchez-Marre, and Codina, 2010). This thesis applies a tree-based ensemble method for classification on the basis of the following facts:

1. One of the main purposes of this thesis is finding the key features that lead to broken data; therefore, an interpretable machine learning approach that can handle such a large amount of data is favored. Due to decision trees' structural simplicity, tree-based models are considered highly interpretable compared with other popular classification methods (Theodoridis, Koutroumbas, et al., 2008).

2. The data set contains only nominal variables; moreover, it is an imbalanced data set with an unequal distribution of classes. Earlier research shows that tree-based ensemble methods tend to be an optimal choice to address the class imbalance problem (Galar, Fernandez, Barrenechea, Bustince, and Herrera, 2012). This thesis is therefore an application of tried and tested methods in a new context.

This chapter starts with a theoretical introduction to supervised learning and classification. It then presents the challenges of, and the available methodologies for, learning with imbalanced data. The main part of this chapter describes the decision tree algorithm and explains the hyper-parameters in the Python implementation of the random forest algorithm.

3.1 Supervised learning

The data set, as presented in Chapter 2, contains a collection of attributes that can be divided into two groups: x, the input metadata attributes, and y, the output, a binary class label validation that tells whether the object data are broken (validation = False) or not (validation = True).

This thesis seeks to build an explanatory model to identify whether the object data are broken or not based on the information provided. This is a supervised learning task of assigning class labels to unlabeled objects using a model developed from objects with known class labels (Tan et al., 2014). The key difference between supervised learning and unsupervised learning is that the data are labeled in supervised learning. There are two types of modeling tasks in supervised learning: classification, which is used for discrete target variables, and regression, which is used for continuous target variables (Tan et al., 2014). As the class label y in this data set is a binary-valued attribute, the modeling is further defined as a classification task.

Classification

Classification is a systematic approach to map each attribute set x to the predefined class label y through a target function f, which is the classification model (Tan et al., 2014). This thesis applies a binary classification model on the metadata attributes, as shown in Figure 3.1.

[Figure 3.1: Classification task. A binary classification model maps the metadata attributes (x) to the class label (y): broken or healthy]

The general approach to solving a classification problem starts with splitting the data set into a training set and a test set. Then a classification technique (or classifier) is used to model the training set. The technique employs a learning algorithm to identify the model that best fits the relationship between the attribute set and the class label of the input data. Finally, the test set is used to evaluate the model (Tan et al., 2014). The whole process is visualized in Figure 3.2.

[Figure 3.2: Classification workflow. The data is split into a training set and a test set; the model is built and trained on the training set and then evaluated on the test set with the metrics accuracy, AUC (ROC), recall, and precision]
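A hedged scikit-learn sketch of this workflow, reusing X_train, X_test, y_train, and y_test from the partitioning sketch in Section 2.4 (the hyper-parameters are illustrative; the thesis tunes them by random search in Section 3.6):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Build and train the classifier on the training set.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the held-out test set, here with the area under the ROC.
scores = clf.predict_proba(X_test)[:, 1]   # P(validation = True) per record
print(roc_auc_score(y_test, scores))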

3.2 Learning with imbalanced data

Imbalanced data exists when one class is underrepresented in the data set (Galar, Fernandez, Barrenechea, Bustince, and Herrera, 2012). The number of instances belonging to each class differs significantly in imbalanced data. The data set used in this thesis is imbalanced: 97 percent of the data are labeled as healthy and 3 percent are labeled as broken; in other words, the broken data is relatively rare compared to the healthy data, as shown in Figure 3.3.

There are a number of problems with existing classification algorithms when dealing with imbalanced data:


1. The classifier can be heavily biased toward the majority class (Hoens and Chawla, 2013). Imbalanced data sets degrade the performance of machine learning techniques, as the overall accuracy and decision making are biased towards the majority class, which leads to misclassifying the minority class samples or even treating them as noise (Elrahman and Abraham, 2013).

2. The accuracy measure may not be well suited for evaluating models derived from imbalanced data. Measures that are used to evaluate the learning algorithm may need to be modified to focus on the rare class (Tan et al., 2014).

3. Because the rare class's instances occur infrequently, models that describe the rare class tend to be highly specialized. As a result, many of the existing classification algorithms may not effectively detect instances of the rare class (Tan et al., 2014).

Figure 3.3: Class distribution in the target variable y (counts in thousands; healthy data vastly outnumber broken data)

Numerous methods have been developed to overcome the class imbalance problem. Those proposals can be categorized into three groups (Galar, Fernandez, Barrenechea, Bustince, and Herrera, 2012):

1. The algorithm level (internal) approach: create new or modify existing algorithms.

2. The data level (external) approach: the idea is to re-balance the class distribution, in order to decrease the effect of the skewed class distribution in the learning process, by sampling-based methods. Examples are random under-sampling, which discards majority class instances to create a more balanced distribution, and random over-sampling, which copies and repeats minority class instances.

3. Cost-sensitive methods: cost-sensitive learning takes the cost of incorrect classification into consideration during model building; each class is given a misclassification cost. The goal of a cost-sensitive method is to minimize the total misclassification cost.

This thesis applies the algorithm level approach by using an ensemble method to improve the performance of the classifier. This is because both the data level approach and the cost-sensitive methods have disadvantages. While sampling techniques create more balanced distributions, they also suffer from drawbacks: under-sampling can lead to loss of important data, and over-sampling can cause drastic over-fitting (Galar, Fernandez, Barrenechea, Bustince, and Herrera, 2012). Furthermore, due to lack of prior knowledge, the misclassification costs are usually unknown and hard to choose in practice.
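For completeness, the data level approach can be illustrated with a short sketch. It is not used in this thesis; the sketch assumes a hypothetical pandas DataFrame df holding the data, with the binary label column validation described above.

```python
# Illustration only -- re-sampling is not used in this thesis. `df` is a
# hypothetical DataFrame with the binary label column `validation`.
import pandas as pd

healthy = df[df["validation"] == True]    # majority class
broken = df[df["validation"] == False]    # minority class

# Random under-sampling: discard majority instances until the classes match
under_sampled = pd.concat(
    [healthy.sample(n=len(broken), random_state=0), broken])

# Random over-sampling: copy minority instances by sampling with replacement
over_sampled = pd.concat(
    [healthy, broken.sample(n=len(healthy), replace=True, random_state=0)])
```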

3.3 Evaluating the performance of a classifier

Confusion matrix

Evaluation of the performance of a classification model is based on the number of test records correctly and incorrectly predicted by the model. The measurement of model performance is done on the test set, as it provides an unbiased estimate. These counts are tabulated in a confusion matrix. For binary classification with an imbalanced data set, the rare class is denoted as the positive class, while the majority class is denoted as the negative class (Tan et al., 2014).

Table 3.1: Confusion matrix for a binary classification problem

                                      Predicted class
                            Positive + (predicted broken)   Negative - (predicted healthy)
Actual   Positive + (broken)    TP (True positive)              FN (False negative)
class    Negative - (healthy)   FP (False positive)             TN (True negative)

The four outcomes of the confusion matrix are defined as follows (Tan et al., 2014):

• True positive (TP): the number of positive examples correctly predicted by the classification model.

• False negative (FN): the number of positive examples wrongly predicted as negative by the classification model.

• False positive (FP): the number of negative examples wrongly predicted as positive by the classification model.

• True negative (TN): the number of negative examples correctly predicted by the classification model.

Summarizing the information from the confusion matrix into a performance metric makes it easier to compare the performance of different models. The accuracy measure is the most commonly used performance metric; it measures the fraction of correct predictions over both positive and negative classes. Most classification algorithms seek models that attain the highest accuracy, which is defined as follows:

\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{TP + TN}{TP + FN + FP + TN} \]
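As a sketch of how these counts are obtained in practice, scikit-learn's confusion_matrix can be unpacked directly, reusing the hypothetical y_test and y_pred arrays from the earlier sketch (labels encoded as 0 = healthy, 1 = broken):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# With labels 0/1, ravel() on the 2x2 matrix yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

# Same definition as the formula above; matches accuracy_score(y_test, y_pred)
accuracy = (tp + tn) / (tp + fn + fp + tn)
```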

Evaluation metrics

When attempting to evaluate the performance of classification models on imbalanced data, accuracy is not a valuable evaluation metric, as it overemphasizes the performance of the majority class (Hoens and Chawla, 2013). Therefore, the following alternative metrics besides accuracy are used in this thesis:


• TPR (true positive rate), or sensitivity, is the fraction of positive examples predicted correctly by the model:

\[ TPR = \frac{TP}{TP + FN} \]

• TNR (true negative rate), or specificity, is the fraction of negative examples predicted correctly by the model:

\[ TNR = \frac{TN}{TN + FP} \]

• FPR (false positive rate) is the fraction of negative examples predicted as the positive class:

\[ FPR = \frac{FP}{TN + FP} \]

• Recall is the fraction of records of a class correctly predicted by the classifier. It is class specific; for the positive class, recall is equivalent to the true positive rate:

\[ r = \frac{TP}{TP + FN} \]

• Precision determines the fraction of records that actually turn out to be positive among those the classifier has declared as the positive class. It is also class specific; for the positive class, precision is calculated as:

\[ p = \frac{TP}{TP + FP} \]

When dealing with an imbalanced data set, a correct classification of the rare class often has greater value than a correct classification of the majority class. For example, it is probably worse to predict that data are healthy when they are actually broken than vice versa. Recall and precision are two widely used metrics in such situations (Tan et al., 2014).
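Continuing the sketch above, the class-specific metrics follow directly from the four counts, or equivalently from scikit-learn's metric functions, assuming the positive (broken) class is encoded as 1:

```python
# Hand-computed from the confusion-matrix counts above
tpr = tp / (tp + fn)        # recall / sensitivity for the positive class
tnr = tn / (tn + fp)        # specificity
fpr = fp / (tn + fp)        # false positive rate

# Equivalent scikit-learn calls
from sklearn.metrics import precision_score, recall_score
precision = precision_score(y_test, y_pred, pos_label=1)
recall = recall_score(y_test, y_pred, pos_label=1)   # equals tpr
```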

ROC and AUC measure

The receiver operating characteristic (ROC) curve is a graphical method to illustrate the trade-off between the true positive rate (recall) and the false positive rate of the classifier. In an ROC curve, the true positive rate (TPR) is plotted along the y axis and the false positive rate (FPR) is shown on the x axis. The ROC curve is useful for comparing the relative performance between different models. Each point along the curve corresponds to one of the models induced by the classifier (Tan et al., 2014).

The area under the ROC curve (AUC) provides another approach for evaluating which model is better on average. If the model is perfect, then its area under the ROC curve equals 1. If the model simply performs random guessing, then its area under the ROC curve equals 0.5. A model that is strictly better than another has a larger area under the ROC curve (Tan et al., 2014).

Figure 3.4 shows ROC curves for two models. An ideal model should have a high TPR (true positive rate) and a low FPR (false positive rate). It can be read from the left plot that model 1 has both a high TPR and a high FPR, as both values are close to 1. This indicates that model 1 manages to predict the positive instances correctly, with a high accuracy (almost 100 percent), but at the same time it misclassifies negative instances with a high error (almost 100 percent).


Figure 3.4: Receiver operating characteristic (ROC) for binary classification

The right plot shows that model 2 has a high TPR (almost 0.8) but a lower FPR (around 0.5). This means that model 2 manages to predict positive instances with a certain level of accuracy while maintaining a lower error rate when classifying negative instances. When using AUC as the model evaluation metric, model 2 is better than model 1, as it has a larger area under the ROC curve (0.65) compared to model 1 (0.53).
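A minimal sketch for producing such a plot, reusing the fitted classifier from the earlier sketches; roc_curve sweeps the decision threshold over the predicted probabilities:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Probability of the positive (broken) class for each test instance
y_score = clf.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

plt.plot(fpr, tpr, label="AUC = %.3f" % auc)
plt.plot([0, 1], [0, 1], linestyle="--")  # random guessing, AUC = 0.5
plt.xlabel("False positive rate (FPR)")
plt.ylabel("True positive rate (TPR)")
plt.legend()
plt.show()
```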

Cross validation

Cross-validation is an approach to measure the performance of the model by using part of the training data to fit the model and a different part to test it. The k-fold cross-validation method generalizes this approach by dividing the data into k equal-sized partitions. During each iteration, one of the partitions is chosen for testing, while the rest are used for training. This procedure is repeated k times so that each partition is used for testing exactly once (Tan et al., 2014).

Figure 3.5: Five-fold cross validation (in each of the five iterations, a different one of the five partitions serves as the validation set while the remaining four are used for training)

Figure 3.5 illustrates the process of five-fold cross-validation. The learning algorithm is fit on four-fifths of the training data, and the prediction error is computed on the remaining one-fifth. The total prediction error is calculated by summing up the errors for all five rounds to judge the performance of the learning algorithm.

Determining the right value of k is a critical task. When k is small, cross-validation can be biased; when k is relatively large, cross-validation can have high variance. Overall, five- or ten-fold cross-validation is recommended as a good compromise (Friedman, Hastie, and Tibshirani, 2001). This thesis used five-fold cross-validation when training the model.
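A minimal sketch of the five-fold procedure, reusing the classifier and training data from the earlier sketches; for classifiers, an integer cv in scikit-learn uses stratified folds, which preserves the class proportions in each partition:

```python
from sklearn.model_selection import cross_val_score

# Five-fold cross-validation on the training set; each partition is used
# for validation exactly once
scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="roc_auc")
print(scores.mean(), scores.std())
```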


3.4 Decision Tree

A decision tree is a hierarchical structure consisting of nodes and directed edges. There are three types of nodes in a tree (Tan et al., 2014):

1. A root node that has no incoming edges and zero or more outgoing edges.

2. Internal nodes, which have exactly one incoming edge and two or more outgoing edges.

3. Leaf or terminal nodes, which have exactly one incoming edge and no outgoing edges.

To illustrate how decision trees work, the fictitious data in Table 3.2 are used in this section:

Table 3.2: Example data

      Software   Hardware   Tool   Y
 1    TypeA      Model1     Yes    Broken
 2    TypeB      Model2     Yes    Broken
 3    TypeB      Model1     No     Broken
 4    TypeA      Model2     Yes    Broken
 5    TypeB      Model3     Yes    Healthy
 6    TypeB      Model2     No     Broken
 7    TypeA      Model3     Yes    Broken
 8    TypeB      Model1     Yes    Healthy
 9    TypeB      Model2     No     Broken
10    TypeB      Model1     Yes    Healthy

In a decision tree, a class label is assigned to each leaf node. Root nodes and internal nodes contain attribute test conditions to separate observations that have different characteristics, as illustrated in Figure 3.6.

Figure 3.6: A decision tree for the example data (the root node tests Software, internal nodes test Hardware and Tool, and the leaf nodes carry the class labels Broken or Healthy)


How to build a decision tree

Trees can be constructed from different combinations of attribute values. A decision tree learning algorithm must address the following two issues (Tan et al., 2014):

1. A splitting criterion to decide the best way to split the training observations.

2. A stopping condition to terminate the tree-growing process.

Measures for selecting the best split

The classification goal is to separate the classes; therefore, the best split is often recognized as the one that brings the purest partitions. The measures developed for selecting the best split are often based on the degree of impurity of the child nodes: the smaller the degree of impurity, the more skewed the class distribution and the purer the partition (Tan et al., 2014). Let p(i|t) represent the fraction of instances belonging to class i at a given node t. The two most commonly used impurity measures are:

\[ Gini(t) = 1 - \sum_{i=0}^{c-1} \left[p(i|t)\right]^2 \]

\[ Entropy(t) = -\sum_{i=0}^{c-1} p(i|t)\,\log_2 p(i|t) \]

where c is the number of classes and i indexes the classes present at the node.
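A minimal sketch of the two measures as functions of the vector of class proportions p(i|t) at a node:

```python
import numpy as np

def gini(p):
    """Gini impurity for a vector of class proportions p(i|t)."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy impurity; 0 * log2(0) is taken as 0 by convention."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 1.0 -- maximally impure
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # 0.0 0.0 -- pure node
```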

Splitting of nominal attributes using Gini index

Attributes can be split in different ways based on the type of values they carry; for example, a nominal attribute can have many values and therefore can produce multiway splits (Tan et al., 2014). This thesis uses the scikit-learn machine learning library, which is built on the CART algorithm and only produces binary splits. Taking Table 3.2 as an example, the attribute Hardware can be split in three ways:

Table 3.3: Splitting nominal attributes

(1)           Hardware
          Model1, Model2   Model3
Broken         6              1
Healthy        2              1

(2)           Hardware
          Model1, Model3   Model2
Broken         3              4
Healthy        3              0

(3)           Hardware
          Model2, Model3   Model1
Broken         5              2
Healthy        1              2

The Gini index for the first split can be calculated in the following steps. First, calculate the Gini index for the group Model1, Model2:

\[ Gini = 1 - \left(\frac{6}{6+2}\right)^2 - \left(\frac{2}{6+2}\right)^2 = 0.375 \]


Then calculate the Gini index for group Model3:

\[ Gini = 1 - \left(\frac{1}{1+1}\right)^2 - \left(\frac{1}{1+1}\right)^2 = 0.5 \]

The weighted average Gini index for this split is:

\[ \frac{6+2}{10} \times 0.375 + \frac{1+1}{10} \times 0.5 = 0.4 \]

Similarly, for the second split the weighted average Gini index is 0.3, and for the third split it is equal to 0.3666667. The second split has the lowest Gini index and is therefore preferred over the other splits, as it produces the purest partitions.

Splitting of nominal attributes using Information gain

When entropy is used as the impurity measure, the split is determined by comparing the degree of impurity before and after splitting. This difference in entropy is known as the information gain; the larger the information gain, the better the split. The best split is the one that maximizes the information gain.

Take Table 3.2 as an example: before splitting on the attribute Hardware, the node contains 7 broken and 3 healthy observations, so the degree of impurity before splitting can be calculated as:

\[ Entropy = -\frac{7}{7+3}\log_2\left(\frac{7}{7+3}\right) - \frac{3}{7+3}\log_2\left(\frac{3}{7+3}\right) = 0.8812909 \]

After splitting using the first split in Table 3.3, the degree of impurity for each group can be calculated separately using the same formula: the entropy for the group Model1, Model2 is 0.8112781, and for the group Model3 it is 1. The weighted degree of impurity for the first split becomes:

\[ \frac{6+2}{10} \times 0.8112781 + \frac{1+1}{10} \times 1 = 0.8490225 \]

The information gain is equal to 0.8812909 - 0.8490225 = 0.0322684.

Similarly, the information gain for the second split can be calculated as 0.2812909, and for the third split as 0.0912775. The second split is therefore favored, as it brings the largest information gain.
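The two worked examples above can be reproduced in a few lines of code, reusing the gini and entropy helpers from the earlier sketch in this section:

```python
def weighted_impurity(groups, measure):
    """Weighted average impurity over the child groups of a split.

    Each group is a (broken, healthy) count pair, as in Table 3.3."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * measure([c / sum(g) for c in g]) for g in groups)

splits = {
    "Model1,Model2 | Model3": [(6, 2), (1, 1)],
    "Model1,Model3 | Model2": [(3, 3), (4, 0)],
    "Model2,Model3 | Model1": [(5, 1), (2, 2)],
}
parent = entropy([7 / 10, 3 / 10])  # 0.8812909, impurity before splitting
for name, groups in splits.items():
    print(name,
          round(weighted_impurity(groups, gini), 7),              # 0.4, 0.3, 0.3666667
          round(parent - weighted_impurity(groups, entropy), 7))  # information gain
```

Both measures agree here: the second split is the best one.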

Decision tree algorithm

Algorithm 1 describes a skeleton decision tree induction algorithm. It takes two input parameters, the training observations E and the attribute set F. The algorithm works by recursively selecting the best attribute to split the data (Step 7) and expanding the leaf nodes of the tree until the stopping criterion is met (Tan et al., 2014).

The tree-growing process is stopped when some stopping rule applies, by defining a certain threshold for user-defined hyper-parameters (for example, the maximum depth of the tree); otherwise the tree will continue expanding until all observations have the same class label or the same attribute values (Tan et al., 2014).


Algorithm 1 A skeleton decision tree induction algorithm: TreeGrowth(E, F)

 1: if stopping_condition(E, F) = true then
 2:   leaf = createNode()
 3:   leaf.label = Classify(E)   (determine the class label of the leaf node)
 4:   return leaf
 5: else
 6:   root = createNode()
 7:   root.test_cond = find_best_split(E, F)
 8:   let V = {v | v is a possible outcome of root.test_cond}
 9:   for each v in V do
10:     E_v = {e | root.test_cond(e) = v and e in E}
11:     child = TreeGrowth(E_v, F)
12:     add child as descendant of root and label the edge (root -> child) as v
13:   end for
14: end if
15: return root

Characteristics of decision tree

As discussed in the beginning of this chapter, decision trees were chosen as the base classifier due to their capability of solving the classification problem and explaining the model by providing feature importance rates. Decision trees have the following properties:

1. They are relatively fast to construct and they produce interpretable results when the trees are small (Friedman, Hastie, and Tibshirani, 2001).

2. They can also handle categorical predictor variables and missing values (Friedman, Hastie, and Tibshirani, 2001).

3. Furthermore, feature selection occurs naturally as part of the decision tree algorithm, as the algorithm itself decides which attributes to use and which to ignore. Decision trees are thereby resistant to many irrelevant predictor variables (Tan et al., 2014).

Generally speaking, decision trees are considered to be an "off-the-shelf" method in data mining, one that can be directly applied to the data without requiring a great deal of time-consuming data preprocessing or careful tuning of the learning procedure (Friedman, Hastie, and Tibshirani, 2001).

3.5 Ensemble methods

In order to improve the classification accuracy, ensemble methods, also known as classifier combination methods, can be applied to the classification model. An ensemble method constructs a set of base classifiers from training data and performs classification by taking a vote on the predictions made by each base classifier. The ensemble classifier increases not only the performance of the classification but also the confidence of the results (Cho and Won, 2007).

Ensemble methods tend to perform better than any single classifier under two necessary conditions: the base classifiers are independent of each other, and they do better than a classifier that performs random guessing (Tan et al., 2014).

The ensemble of classifiers can be constructed in many ways. Bagging is one example of ensemble methods that manipulate the training set according to a uniform sampling distribution. Random forest is a class of ensemble methods specifically designed for decision tree classifiers: it manipulates its input features and uses decision trees as its base classifiers (Tan et al., 2014). Random forest uses the concept of bagging in tandem with random feature selection (Theodoridis, Koutroumbas, et al., 2008).

Bagging

Bagging, also called bootstrap aggregating, is a technique that can reduce the variance of the base classifiers and improve the generalization error performance. The basic concept is to draw m samples from the original training set of m items using bootstrap techniques, that is, by uniformly sampling from the training set with replacement (Theodoridis, Koutroumbas, et al., 2008). Each bootstrap sample has the same size as the original data, and the same item can be selected multiple times because the sampling is done with replacement. A classifier is then constructed for each bootstrap sample. The final decision is assigned to the class with the highest number of votes (Tan et al., 2014).
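A minimal sketch of bagging with scikit-learn 0.20 (the library version used in this thesis), reusing the training data from the earlier sketches:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(),  # the base classifier
    n_estimators=100,    # number of bootstrap samples / base classifiers
    max_samples=1.0,     # each bootstrap sample has the original data size
    bootstrap=True,      # uniform sampling with replacement
    random_state=0,
).fit(X_train, y_train)

# The ensemble predicts the class with the highest number of votes
y_pred = bag.predict(X_test)
```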

Random forest

Random forest is a substantial adjustment of bagging that builds a large collection of de-correlated decision trees and averages the predictions. The essential idea in bagging is to reduce the variance; the idea in random forest is to improve the variance reduction of bagging by reducing the correlation between the trees. This is achieved in the tree-growing process through random selection of the input variables (Friedman, Hastie, and Tibshirani, 2001). Therefore, when growing a tree on a bootstrapped data set, random forest selects m <= p of the input variables at random as candidates for splitting and then selects the best split variable among the m variables, as presented in Figure 3.7.

Figure 3.7: Random Forest workflow (Step 1: randomly select m input features from the p variables; Step 2: use the randomly selected features to build multiple decision trees D1, ..., Dt on the randomized training data; Step 3: combine the predictions using majority vote)

Because the performance of random forest depends only on the number m of relevant variables, and not on all p variables, the method is robust to overfitting (Louppe, 2014). Random forest has also been recognized as "most interpretable" by researchers (Friedman, Hastie, and Tibshirani, 2001). How random forest works is presented in detail in Algorithm 2.


Algorithm 2 Random Forest for Classification

1. For b = 1 to B:

   a) Draw a bootstrap sample Z of size N from the training data.

   b) Grow a random-forest tree T_b on the bootstrapped data by recursively repeating the following steps for each terminal node of the tree, until the minimum node size n_min is reached:

      i. Select m variables at random from the p variables.
      ii. Pick the best variable/split-point among the m.
      iii. Split the node into two daughter nodes.

2. Output the ensemble of trees \(\{T_b\}_1^B\).

To make a prediction at a new point x, let \(\hat{C}_b(x)\) be the class prediction of the b-th random-forest tree. Then \(\hat{C}_{rf}^B(x) = \text{majority vote}\,\{\hat{C}_b(x)\}_1^B\).
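Scikit-learn implements this procedure in RandomForestClassifier. A minimal sketch with illustrative values (not the tuned values reported in Chapter 4), again reusing the data from the earlier sketches:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,     # B, the number of trees
    max_features="sqrt",  # m, the variables drawn at random at each split
    min_samples_leaf=1,   # controls the minimum node size n_min
    random_state=0,
).fit(X_train, y_train)

y_pred = rf.predict(X_test)  # majority vote over the B trees
```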

3.6 Hyper-parameter optimization

Hyper-parameters in machine learning are parameters whose values must be defined outside the learning process. For example, the number of decision trees in a random forest cannot be learned by the algorithm and must be set prior to fitting (Albon, 2018). The performance of modern machine learning methods depends highly on their hyper-parameter settings (Rijn and Hutter, 2018). Earlier research has also shown that tuning the hyper-parameters can improve the performance of random forest (Probst, Wright, and Boulesteix, 2018). The problem of identifying good values for hyper-parameters is called hyper-parameter optimization or hyper-parameter tuning. The tuning process has great importance in empirical machine learning work (Bergstra and Bengio, 2012). It is also one of the most important parts of the modelling process in this thesis.

Table 3.4 is a full list of the hyper-parameters included in the random forest classifier from the Scikit-learn machine learning library version 0.20.3 (scikit-learn, 2018). The hyper-parameters are further divided into three groups by their function and tuning usability:

1. Hyper-parameters marked [tunable] are considered to be "tunable", which means the model's performance changes when different values are set for those parameters.

2. Hyper-parameters marked [deprecated] are deprecated and will not be tuned, either because the library favors another hyper-parameter (for example, the parameter min_impurity_split will be removed in Scikit-learn version 0.25 in favor of min_impurity_decrease), or because they do not have practical significance (for example, the parameter min_weight_fraction_leaf was deprecated in version 0.20, as tuning it would at worst produce bad splits) (scikit-learn, 2018).

3. The leftover hyper-parameters are miscellaneous parameters outside the learning algorithm which affect overall functionality. The default values were used for most of those parameters, except for random_state, which was set to 0 in order to obtain a constant result when rerunning the model.


Table 3.4: Description of hyper-parameters for the random forest classifier in Scikit-learn 0.20.3

- n_estimators (int) [tunable]: The number of decision trees to construct.
- criterion (string) [tunable]: Determines the quality of a split. Supported criteria are "gini" (Gini impurity) and "entropy" (information gain).
- max_depth (int or None) [tunable]: The maximum depth of the tree. If None, nodes are expanded until all leaves are pure.
- min_samples_split (int or float [1]) [tunable]: The minimum number of samples required to split a node. If int, the minimum number of samples for a split; if float, the percentage of samples for a split.
- min_samples_leaf (int or float) [tunable]: The minimum number of observations required to be at a leaf.
- min_weight_fraction_leaf (float) [deprecated]: The minimum weighted fraction of the total sum of weights (of all the input samples) required to be at a leaf.
- max_features (int, string, float, or None) [tunable]: The number of features considered when choosing the best split. If int, the number of features at each split; if float, the percentage of features at each split; if "auto" or "sqrt", sqrt(total number of features); if "log2", log2(total number of features); if None, the total number of features.
- max_leaf_nodes (int or None) [tunable]: The maximum number of leaves. If None, an unlimited number of leaf nodes.
- min_impurity_decrease (float) [tunable]: The minimum decrease of impurity required to split a node.
- min_impurity_split (float) [2] [deprecated]: Threshold for early stopping in tree growth. A node will split if its impurity is above this value; otherwise it becomes a leaf.
- bootstrap (boolean [3]) [tunable]: Whether or not to sample with replacement.
- oob_score (boolean): In random forest, each decision tree is trained using a subset of observations; the unused ones are called out-of-bag (OOB) observations. They can be used as a test set to evaluate the model's performance, as an alternative to cross-validation (default = False).
- n_jobs (int or None): The number of models to train in parallel. If -1, all cores on the computer are used; if 1, one core is used (default = 1).
- random_state (int or None): Controls the random number generator used.
- verbose (boolean): Whether or not to generate detailed logging information while running the algorithm (default = False).
- warm_start (boolean): When True, previous trees are reused and more estimators are added to the random forest; when False, a whole new forest is fit (default = False).
- class_weight (dict [4], None, or "balanced") [tunable]: Used to correct for imbalanced classes. If supplied with desired weights for each class, the classifier weights the classes accordingly. If "balanced", weights are automatically adjusted inversely proportional to class frequencies, so the smaller class is weighted more.

[1] Float is a data type that refers to real numbers with decimals.
[2] Deprecated in favor of min_impurity_decrease; will be removed in scikit-learn version 0.21.
[3] Boolean is a data type with only two possible values ("true" and "false").
[4] Dict refers to a Python dictionary, an unordered collection of items.


Identifying the optimal hyper-parameter values can be a complex problem. Various post-hoc analysis techniques exist that, for a given data set and algorithm, determine which were the most important hyper-parameters and which of their values tended to yield good performance (Rijn and Hutter, 2018). However, the empirical evidence is still limited to a few data sets and is therefore not practical to apply to the specific data used in this thesis.

Van Rijn and Hutter (2018) introduced a method, applied on many different data sets, to discover which parameters impact performance most for the random forest algorithm and which hyper-parameter values are most likely to yield high performance. Their results reveal that most of random forest's variance could be attributed to a small set of hyper-parameters: the minimum number of samples required to create a leaf and the maximal number of features for determining the split were the most important, both significantly more important than the others. The parameters bootstrap and criterion were also important in some cases. However, in order to explore the model's performance in a broader context, this thesis applied a complete hyper-parameter optimization on all of the tunable hyper-parameters.

Random search for hyper-parameter optimization

In general, very little is known about a good range for a specific hyper-parameter; the general strategy for finding the optimal hyper-parameter values is therefore to try some random values on the model, evaluate each one, and choose the combination of hyper-parameter values that worked best (Bergstra and Bengio, 2012). There has been a lot of work and progress on hyper-parameter optimization, with methods including grid search, random search, and more advanced techniques such as Bayesian optimization, evolutionary optimization, and so on (Rijn and Hutter, 2018).

This thesis applied a random search approach for hyper-parameter optimization. It is a more efficient method that searches over a specific number of random combinations of hyper-parameter values drawn from specified distributions, referred to as the hyper-parameter space. It works as follows: first specify a distribution for each hyper-parameter (e.g., a uniform distribution), and then randomly sample hyper-parameter values without replacement from that distribution. If a list of values is specified, such as two values for a hyper-parameter (True and False), values are randomly sampled with replacement from the list (Albon, 2018). Instead of trying out all possible combinations, randomized search evaluates a given number of random combinations by selecting a random value for each hyper-parameter at every iteration (Bergstra and Bengio, 2012).

Random search has distinct advantages in hyper-parameter optimization. Earlier research shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. This is because, firstly, not all hyper-parameters are equally important to tune: grid search allocates too many trials to the exploration of dimensions that do not matter while suffering from poor coverage in dimensions that are important. Moreover, compared with grid search experiments, random search found better models in most cases and required less computational time (Bergstra and Bengio, 2012).

Table 3.5 presents the hyper-parameter distributions used in this thesis. All numerical values are configured with a uniform distribution. The value ranges were chosen based on experience from earlier research and knowledge of the training set.


Table 3.5: Hyper-parameter space settings for random search

Hyper-parameter          Hyper-parameter space                        Parameter type
bootstrap                (True, False)                                Random-forest-specific
class_weight             (balanced, None)                             Random-forest-specific
criterion                (gini, entropy)                              Random-forest-specific
max_depth                Uniform distribution between 3 and 30        Decision-tree-specific
max_features             Uniform distribution between 0.01 and 0.9    Decision-tree-specific
max_leaf_nodes           Uniform distribution between 10 and 200      Decision-tree-specific
min_impurity_decrease    Uniform distribution between 0 and 0.9       Decision-tree-specific
min_samples_split        Uniform distribution between 1 and 1000      Decision-tree-specific
min_samples_leaf         Uniform distribution between 1 and 100       Decision-tree-specific
n_estimators             Uniform distribution between 50 and 1000     Random-forest-specific

After deciding appropriate value ranges for the hyper-parameters, it is also important to decide on a scoring parameter. This thesis uses two separate scoring parameters in order to compare model performance: the AUC score and the accuracy rate.

When the AUC score is chosen as the scoring parameter, the model puts more focus on correct classification of the positive class (broken data). This scoring parameter calculates the area under the receiver operating characteristic curve (ROC) from prediction scores based on cross-validation, and random search returns the set of hyper-parameters that maximizes the score.

When the accuracy rate is chosen as the scoring parameter, the model prioritizes the overall prediction accuracy for both classes. Random search will return the set of hyper-parameters that maximizes the accuracy rate.
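A minimal sketch of this set-up with scikit-learn's RandomizedSearchCV, covering a subset of the hyper-parameter space in Table 3.5 and reusing the forest from the earlier sketch; the number of iterations here is an illustrative assumption:

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

# A subset of the hyper-parameter space in Table 3.5
param_distributions = {
    "bootstrap": [True, False],
    "class_weight": ["balanced", None],
    "criterion": ["gini", "entropy"],
    "max_depth": randint(3, 31),          # uniform integers in [3, 30]
    "max_features": uniform(0.01, 0.89),  # uniform floats in [0.01, 0.9]
    "n_estimators": randint(50, 1001),
}

search = RandomizedSearchCV(
    rf,                    # the random forest from the earlier sketch
    param_distributions,
    n_iter=100,            # number of random combinations to evaluate
    scoring="roc_auc",     # or "accuracy" for the accuracy-optimized run
    cv=5,                  # five-fold cross-validation per combination
    random_state=0,
).fit(X_train, y_train)
print(search.best_params_)
```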

3.7 Feature importance rate

The input predictor variables in machine learning applications are seldom equally relevant for predicting Y. Often only a small subset of them has substantial influence on the response variable. It is often useful to learn the relative importance or contribution of each input variable in predicting the response (Friedman, Hastie, and Tibshirani, 2001).

The Scikit-learn machine learning library uses Mean Decrease Impurity (MDI) to evaluate the importance of a variable X_m for predicting Y. The MDI is defined as the average decrease in node impurity when variable X_m is used to split a node. In the context of random forest, the feature importance is calculated in three stages: first, calculate the decrease of impurity each time variable X_m is used in each decision tree; then add up the weighted impurity decreases for all nodes where X_m is used; and lastly, average over all trees in the forest (Louppe, Wehenkel, Sutera, and Geurts, 2013). The formula is defined as:

\[ Imp(X_m) = \frac{1}{N_T} \sum_{T} \sum_{t \in T:\, X_m} p(t)\, \Delta i(s_t, t) \]

Here p(t) is the proportion N_t / N of samples reaching node t, and X_m is the variable used in the split s_t at that node. The impurity decrease can be calculated using either the Gini index or information gain (entropy), depending on which measure one chooses for deciding the splits.
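A minimal sketch of reading the MDI-based importances from a fitted forest; search is the tuned model from the earlier sketch, and feature_names is a hypothetical list of the dummy-coded column names (such as sys_1_5):

```python
import pandas as pd

# One MDI value per input feature; the values sum to 1 over all features.
# `feature_names` is a hypothetical list of dummy-coded column names.
importances = pd.Series(
    search.best_estimator_.feature_importances_,
    index=feature_names,
).sort_values(ascending=False)
print(importances.head(10))
```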

3.8 Modelling process

In conclusion, the method used in this thesis is composed of two main phases: data cleaning and classification (modelling). In the data cleaning step, the data are preprocessed to eliminate the ambiguous observations that may cause degradation of prediction performance. Removing the ambiguous instances from the training data will potentially reduce the false negative rate of the model trained from it (Trappenberg and Back, 2000).

In the classification step, a decision tree-based random forest is first built to produce a class label for each of the instances in the given training data set. The model is then carefully trained through hyper-parameter optimization on 10 of the model's hyper-parameters. Finally, the model with the best performance is chosen and fitted on the test data set. The final result is presented in Chapter 4.

Figure 3.8: Modelling workflow (data cleaning; data splitting into training and test sets; fitting the random forest classifier; hyper-parameter optimization with cross validation; choosing the parameters with the best performance; fitting the chosen model on the test set; model evaluation)


4 Results

This chapter presents and compares the classification results of random forest on the test set when using the AUC score and, alternatively, the accuracy rate as the evaluation metric to train the model.

4.1 AUC-optimized random forest

When "roc-auc" was used as the scoring parameter in hyper-parameter optimization, mean-ing the goal of random search is to find the optimal combination of hyper-parameters thatproduce the highest AUC score. Table 4.1 presents the best parameters set found by random-ized search that maximizes the AUC score:

Table 4.1: Hyper-parameters that maximize the AUC score

Parameter                Value       Explanation
bootstrap                False       Sampling has been done without replacement
class_weight             balanced    The algorithm adjusts the weights to create balanced training data
criterion                entropy     Information gain is used as the split criterion
max_depth                21          Maximum depth of the tree
max_features             0.33        33 percent of the total number of features are considered at each split
max_leaf_nodes           104         Each tree has at most 104 leaves
min_impurity_decrease    0.13        A node is split if the impurity decrease is greater than or equal to 0.13
min_samples_leaf         35          At least 35 observations are required at a leaf
min_samples_split        636         At least 636 observations are required to split a node
n_estimators             462         The final result was made from 462 decision trees

Classification result

The hyper-parameters from Table 4.1 were then put into the random forest classifier, and the obtained model was applied to the test set. The classification result is shown as a confusion matrix in Table 4.2:


Table 4.2: Confusion matrix from the AUC-optimized model

                            Predicted
                            Class = Broken    Class = Healthy
Actual   Class = Broken     9160              4324
         Class = Healthy    109017            325068

As shown in Table 4.2, the classification results in a large number of false positives (109017), which means a large amount of healthy data has been classified as broken by the model.

Model evaluation

Based on the confusion matrix, the accuracy score for this classification can be calculated as:

\[ \text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} = \frac{9160 + 325068}{9160 + 4324 + 109017 + 325068} = 0.74676 \]

The AUC score for this classification is 0.71409. The AUC score is calculated based on the true positive rate (TPR) and false positive rate (FPR):

\[ TPR = \frac{TP}{TP + FN} = \frac{9160}{9160 + 4324} = 0.6793236 \]

\[ FPR = \frac{FP}{TN + FP} = \frac{109017}{325068 + 109017} = 0.2511421 \]

Figure 4.1 shows the ROC curve and the calculated AUC score.

Figure 4.1: ROC curve for AUC-optimized model

The true positive rate is the ratio of broken data predicted as broken, while the false positive rate is the ratio of healthy data predicted as broken. The result shows that this model predicted 68 percent of the broken data correctly; however, it also misclassified 25 percent of the healthy data as broken.

Precision and recall measures can be used to compare the classification performance on the two classes in detail. Precision is the ratio of correctly predicted test instances; in other words, precision tells how much of the predicted broken data is actually broken. Precision for each class is calculated as follows:

For broken data:

\[ p = \frac{TP}{TP + FP} = \frac{9160}{9160 + 109017} = 0.07751085 \]

For healthy data:

\[ p = \frac{TN}{FN + TN} = \frac{325068}{4324 + 325068} = 0.9868728 \]

Recall stands for the percentage of test instances that have been correctly predicted; in other words, recall says how much of the broken data in the test set has been correctly recognized as broken. It is calculated as follows:

\[ r = \frac{TP}{TP + FN} = \frac{9160}{9160 + 4324} = 0.6793236 \quad \text{(same as TPR)} \]

While recall for healthy data is calculated as follows:

\[ r = \frac{TN}{FP + TN} = \frac{325068}{109017 + 325068} = 0.7488579 \]

The rounded results for recall and precision are presented in Table 4.3:

Table 4.3: Precision and recall from the model

                    Precision    Recall
Class = Broken      0.08         0.67
Class = Healthy     0.99         0.75

Building a model that maximizes both precision and recall is the key challenge for machine learning algorithms (Tan et al., 2014). This model classifies more healthy data into the positive class (broken data), which results in a high recall but low precision for the positive class. In general, this AUC-optimized model achieved good performance on the positive class (broken data) at the cost of a relatively large number of false positives.

4.2 Accuracy-optimized random forest

As a comparison to the AUC-optimized model, the following hyper-parameter values come from the random forest that optimized the accuracy rate.


Table 4.4: Hyper-parameters for the accuracy-optimized model

Parameter                Value      Explanation
bootstrap                True       Sampling has been done with replacement
class_weight             None       All classes are weighted equally
criterion                entropy    Entropy is used as the split criterion
max_depth                None       Trees grow until all leaves are pure
max_features             sqrt       The square root of the total number of features is considered at each split
max_leaf_nodes           None       Each tree can have as many leaves as possible
min_impurity_decrease    0          A node is split if the impurity decrease is greater than or equal to 0
min_samples_leaf         1          At least 1 observation is required at a leaf
min_samples_split        2          At least 2 observations are required to split a node
n_estimators             944        The final result was made from 944 decision trees

Classification result

The confusion matrix on the test data with these hyper-parameter settings is presented in Table 4.5:

Table 4.5: Confusion matrix from the accuracy-optimized model

                            Predicted
                            Class = Broken    Class = Healthy
Actual   Class = Broken     1078              12406
         Class = Healthy    59                434026

In an imbalanced data set, majority class examples contribute more to the overall accuracy rate (Galar, Fernandez, Barrenechea, Bustince, and Herrera, 2012). As shown in Table 4.5, the number of true positives is comparatively small compared with the number of false negatives, which means the majority of broken data has been predicted incorrectly as healthy data. However, the number of true negatives is much better compared with the AUC-optimized model, which means the accuracy-optimized model has correctly predicted most of the healthy data.

Model evaluation

The accuracy score for this classification can be calculated as:

\[ \text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} = \frac{1078 + 434026}{1078 + 12406 + 59 + 434026} = 0.97215 \]

The accuracy rate for this model is 97.215 percent, which is much higher than that of the AUC-optimized model.

The true positive rate (TPR) and false positive rate (FPR) are calculated as:

\[ TPR = \frac{TP}{TP + FN} = \frac{1078}{1078 + 12406} = 0.0799466 \]

\[ FPR = \frac{FP}{TN + FP} = \frac{59}{434026 + 59} = 0.0001359181 \]

Both TPR and FPR values are relatively low. The TPR indicates that the model classified only 7.99 percent of the broken data correctly, while the FPR indicates that only about 0.014 percent of the healthy data were misclassified as broken. Together, these values yield a low AUC score. The calculated AUC score for this classification is 0.539905, as shown in Figure 4.2.


Figure 4.2: ROC curve for the accuracy-optimized model

Precision for each class is calculated as follows; the denominator in the fraction represents the predicted class.

For broken data:

\[ p = \frac{TP}{TP + FP} = \frac{1078}{1078 + 59} = 0.9481091 \]

For healthy data:

\[ p = \frac{TN}{FN + TN} = \frac{434026}{12406 + 434026} = 0.9722108 \]

Recall for each class is calculated as follows; the denominator in recall stands for the actual class:

For broken data (the same as true positive rate):

\[ r = \frac{TP}{TP + FN} = \frac{1078}{1078 + 12406} = 0.0799466 \]

While recall for healthy data is calculated as follows:

\[ r = \frac{TN}{FP + TN} = \frac{434026}{59 + 434026} = 0.9998641 \]

The rounded results for recall and precision are presented in Table 4.6:

Table 4.6: Precision and recall from the accuracy-optimized model

                    Precision    Recall
Class = Broken      0.95         0.08
Class = Healthy     0.97         0.99


The broken data in Table 4.6 have a high precision and low recall, which means that the classifier achieved good performance on the negative class (healthy data) at the cost of a high false negative rate. The high accuracy rate has been reached by misclassifying broken data (the minority class) as healthy data (the majority class).

4.3 Feature importance rate

Random forest ranks the input variables based on how well they improve the purity of the nodes. The following graphs present the features with the highest feature importance rates.

Feature importance rate from AUC-optimized model

The feature importance rates from the AUC-optimized random forest indicate that the system variable sys_1 and the property variable prop_12 tend to be the most important variables for the classification, as shown in Figure 4.3. The random forest is not only able to identify which feature is the most important to the classification model, but also which category inside the feature plays that role.

Figure 4.3: Feature importance rate for the AUC-optimized model

For the AUC-optimized model, only three variables have an impact on the target variable; the feature importance rates for all other variables are equal to 0.

Feature importance rate from accuracy-optimized model

For the accuracy-optimized model, the feature importance rates show a widespread distribution. Figure 4.4 shows the 30 features with the highest feature importance rates. The feature importance rates from the random forest show that the system variables sys_2_6 and sys_1_5 are the most important variables for the classification.


Figure 4.4: Feature importance rate for the accuracy-optimized model

Figure 4.5 shows the grouped feature importance rates for each variable, obtained by summing up the feature importance rates of the individual categories of each variable.

Figure 4.5: Grouped feature importance rates for the accuracy-optimized model


The grouped feature importance rates show that the variables sys_1 and sys_2 have the highest importance rates.

To conclude, at the variable level the most important features according to the random forest are sys_1 and sys_2, and at the category level the most important features are sys_1_5 and sys_2_6.


5 Discussion

This chapter shares the author's understanding and insights about data processing and machine learning algorithms, and concludes with a discussion of the limitations of the study and areas for future research.

5.1 Data

The data set used in this thesis has two distinct characteristics: it contains only categorical variables and is severely imbalanced, which makes it different from most of the data sets discussed in statistics. Ideally, good knowledge about the data is required in order to conduct a good analysis; such knowledge can cover important characteristics such as the precision of the data, the type of features, the scale of measurement, and the origin of the data (Tan et al., 2014).

However, little information about this data set was provided to the author, as data collection was assigned to, and completed by, different groups of workers according to their expertise. The knowledge about this data set has been collected piece by piece from different experts and lacks a solid background. Obtaining more knowledge about the data set might help to correct the imperfections in the data and improve the data quality. For example, ambiguous data, missing values, and inconsistent or duplicate data might have been fixable if the author had prior knowledge about how the data were collected and why they were imperfect.

Feature selection

Although the decision tree-based model is capable of embedded feature selection, which means the model can decide by itself which independent variables to use, multiple measures were taken into consideration to check the correlation between the explanatory variables and the response variable validation.

Measures of correlation between categorical variables can be obtained through a chi-square test for association (Newbold, Carlson, and Thorne, 2012). However, the chi-square statistic is largely affected by the sample size and the number of categories within variables. Due to the size of this data set (1.5 million observations), all chi-square test statistics turn out to be very large values with no practical usage. This result leads to another discussion on a dimension reduction approach, namely sampling methods.

Sampling and dimension reduction

As described in Chapter 3, sampling techniques have been an effective way of handling imbalanced data sets. By constructing a balanced data set through over-sampling or under-sampling, sampling methods can improve the prediction performance of a classifier. However, under-sampling may cause loss of useful information and over-sampling may add redundant information. Moreover, most of the advanced sampling techniques are distance-based, which means they are only applicable to continuous variables. Therefore, sampling techniques were not taken into consideration in this thesis.

However, during hyper-parameter optimization, setting the hyper-parameter class_weight to "balanced" can be seen as a compensation for sampling techniques, as the model will adjust weights inversely proportional to class frequencies in the input data (Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot, and Duchesnay, 2011), which makes the training data set more balanced. The hyper-parameter optimization result shows that the model with the highest AUC score has class_weight set to "balanced". This can be understood as follows: in order to make a better prediction on the positive instances, the imbalanced class distribution should be altered in some way.

Minimized data

The original data were recorded automatically every 30 seconds. Since the metadata information was configured at the beginning of each data collection tour, a large amount of the collected data contains exactly the same metadata information. By grouping the identical values in the original data frame, a minimized data frame was obtained.

The minimized data frame contains 3261 observations (rows) and 24 variables (columns) after data cleaning. The class distribution of the variable validation is shown in Table 5.1.

Table 5.1: Distribution of broken and healthy data after resampling

     Data type    Count    Proportion
1    Broken       336      0.103
2    Healthy      2925     0.897

However, the minimized data were not used in the data analysis, as minimizing might cause loss of information. The Gini and entropy criteria in decision trees are also dependent on the number of observations in classification.

Ambiguous data

There have always been discussions about how data should be processed in machine learning and statistical analysis. Whether to keep or remove the instances that do not have a clear class boundary remains a question left for deeper discussion.


5.2 Method

The categorical data type limits the application of other learning algorithms that were originally designed for continuous data. When choosing the right analysis model for the data set, it is important to keep things like variable type (numeric, categorical, or ordinal), data size, and distributional assumptions in mind, as most models have particular assumptions. The author explored several other methods besides random forest:

Naive Bayes was considered due to its capability of handling categorical variables and providing feature importance rates. However, Naive Bayes assumes independence between the explanatory variables, which this data set cannot satisfy. Moreover, the performance of random forest can be improved by hyper-parameter tuning, while Naive Bayes' performance relies heavily on feature selection: a good combination of features tends to give a better result. Due to the limited time, this thesis was not able to apply a good feature selection technique for Naive Bayes. It will be a future research interest to compare the model performance between tree-based methods and Naive Bayes.

Boosting-based trees such as AdaBoost and gradient boosting were also considered; it would be interesting to know how their performance would improve after hyper-parameter optimization. It is also worth mentioning that, with default classifier settings, there was no big difference between random forest and the boosting-based methods, although random forest did give a slightly better result in terms of AUC score. However, as random search takes a long time for a single method, a proper comparison between those methods is beyond the scope of this thesis.

Clustered Data and mixed effects

As mentioned earlier, each observation in the metadata attribute set corresponds to a 30-second data recording. Since data collection was normally conducted on an hourly basis, there is an interest in investigating whether the data have a clustered structure. A clustered structure means observations from the same cluster are possibly correlated, while observations from distinct clusters are independent.

There is a tree-based method named "generalized mixed effects regression tree" (GMERT), which is suitable for non-Gaussian data (e.g., binary outcomes and count data) and can handle unbalanced clusters (Hajjem, Larocque, and Bellavance, 2017). However, this model assumes all the categorical variables have only fixed effects, which this data set cannot meet. Moreover, detecting the cluster patterns requires variables such as "recording-time" to be included in this data set, which would add more complexity to this thesis work.

Hyper-parameter optimization

This is the most time-consuming part of the modelling. Although random search has been shown in earlier research to be a good solution for hyper-parameter optimization, choosing a suitable distribution for the parameters requires good knowledge of both the data and the algorithm. If it were known ahead of time which intervals would be most effective, the author could design a more appropriate random search and save much time.

Hyper-parameter optimization also leads to another problem: the classifier tends to have a low accuracy but a high AUC score. This happens because the classifier achieves good performance on the positive class (broken data) at the cost of a high false positive rate (or, equivalently, a low number of true negatives). In other words, the classifier tends to misclassify healthy data as broken data.
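
The effect can be reproduced with invented numbers. In the sketch below (toy scores, not the thesis results), the model ranks every broken file above all healthy files, so the AUC is perfect, yet an aggressive decision threshold produces many false positives and drags the accuracy down.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_true = np.array([0] * 97 + [1] * 3)       # 97 healthy, 3 broken
rng = np.random.default_rng(0)
scores = np.r_[rng.uniform(0.0, 0.6, 97),   # healthy data scores low...
               [0.7, 0.8, 0.9]]             # ...broken data scores highest
y_pred = (scores > 0.3).astype(int)         # aggressive decision threshold

print("AUC:     ", roc_auc_score(y_true, scores))    # 1.0 (perfect ranking)
print("accuracy:", accuracy_score(y_true, y_pred))   # pulled down by FPs
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")         # many healthy flagged broken
```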


5.3 Limitations

Due to the long computational time of random search for hyper-parameter optimization and the limited time frame for this work, this thesis was not able to make a complete comparison between different techniques. For example, does ambiguous data matter to random forest? Would keeping the ambiguous data lead to a degradation of random forest's performance?

Furthermore, it would be interesting to explore which hyper-parameter is most important for random forest on this data set. For example, the hyper-parameter class_weight can be one of the most important factors, so it would be worth investigating what the best model performance is when class_weight is set to "None" versus "balanced".
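
A minimal sketch of that comparison (on toy data; the actual effect would have to be measured on the real metadata set) could look as follows:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)
X = OrdinalEncoder().fit_transform(rng.choice(list("abcd"), size=(1000, 5)))
y = (rng.random(1000) < 0.03).astype(int)  # ~3 percent broken data

for cw in (None, "balanced"):
    clf = RandomForestClassifier(class_weight=cw, random_state=0)
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    print(f"class_weight={cw!s:8}: AUC={auc:.3f}  accuracy={acc:.3f}")
```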

5.4 The work in a wider context

This thesis is a knowledge application built upon earlier research on outlier detection in categorical data and classification of imbalanced data. The work applied tried-and-tested theories in a new context. However, implementing a hybrid-based approach might also be suitable for solving the specific problem in this data set.

Ideas include a combination of unsupervised learning, using sequence analysis or another pattern-based method, with supervised learning. Such an approach could start by applying pattern analysis on the training set to find the anomalous patterns, and then use this model to identify anomalies in the test set.
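
As one hypothetical illustration of such a hybrid, the sketch below replaces full sequence analysis with a simple frequency-based pattern score: each observation is scored by how rare its category values are in the training set, and the score is handed to the supervised learner as an extra feature. The scoring rule and all names are illustrative assumptions, not a method developed in this thesis.

```python
import numpy as np
from collections import Counter
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)
X_raw = rng.choice(list("abcd"), size=(1000, 5))       # toy categorical data
y = (rng.random(1000) < 0.03).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y, stratify=y, random_state=0)

def rarity(train, X):
    """Per-row pattern score: sum of -log frequencies of each value,
    estimated on the training set only (unseen values get a small count)."""
    n = len(train)
    counts = [Counter(col) for col in train.T]
    return np.array([sum(-np.log(c.get(v, 0.5) / n) for c, v in zip(counts, row))
                     for row in X]).reshape(-1, 1)

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1).fit(X_tr)
X_tr_h = np.hstack([enc.transform(X_tr), rarity(X_tr, X_tr)])  # features + score
X_te_h = np.hstack([enc.transform(X_te), rarity(X_tr, X_te)])

clf = RandomForestClassifier(random_state=0).fit(X_tr_h, y_tr)
proba = clf.predict_proba(X_te_h)[:, 1]   # anomaly probabilities on the test set
print(proba[:5])
```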

Alternative evaluation metrics, such as precision and recall, could also be used as the scoring parameter during hyper-parameter optimization in order to check the classification result under different assumptions.
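
As a sketch, the same toy forest can be evaluated under several of scikit-learn's built-in metric strings; any of these strings can equally be passed as the scoring argument of the random search described earlier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)
X = OrdinalEncoder().fit_transform(rng.choice(list("abcd"), size=(1000, 5)))
y = (rng.random(1000) < 0.03).astype(int)

clf = RandomForestClassifier(class_weight="balanced", random_state=0)
for metric in ("roc_auc", "precision", "recall"):
    # Each string is a valid `scoring` value for cross_val_score and
    # RandomizedSearchCV alike; warnings may appear on folds where the
    # forest predicts no positives at all.
    score = cross_val_score(clf, X, y, cv=5, scoring=metric).mean()
    print(f"{metric}: {score:.3f}")
```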


6 Conclusion

The classification results show that random forest combined with hyper-parameter optimization is able to detect and classify broken data by learning only from metadata attributes. When the accuracy rate is used as the model evaluation metric, the chosen model after hyper-parameter optimization is able to make predictions on the test set with a high accuracy rate. When the AUC score is chosen as the model evaluation metric, the tuned model provides a relatively high AUC score.

The main challenge when handling severely imbalanced data is to minimize both false negatives and false positives. When set to focus more on the correct prediction of positive instances, random forest is effective in achieving a desirable true positive rate. When set to focus more on the overall accuracy rate, random forest is capable of achieving a high true negative rate. However, it is not effective at achieving both a high true positive rate and a high true negative rate when the data is severely imbalanced.

Tree-based ensemble methods are able to handle missing values and produce interpretable results. Random forest is such an ensemble method: it manipulates its input features through embedded approaches, which means that there is no need to spend a large amount of time on data preprocessing, as the tree algorithm itself automatically detects which attributes are most relevant for prediction. To conclude, the thesis can answer the research questions and summarize the classification results as follows:

1. Can a classification model automatically detect broken files from healthy files, by learning only from metadata features?

Yes, the classification model is able to classify broken data from healthy data with a high accuracy rate (97 percent) by sacrificing the true positive rate; this means the model tends to misclassify broken data as healthy data when chasing a high accuracy rate.

When seeking a high AUC score, the model is also capable of predicting the majority of broken data correctly, providing a desirable true positive rate; however, the model also produces massive false positives by misclassifying healthy data as broken data.

Generally speaking, if correctly detecting broken data is considered more important than correctly detecting healthy data, then the AUC score should be used as the model evaluation metric. When the correct prediction of healthy data is considered more important, then the accuracy rate should be used as the model evaluation metric.

2. Which features contribute most to the classification model; in other words, which input features are most important in determining the target variable?

The system variables sys_1 and sys_2 have the highest relative importance as computed by random forest. In other words, when determining whether a collected data file is broken or healthy, the variables sys_1 and sys_2 have the most impact.
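
For reference, the sketch below shows how such relative importances are read from a fitted scikit-learn forest; the data is synthetic and all variable names except sys_1 and sys_2 are hypothetical placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)
names = ["sys_1", "sys_2", "env_1", "env_2", "env_3"]  # env_* are placeholders
X = OrdinalEncoder().fit_transform(rng.choice(list("abcd"), size=(1000, 5)))
y = (rng.random(1000) < 0.03).astype(int)

clf = RandomForestClassifier(random_state=0).fit(X, y)
# feature_importances_ holds the mean impurity decrease per feature,
# normalized to sum to one across all features.
for name, imp in sorted(zip(names, clf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```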


Bibliography

Albon, Chris (2018). Machine learning with Python cookbook. O’Reilly Media.

Autoliv (2018). Vision Systems. Accessed: 2019-02-07. URL: https://www.autoliv.com/products/electronics/vision-systems.

Bergstra, James and Yoshua Bengio (2012). “Random search for hyper-parameter optimization”. In: Journal of Machine Learning Research 13.Feb, pp. 281–305.

Cho, Sung Bae and Hong-Hee Won (2007). “Cancer classification using ensemble of neural networks with multiple significant gene subsets”. In: Applied Intelligence 26.3, pp. 243–250.

Dam, Hoa Khanh, Truyen Tran, and Aditya Ghose (2018). “Explainable software analytics”. In: Proceedings of the 40th International Conference on Software Engineering: New Ideas and Emerging Results. ACM, pp. 53–56.

Droettboom, Michael et al. (2015). “Understanding JSON Schema”. In: http://spacetelescope.github.io/understanding-jsonschema/UnderstandingJSONSchema.pdf (accessed on 14 April 2014).

Elrahman, Shaza M Abd and Ajith Abraham (2013). “A review of class imbalance problem”. In: Journal of Network and Innovative Computing 1.2013, pp. 332–340.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani (2001). The elements of statistical learning. Vol. 1. 10. Springer Series in Statistics. New York, NY, USA.

Galar, Mikel, Alberto Fernandez, Edurne Barrenechea, Humberto Bustince, and Francisco Herrera (2012). “A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches”. In: IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42.4, pp. 463–484.

Gibert, Karina, Miquel Sanchez-Marre, and Victor Codina (2010). “Choosing the right data mining technique: classification of methods and intelligent recommendation”. In: International Congress on Environmental Modelling and Software, p. 367.

Goodacre, Royston (2003). “Explanatory analysis of spectroscopic data using machine learning of simple, interpretable rules”. In: Vibrational Spectroscopy 32.1, pp. 33–45.


Gu, Jie (2007). “Random Forest Based Imbalanced Data Cleaning and Classification”. In: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’07).

Hajjem, Ahlem, Denis Larocque, and François Bellavance (2017). “Generalized mixed effects regression trees”. In: Statistics & Probability Letters 126, pp. 114–118. ISSN: 0167-7152.

Hoens, T Ryan and Nitesh V Chawla (2013). “Imbalanced datasets: from sampling to classifiers”. In: Imbalanced Learning: Foundations, Algorithms, and Applications, pp. 43–59.

Veoneer Inc (2019). Who we are. Accessed: 2019-02-14. URL: https://www.veoneer.com/en/who-we-are.

Kursa, Miron B, Witold R Rudnicki, et al. (2010). “Feature selection with the Boruta package”.In: J Stat Softw 36.11, pp. 1–13.

Louppe, Gilles (2014). “Understanding random forests: From theory to practice”. In: arXiv preprint arXiv:1407.7502.

Louppe, Gilles, Louis Wehenkel, Antonio Sutera, and Pierre Geurts (2013). “Understanding variable importances in forests of randomized trees”. In: Advances in Neural Information Processing Systems, pp. 431–439.

Newbold, Paul, William Carlson, and Betty Thorne (2012). Statistics for business and economics. Pearson.

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011). “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12, pp. 2825–2830.

Probst, Philipp, Marvin N Wright, and Anne-Laure Boulesteix (2018). “Hyperparameters and tuning strategies for random forest”. In: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, e1301.

Rijn, Jan N van and Frank Hutter (2018). “Hyperparameter importance across datasets”. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2367–2376.

Riley, Jenn (2017). “Understanding metadata”. In: NISO Primer Series.

Rosell, Mikael (2015). Semi-Supervised Learning for Object Detection. (Master’s thesis, Linköping University, Linköping). URL: http://liu.diva-portal.org/smash/get/diva2:782911/FULLTEXT01.pdf.

Tan, Pang-Ning et al. (2014). Introduction to data mining. Pearson Education Limited.

Theodoridis, Sergios, Konstantinos Koutroumbas, et al. (2008). “Pattern recognition”. In: IEEE Transactions on Neural Networks 19.2, p. 376.

Trappenberg, Thomas P and Andrew D Back (2000). “A classification scheme for applications with ambiguous data”. In: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium. Vol. 6. IEEE, pp. 296–301.

Vellido, Alfredo, Jose David Martin-Guerrero, and Paulo JG Lisboa (2012). “Making machine learning models interpretable.” In: ESANN. Vol. 12. Citeseer, pp. 163–172.

Veoneer (2018). Investor Day 2018 Veoneer Transcript. Accessed: 2019-02-05. URL: https://www.veoneer.com/sites/default/files/Investor%20Day%20May%2031,%202018_Veoneer%20Transcript.pdf.
