insider threat detection using supervised machine learning … · 2020. 3. 10. · 1 insider threat...

26
DOI: 10.4018/IJCWT.2020040101 International Journal of Cyber Warfare and Terrorism Volume 10 • Issue 2 • April-June 2020 Copyright©2020,IGIGlobal.CopyingordistributinginprintorelectronicformswithoutwrittenpermissionofIGIGlobalisprohibited. 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced Dataset Naghmeh Moradpoor Sheykhkanloo, Edinburgh Napier University, Edinburgh, UK Adam Hall, Edinburgh Napier University, Edinburgh, UK ABSTRACT Aninsiderthreatcantakeonmanyformsandfallunderdifferentcategories.Thisincludesmalicious insider, careless/unaware/uneducated/naïve employee, and the third-party contractor. Machine learningtechniqueshavebeenstudiedinpublishedliteratureasapromisingsolutionforsuchthreats. However,theycanbebiasedand/orinaccuratewhentheassociateddatasetishugelyimbalanced. Therefore, this article addresses the insider threat detection on an extremely imbalanced dataset whichincludesemployingapopularbalancingtechniqueknownasspreadsubsample.Theresults showthatalthoughbalancingthedatasetusingthistechniquedidnotimproveperformancemetrics, itdidimprovethetimetakentobuildthemodelandthetimetakentotestthemodel.Additionally, theauthorsrealisedthatrunningthechosenclassifierswithparametersotherthanthedefaultones hasanimpactonbothbalancedandimbalancedscenarios,buttheimpactissignificantlystronger whenusingtheimbalanceddataset. KEyWoRDS Data Pre-Processing, Imbalanced Dataset, Insider Threat, Spread Subsample, Supervised Machine Learning 1. INTRoDUCTIoN Insiderattackspresentaconsiderableissueinthecyber-threatlandscape,with40%oforganisations labellingthevectorasthemostdamagingattackfaced(Cole,2017)and(Moradpoor,2017).In2016, thecontainmentandremediationofreportedinsiderthreatscostaffectedorganisations4milliondollars onaverage(PonemonInstitute,2016).Inaddition,insiderthreatsareextremelycommonamongcyber- incidents;in2015,55%ofcyber-attackswereinsiderthreatcases(Bradley,2015).Despitethehigh costandfrequentoccurrenceofinsiderthreatattacks,detectionandmitigationremainaproblem. In2018,90%ofcompaniesareregardedvulnerable(Insiders,2018).Afurther38%ofcompanies acknowledgethattheirinsiderthreatdetectionandpreventioncapabilitiesarenotadequate(Cole, 2017).Thisdisparitydemonstratesasignificantgapbetweenthecurrentadvancementsininsiderthreat detection,andtherequirementsofbusinesses.Giventheavailabilityofcomputationalresources,itis feasibletouseMachineLearning(ML)techniquestosolveproblemsoflargercomplexitythanhas previouslybeenpossible.Astrongprecedentofthiscanbeobservedinrecenthistorywiththegrowth ofthefieldofBigData.ThisisalsoexemplifiedbythehistoricachievementofGoogleDeepmind (Hassabis,2017),creatingamachinelearningalgorithmwhichmasterstheimmenselycomplexboard gameGo(Silver,2016).Mostorganisationshavetheresourcestokeeplogsofemployeeinteractions withtechnology.Byharnessingthedataproducedthroughlogging,thisinformationcouldbedigested CORE Metadata, citation and similar papers at core.ac.uk Provided by Repository@Napier

Upload: others

Post on 21-Jan-2021

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

DOI: 10.4018/IJCWT.2020040101

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

Copyright©2020,IGIGlobal.CopyingordistributinginprintorelectronicformswithoutwrittenpermissionofIGIGlobalisprohibited.

1

Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced DatasetNaghmeh Moradpoor Sheykhkanloo, Edinburgh Napier University, Edinburgh, UK

Adam Hall, Edinburgh Napier University, Edinburgh, UK

ABSTRACT

Aninsiderthreatcantakeonmanyformsandfallunderdifferentcategories.Thisincludesmaliciousinsider, careless/unaware/uneducated/naïve employee, and the third-party contractor. Machinelearningtechniqueshavebeenstudiedinpublishedliteratureasapromisingsolutionforsuchthreats.However,theycanbebiasedand/orinaccuratewhentheassociateddatasetishugelyimbalanced.Therefore, this article addresses the insider threat detectionon an extremely imbalanceddatasetwhichincludesemployingapopularbalancingtechniqueknownasspreadsubsample.Theresultsshowthatalthoughbalancingthedatasetusingthistechniquedidnotimproveperformancemetrics,itdidimprovethetimetakentobuildthemodelandthetimetakentotestthemodel.Additionally,theauthorsrealisedthatrunningthechosenclassifierswithparametersotherthanthedefaultoneshasanimpactonbothbalancedandimbalancedscenarios,buttheimpactissignificantlystrongerwhenusingtheimbalanceddataset.

KEyWoRDSData Pre-Processing, Imbalanced Dataset, Insider Threat, Spread Subsample, Supervised Machine Learning

1. INTRoDUCTIoN

Insiderattackspresentaconsiderableissueinthecyber-threatlandscape,with40%oforganisationslabellingthevectorasthemostdamagingattackfaced(Cole,2017)and(Moradpoor,2017).In2016,thecontainmentandremediationofreportedinsiderthreatscostaffectedorganisations4milliondollarsonaverage(PonemonInstitute,2016).Inaddition,insiderthreatsareextremelycommonamongcyber-incidents;in2015,55%ofcyber-attackswereinsiderthreatcases(Bradley,2015).Despitethehighcostandfrequentoccurrenceofinsiderthreatattacks,detectionandmitigationremainaproblem.In2018,90%ofcompaniesareregardedvulnerable(Insiders,2018).Afurther38%ofcompaniesacknowledgethattheirinsiderthreatdetectionandpreventioncapabilitiesarenotadequate(Cole,2017).Thisdisparitydemonstratesasignificantgapbetweenthecurrentadvancementsininsiderthreatdetection,andtherequirementsofbusinesses.Giventheavailabilityofcomputationalresources,itisfeasibletouseMachineLearning(ML)techniquestosolveproblemsoflargercomplexitythanhaspreviouslybeenpossible.AstrongprecedentofthiscanbeobservedinrecenthistorywiththegrowthofthefieldofBigData.ThisisalsoexemplifiedbythehistoricachievementofGoogleDeepmind(Hassabis,2017),creatingamachinelearningalgorithmwhichmasterstheimmenselycomplexboardgameGo(Silver,2016).Mostorganisationshavetheresourcestokeeplogsofemployeeinteractionswithtechnology.Byharnessingthedataproducedthroughlogging,thisinformationcouldbedigested

CORE Metadata, citation and similar papers at core.ac.uk

Provided by Repository@Napier

Page 2: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

2

intoaformatuponwhichpredictionsregardinginsiderthreatcasescouldbemade.Havingsaidthis,adatadrivenapproachtoinsiderthreatmitigationisnotanewidea,thisisafieldexperiencinganincreasingrateofpublication.However,vanguardattemptsstillreportmoreeffectivemodelsthanlatercaseswheremachinelearninghasbeenapplied(Gheyas,2016).

Inmachinelearning/dataminingprojects,animbalanceddatasetisadatasetinwhichthenumberofobservationsbelongingtooneclassisconsiderablylowerthanthosebelongingtootherclass/classes.Apredictivemodelemployingconventionalmachinelearningalgorithmscouldbebiasedandinaccuratewhenbeingemployedonsuchdatasets.Thisispurelybecausemachinelearningalgorithmsaredesignedtoimproveaccuracybyreducingtheerrorinthenetwork.Therefore,theydonotconsidertheclassdistribution,classproportion,orbalanceoftheclassesintheirclassificationprocess.Apredictivemachinelearningmodelbeingbiasorinaccuratecanbepredominantinscenarioswheretheminorityclassbelongstothemaliciousactivitiesandtheanomalydetectionisextremelycrucial.Thisincludesscenariossuchas:occasionalfraudulenttransactionsinbanks,irregularinsiderthreats,rarediseaseidentification,naturaldisastersuchasearthquakes,andperiodicmaliciousactivitiesoncriticalinfrastructures(e.g.infrequentattacksonnuclearpowerplantsorwatersupplysystemsinacity).Giventheimportanceofthesescenarios,aninaccurateclassificationbyapredictivemachinelearningmodelcouldcostthousandsoflivesorhugecosttoindividualsand/ororganisations.Thereareseveraltechniquestosolvesuchclassimbalanceproblemsusingvarioussampling/non-samplingmechanismse.g.oversampling,undersealingandSMOTEaswellasensemblemethodsandcost-based techniques. However, the importance of an imbalanced dataset has not been clearly andadequatelyinvestigatedintheliteratureparticularlyformachinelearning-basedsolutionsforinsiderthreatdetections.

Therefore,inthispaper,ourfocusisonanextremelyimbalanceddatasetofinsiderthreatswherethenumberofeventsbelongingtothemaliciousclassisconsiderablylowerthanthosebelongingtothebenignclass.Weusespreadsubsample(Weka.ClassSpreadSubsample,2018)asapopularbalancingtechnique.Thefilterallowsyoutospecifythemaximumspreadbetweentherarestclassandthemostcommonone.Forexample,givenanimbalanceddataset,youmayindicatethatthereshouldbeafigureof3:1differenceinclassfrequency.Forthis,theoriginaldatasetfirstfitsinthememory then a randomsubsampleof adatasetwill beproducedgiven the identifiedmaximumspreadbetween twoclasses. In thispaper,wespecify themaximum“spread”between therarestclass (i.e.maliciousevents)and themostcommonclass (i.e.benignevents)as“1” representinguniformdistributionbetweentwoclasses.Thisallowsustokeepallthemaliciouseventsplustheequalnumberofthebenigneventsselectedrandomlywhichresultsinhavingauniformdistributionofmaliciousandbenignevents.

Inthispaper,weraisethefollowingspecificresearchquestions:

RQ1: Does balancing the dataset during the pre-processing phase improve metrics such as:ClassificationAccuracy(CA),TimetakentoBuildthemodel(TB),TimetakentoTestthemodel(TT),TruePositive(TP)rate,FalsePositive(FP)rate,Precision(P),Recall(R),andF-measure(F)incomparisonwiththesamemetricsbutonanimbalanceddataset?

RQ2:Whataretheimportantparametersforeachclassifierthatconfiguringthemcouldhaveanimpactontheclassificationresults?

RQ3:Doeschangingtheseparameterswithdifferentvaluesimprovemetricssuchas:ClassificationAccuracy(CA),TimetakentoBuildthemodel(TB),TimetakentoTestthemodel(TT),TruePositive (TP) rate,FalsePositive (FP) rate,Precision (P),Recall (R), andF-measure (F) incomparisonwithrunningtheclassifierswiththedefaultparameters?

Additionally,thispaperprovidesacomprehensiveexplanationandinvestigationonthedatapre-processingstagewhichisacrucialpartofanydatamining/machinelearningprojects.Forthis,aclearstep-by-stepdescriptionisprovidedbytheauthors.Furthermore,weprovidedsixcomprehensive

Page 3: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

3

databreachscenarioswithfulljustificationsandexplanationswhichincludes:datatheft,privilegeduserdatabreach,endpointsecurity,shadowitrisk,datasecurity,andsensitivefoldersprotection.

Theremainderofthispaperisorganisedasfollows.InSection2,wereviewtherelatedworkbased on our research questions followed by our data analysis in Sections 3. This is trailed byimplementation, results andanalysis inSection4, conclusionand futurework inSection5, andacknowledgmentaswellasreferences.

2. RELATED WoRK

Insider threatproblemshavebeenstudiedwidelyin theexistingliterature.Thiscoversextensivecategoriessuchas:traitorandmasquerader,real-timeandnon-realtime,host-based,network-basedandhybrid(host-basedplusnetwork-based),userprofiling,useofdifferentdatasetsaswellasmachinelearning-basedsolutions.

Given that thispaper focuseson insider threatdetectionusing supervisedmachine learningalgorithmsforanimbalanceddataset,inthissection,weaddresstheexistingworkwheresupervised,unsupervised,andsemi-supervisedapproacheshavebeenemployed.Forcompleteness,wefirstbrieflyexplainfourpopularcategoriesofMachineLearning(ML)techniques:supervised,unsupervised,semi-supervisedandtransferlearningsasfollows.

SupervisedMLtechniquessuchas:NearestNeighbourmethods(e.g.k-NN),NeuralNetworks(NNs)andSupportVectorMechanism(SVM),builddataclassificationmodelfromlabelledtrainingdatawhereforeverysingleinput(e.g.x1,x2)thereisacorrespondingtarget(e.g.y1,y2).Itisaclassificationproblem,ifthetargetsarerepresentedinsomeclassesandalternativelyaregressionproblem,ifthetargetsarecontinuous.Asthenamesuggests,thereisastrongelementof“supervision”in supervised learning.Lengthy training time,high trainingcostsand the requirements for largeamountsofwell-balanceddataaresomeofthedrawbacksforthismethodology.

InunsupervisedMLtechniquessuchas:GaussianMixturemodels,Self-OrganisingMap(SOM)andGraph-BasedAnomalyDetection(GBAD),thedataisunlabelled.Thismeansforeachsingleinput(e.g.x1,x2),thereisnotargetoutputnorrewardfromitsenvironment.Thegoalofunsupervisedlearningistofindhiddenstructureorrelationamongstunlabelleddataforclusteringorcompressionpurposes.However,despitesupervisedlearning,sincethereisnoelementof“supervision”andthereisnolabel,unsupervisedlearningcan’tprovideevaluationoftheaction.Additionally,neitherpre-determinationonthenumberofclassesnorgettingveryspecificaboutthedefinitionoftheclassesispossible.

Semi-supervisedlearningtechniquessuchas:SelfTraining,GenerativeModels,GraphBasedAlgorithmsandSemiSupervisedSupportVectorMachines(S3VMs),fallbetweensupervisedandunsupervisedlearningbymakinguseofbothlabelleddata(typicallysmallamount)andunlabelleddata(typicallylargeamount)fortraining.Therefore,thegoalistoovercomeoneproblemfromeachcategory:nothavingenoughdatainsupervisedlearningandhavingnoclassificationinun-supervisedlearning.Hence,giventhatgeneratinglabelleddataisoftencostlyasitneedsalotofresourcessuchas manpower and computation while unlabelled data is generally not, semi-supervised learningsoundslikeapowerfulapproachparticularlyforbigdata.However,itsuffersfromsomelimitations.Forinstance,whenthedatapatternschangeinasemi-supervisedapproach,oldassumptionsaboutlabelleddatamayscrewupthenewunlabelleddata.

Transferlearningisamachinelearningtechniquewithanextrasourceofknowledgeinadditiontotraditionaltrainingdata.Someworkintransferlearningincludesextendingpopularclassificationand probability distribution systems such as artificial neural networks, Bayesian networks andMarkov Logic Networks. Three common measures by which transfer learning might improvelearningcomprises:1)performanceimprovementintargettaskwithtransferlearninginadditiontothetraditionallearningfromdatacomparedtothenon-transferlearningscenario,2)totaltimetofullylearnthetargettaskwithandwithouttransferlearningprocess,and3)initialperformancein

Page 4: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

4

thetargettasklearningusingonlythetransferredknowledgecomparedwiththescenariowhereonlytraditionaltrainingdataisused(Torrey,2009).

However,negativetransferavoidancewhereapplyingtransfermethodsdecreasesperformance,automationoftaskmappingtotranslatebetweentaskrepresentations,transferbetweenmorediversetasks,andperformingtransferinmorecomplextestbedsarethechallengesfacingtransferlearningtechniques.

AsummaryoftherecentworkoninsiderthreatdetectionusingmachinelearningtechniquesispresentedinTable1.Weclassifythembasedondifferentmetricssuchas:MLalgorithmse.g.Supervised(S)(Singh,2014)and(Gavai,2015),Unsupervised(U)(Parveen,2011),)Tuor,2017),(Böse,2017)and(Parveen,2012)orSemi-supervised(Se)(Gavai,2015),and(Veeramachaneni,2016).TheaddressedworkinTable1isalsoclassifiedbasedonthesourceofdatae.g.userlogsoravailableinsiderthreatdatasets,setupenvironment,experimentresultsandperformancemetricse.g.FalsePositive(FP)andFalseNegative(FN)rates.ThespecificationsrelatedtoourworkinthispaperisalsopresentedinTable1.

Authors in (Singh, 2014) employed supervised Modified k-Nearest Neighbour (k-NN) withCommunityAnomalyDetectionSystem(CADS)andmetaCADSonnon-real-timeuseraccesslogs.CADSandmetaCADSaretheframeworksthatdon’tdependonanypriorknowledgesuchasuserroleoraccesscontrolmechanismtodetectanomalies.WhileCADSincludestwostepsofpatternextractionandanomalydetection,metaCADScontains twomorestepsofnetworkconstructionassignmentandcomplexcategoryinterface.TheyclaimedthattheModifiedk-NNbeatsk-NN.However,theirpresentedresultdoesn’tshowthisandareratherunclear.Noperformancemetricse.g.FPorFNarepresentedeither.Additionally,someexamplesonuseraccesslogssuchasuser–subjectrelationshipandthesubject–categoryrelationshipwouldhavemadetheirworkclear.

Authors in (Hashem, 2016) used supervised SVM on a combination of real-time humanbiologicalsignalpatternssuchasElectroEncephaloGraphy(EEG)andElectroCardioGram(ECG)withtenvolunteersandthreescenariosthatincludedmaliciousandbenignactivities.Theycaptured86%averagedetectionaccuracywithEEGwhichwasincreasedby5%afteraddingECGsignals.However,justificationfornotincludingElectroMyoGraphy(EMG)signalsinadditiontoEEGandECGsignalshasnotbeenidentified.Additionally,theirresultsfromPrincipalComponentAnalysis(PCA)havenotbeenclearlydetailed.Giventhattheirworkreliesonuserswearingtheheadsetandsensorsallthetimetomeasurethesignalscontinuously,theriskofwearingthemonuser’shealthisalsonotclear.Additionally,notonlywearingheadsetandsensorsconstantlyisinconvenientforauserbutaninsiderislesslikelytowearthemeffectivelyknowingthathis/heractivatesaregoingtobemonitoredcontinuously.

Authorsin(Tuor,2017)employedunsupervisedDeepNeuralNetworks(DNNs)andRecurrentNeuralNetworks(RNNs)onComputerEmergencyResponseTeam(CERT)InsiderThreatDatasetv6.2(CERT)withTenserflow.GiventhattheirworkfunctionsonrawsystemlogsfromCERTinsiderthreatdataset,theyconsidereditasequivalenttorawstreamingdatathuscallingtheirworkonlineandrealtime.TheycapturedCumulativeRecall(CR-k)fordailybudgetsof400and1000whichshowedthatDNNsandRNNsoutperformPrincipalComponentAnalysis(PCA),SVMandIsolationForest(IF).Giventhattheirworkmodelsonlynormalbehaviourandusesanomalyasanindicatorofpotentialmaliciousactivities,FPratehasnotbeenclearlyaddressed.CERTdatasetisbasedonCERTmodelwhichprovidesadvantagessuchassufficientstudyoninsiders’motivationandpsychologicalorbehaviouralaspectsoftheircrimeoroffenseinthefieldofinsiderthreat.However,CERThassomedrawbacks.Forexample,thereisnoinsiderthreatclassificationfortherawsystemlogsinthisdataset.CERTmodelisalsocomplexevenafterthelatestimprovements.Thismeansforusingthismodelagroupofexpertsisalwaysneededtoperformprediction.

Authorsin(Parveen,2012)usedunsupervisedincrementalsequencelearningcombinedwithcompressionlearningforinsiderthreatdetection.Theyworkedon168realbenigntracefilesofusersinUniversityofCalgaryinadditiontotheir25maliciousartificialfiles.ThetracefilesaretheUnixCshelcollected fromfourgroupsofpeopleandcategorisedbasedon theirprogrammingskills.

Page 5: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

5

Table 1. Related work on insider threat detection using machine learning techniques

Author, year ML: S/ U/ Se

ML algorithms Source of Data Environment, Configurations,

Framework

Experimental results

Captured performance

(e.g. FP and FN)

Real Time?

(Singh,2014) S Modifiedk-NearestNeighbour(k-NN)

Useraccesslogs CommunityAnomalyDetectionSystem(CADS)&metaCADSframeworks

ModifiedkNNbeatsk-NN

N/A No

(Hashem,2016) S SupportVectorMachine(SVM)

Tenvolunteersequippedwith:EmotivEPOCheadsetandOpenBCIsensors

Electroencephalography(EEG)andtheelectrocardiogram(ECG)signals

91%averagedetectionaccuracy

Presented:Accuracy,Precision,Recall,F-measure,andError-rate

Yes

(Parveen,2011) U Ensemble-Graph-BasedAnomalyDetection(E-GBAD)withStreamMiningandGraphMining

1998LincolnLaboratoryIntrusionDetectionDataset

Systemlogsfromthedataseteachspecifiesbyatokenwitheightattributes

E-GBADapproachismoreeffectivethantraditionalsingle-modelapproach

0%FNandalowerFPratethansingle-model

No

(Tour,2017)

U DeepNeuralNetworks(DNNs)&RecurrentNeuralNetworks(RNNs)

ComputerEmergencyResponseTeam(CERT)InsiderThreatDatasetv6.2

UsedTenserflow.Tunedthedatabasedonrandomhyper-parametersearche.g.learningrateof0.01forbothDNNandRNN

DNNsandRNNsoutperformPrincipalComponentAnalysis(PCA),SVMandIsolationForest(IF)

Presented:CumulativeRecall(CR-k)fordailybudgetsof400and1000

Yes

(Parveen,,2012) U Compressionandincrementallearning

UsertracefilesoftheUnixCshelinUniversityofCalgaryinadditiontotheirownfiles

168realbenignfilescollectedfromfourgroupsofpeopleinadditiontotheir25artificialmaliciousfiles

TheirapproachperformedwellwithlimitednumberofFPcomparedtostaticapproach

Presented:AverageFPandaverageTPrates

No

(Bose,2017) U Unsupervisedk-NNandkdimensional(k-d)tree

GeneratedbyDARPAADAMSprogram

Real-timeAnomalyDetectionInStreamingHeterogeneity(RADISH)system

k-dtreeismuchfasterthankNN.RADISHperformanceanditsaccuracypresented

N/A Yes

(Veeramachaneni,2016)

Se MatrixDecomposition,NeuralNetworks,andJointProbability

Athree-monthreallogdataproducedtotalof3.6billionlinesgeneratedbyenterpriseplatform

Components:ananalyticsplatform,anunsupervisedrareeventmodeller,afeedbackplatform,andasupervisedlearningmodule

Thesystemlearnstodefendagainstunseenattacks

Presented:TPandFPrates;TPrateimproved;FPratesreducedwhentimeprogresses

Nearlyrealtime

(Gavai2015) Se UnsupervisedModifiedIsolatedForest+SupervisedRandomForests

Areal-worlddatasetnamed’Vegas’forbenignactivitiesinadditiontoartificiallyinjectedinsiderthreatevents

Anunsupervisedandasupervisedapproach:toidentifyabnormalityandtolabelquitters,respectively

Unsupervised:ROCscoreof0.77%,supervised:accuracyof73.4%

Presented:ROC,RecallandPrecision

No

Thiswork,2019 S Supervisedmachinelearningalgorithms:J48SVM,NaïveByes(NB),andRandomForest(RF)

Sixdemoscenariosobtainedfrom(ZoneFox,2017)

Capturedeightfeaturesforfiveusersinfourconsecutivedaysforhourstoconstructfiveuserprofiles

Balancingthedatasetdoesn’timproveperformancemetrics;theimpactofusingdifferentparametersforclassifiersissignificantlystrongeronimbalanceddataset

Presented:Accuracy,TimetakentoBuildthemodel(TB),TimetakentoTestthemodel(TT),TruePositive(TP)rate,FalsePositive(FP)rate,Precision(P),Recall(R),andF-measure(F)

No

Page 6: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

6

TheycapturedFPandTPaverageratesthatshowstheirworkperformedwellcomparedtothestaticapproach.However,themaliciousanomalousdataisartificialandcraftedbytheauthor.Additionally,thispublicdataset(Greenberg,1988)isnowdated.

Authorsin(Veeramachaneni,2016)proposedanearlyreal-timesupervisedandunsupervisedsystemwithananalyst-in-the-loopagainstunseenattack.Theunsupervisedpartoftheirsystemlearnsamodelthatcanidentifyextremeandrareeventsindata.Therareeventsthenpresentedtothesystemanalystwholabelsthemaseithermaliciousorbenign.Attheend,thelabelleddataisprovidedtothesupervisedlearningpartoftheirsystemwhichproducesamodelthatcanpredictwhethertherewillbeattacksinthefollowingdays.Ontheunsupervisedpartoftheirmodel,theyemployedMatrixDecomposition,NeuralNetworksandJointProbabilitytodetectrareeventseachofwhichgeneratesascoreindicatinghowfaracertaineventisfromothers.Theyusedatotalof3.6billionrealloglinesproducedwithin3monthsandpresentedthattheTPrateimprovesbyalmostthreetimesandFPratereducesbyfivetimesastimeprogresses.

Authorsin(Gavai,2015)employedunsupervisedModifiedIsolationForest(IF)andsupervisedRandomForestalgorithmstodiscoverinsiderthreatsbasedonemployees’socialandonlineactivitydata.Theirapproachincludesgeneratingsomefeaturesbasedonemailcontent,work-practiceandonlineactivitywhichisthenusedintheunsupervisedpartoftheproposaltoidentifysuspiciousbehaviour.Thesefeaturesarealsousedinthesupervisedpartoftheirproposalinconjunctionwithquittinglabelstodevelopaclassifier.Theyusedareal-worlddatasetnamedVegasforbenignactivitiesinadditiontotheirownartificiallyinjectedinsiderthreatevents.Theeventsincludepatternsforemailcommunication,webbrowsing,emailfrequency,andfileandmachineaccess.TheyobtainedaROCscoreof0.77%fortheunsupervisedpartoftheirworkandaclassificationaccuracyof73.4%forthesupervisedpartshowingthattheirapproachissuccessfulindetectinginsiderthreatactivityaswellasquittingactivity.However,themaliciousanomalousdataisartificialandcraftedbytheauthors.

Theworkpresentedinthispaperdiffersfromtherelatedworksdescribedaboveaswefocusonemployingmachinelearningtechniquesonextremelyimbalanceddatasetinwhichthemaliciousclassisintheminorityandthebenignclassisinthemajority(i.e.themaliciouseventsarelessthan1.3%ofthetotaldataset).Giventhattheimportanceoftheclassimbalanceproblemhasnotbeenaddressedintheexistingliteraturefortheinsiderthreat,ourmethodologyforexploringtheimpactofdatabalancingderivedfrom(Galar,2011),(Hanifah,2015)&(Pavlov,2010)wheredifferentbalancingtechniqueshavebeendiscussedincludingundersampling,oversamplingandhybrid-basedapproachesfromwhichweemployapopularundersamplingtechniquecalledspreadsubsample.Ourworkinthispaperisbuiltuponourpreviouswork(Moradpoor,2017)whereweusedSelf-OrganisingMap(SOM)inconjunctionwithPrincipalComponentAnalysis(PCA)forinsiderthreatdetectiononthesameimbalanceddatasetthatweuseinthispaper.However,inourpreviouswork,thedatasetwasunlabelled.Giventhatitislesslikelytohaveabalanceddatasetinareal-worldscenario,knowinghowtodealwithanimbalanceddatasetisnotonlycrucialbutalsomorerealistic.

Fortheclassifiers,weconsiderJ48decisiontree,SupportVectorMachine(SVM),NaïveBayes(NB), andRandomForest (RF)algorithmsand forperformancemetricswe studyClassificationAccuracy(CA),TimetakentoBuildthemodel(TB),TimetakentoTestthemodel(TT),TruePositive(TP)rate,FalsePositive(FP)rate,Precision(P),Recall(R),andF-measure(F).Theseperformancemetricsaremainlyderivedfrom(Buczak,2015)whereacomprehensivesurveyofdataminingandmachinelearningtechniquesforcybersecurityintrusiondetectionhasbeenfullydiscussed.

Additionally,weprovideacomprehensivereviewandexplanationsonthedatapre-possessingphase.Thisphase,whichismissingfromtheexistingworkonML-basedinsiderthreatdetectione.g.from(Singh,2014),(Parveen,2011),(Tuor,2017),(Bose,2017),(Hashem,2016),(Gavai,2015),(Parvenn,2012)and(Veeramachaneni,2016),isanextremelyimportantpartofanymachinelearningprojectsgiventhatittransfersarawororiginaldatasettoanunderstandableandmeaningfulformat.Thispartofourworkisencouragedby(Hamed,2018)whereataxonomyofdatapre-processingtechniquesforintrusiondetectionsystemshasbeendiscussed.Inthispaper,thedatapre-processing

Page 7: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

7

phaseincludesoutlieridentificationfromdatacleaningstage,datadecompositionanddataconversionfromdatatransformationstage,anddatabalancingfromdatareductionstage.Thecomprehensiveexplanationsondatapre-processingphaseonextremelyimbalanceddatasetinthispapercanbeusedasaguideandcanbeemployedonanydatamining/machinelearningprojects.

Furthermore,despite(Singh,2014),(Parveen,2011),(Tuor,2017),(Bose,2017),(Hashem,2016),(Gavai,2015),(Parvenn,2012)and(Veeramachaneni,2016),thispapercontainssixcomprehensivedemosetupswhichcoversalltheseriousdatabreachscenarios:datatheft,privilegeduserdatabreach,endpoint security breach, shadow it risk, data security, and sensitive folders breach. Besides, tomakethescenariosmorerealistic,itincludesdifferentarrangementsforthreegroupsofstaffwithinanorganisation:permanentstaff(x3scenarios),temporarystaff(x1scenario)andthird-partystaff(x2scenarios).Ourmethodologyfordefining,justifying,anddevelopingthesixscenariosaboveisenthusedby(Lindauer,2014)wheregeneratingtestdataforinsiderthreatdetectorshasbeendescribed.Theworkin(Legg,2015)alsoinspiredourworkintermsofusinguserandrole-basedprofilesforautomatedinsiderthreatdetection.

3. ANALySING THE DATA

Inthispaper,sixdemoscenarioshavebeenidentifiedwhichhelptoshapethedatasetusedinourexperiments. They have been developed by (ZoneFox, 2017), which is a market leader in UserBehaviourAnalytics,toensurefamiliarisationoftheinsiderthreatsinanorganisation.Inthissection,wediscuss:thesixdemoscenarios,theoriginaldatasetthatweusedforourexperimentsandthedatapre-processingstage(outlieridentification,datadecompositionanddataconversion).

3.1 DEMo ScenariosOursixdemoscenarios includedifferent setups for threegroupsof staff:permanent, temporaryandthirdparty.Thiscontainsthreescenariosforpermanentstaff:DataTheft,EndpointSecurityProcessingandShadowITRisk,onescenariofortemporarystaff:PrivilegedUserDataBreachandtwoscenariosforthird-partymemberofstaff:DataSecurityandProtectSensitiveFolders.Toclarifyapermanentstaffisalong-termmemberofstaffwhoworksforinstanceincompany’sengineeringorsalesteame.g.Charlotte,RebeccaandLaura.Atemporarystaffisashort-termmemberofstaffwhohasbeenemployedforthebusyperiode.g.Timmy.Athird-partymemberofstaffisanindividualwhohasbeenemployedtoworkwithoneoftheclientsystemse.g.Colin.Thenamesareallfictionalandhavebeenusedforthesakeofmakingourdemoscenariosclear.AllsixscenariosarepresentedinTable2anddefinedasfollows.

3.1.1 Demo Sceanrio 1: Data TheftADataTheft foranorganisationcommonly includesstealingsensitive informationrelated to itsstaff, clients or business e.g. usernames, passwords, credit card information, medical records orbusinesssecrets.InDemoScenario1,wefocusonthedatatheftthreatfromemployeeswithinagivenorganisationbyfollowingtheInsiderThreatKillChain(ITKC).ITKCidentifieshumanresourcesasthegreatestrisktoorganisationsanddiscusseshowmembersofstaffwithinorganisationscanworktogethertohelpaverttheserisksbeforetheybecomeaproblem.Toachievethis,weconsiderfourgroupsofpeoplewithintheorganisation:1)peoplewhointendedtoleavetheirjobandalreadyhandedintheirnotice,2)peoplewholookaroundfileserversrepeatedly,3)peoplewhodownloadbackupsoftwareontheirdesktopcomputersand4)peoplewhocopyzipfilestotheirremovabledevices.Forexample,asitisshowninTable2DemoScenario1,Charlotte,whoisapermanentmemberofstaffworkinginthecompany’sengineeringteam,backsupfilestoaremovablediskdrive.

Page 8: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

8

3.1.2 Demo Scenario 2: Privileged User Data BreachIntrudersneedprivilegedaccess,forinstanceadmin’susernameandpassword,tocarryoutmaliciousactivitiesandproceedattackstothenextphase.Mischievousactivitiessuchasinstallingmalicioussoftware,e.g.malware,ransomwareorbackdoor,stealingsensitiveinformation,e.g.usernamesandpasswords,orevendisablinghardwareand/orvitalsoftwareonavictim’scomputer,e.g.anti-malwareoranti-spyware.

Thatiswhyprivilegeduseraccountsareinsuchhighdemandsbyintruders.Infact,therecanbesomanysharedprivilegedaccountswithinanorganisationwithnomanagementknowledgeonwhereallthoseaccountsresideorwhohasaccesstothem.InVerizonDataBreachInvestigationsReportfor2016(Enterprise,2017),63%ofallbreacheswereduetousingweakorcommonpasswordswhile53%ofthemweredowntothemisuseofprivilegedaccounts.Privilegedaccountsinclude:localadminaccounts,privilegeduseraccounts,domainadminaccounts,emergencyaccounts,serviceaccountsandapplicationaccounts.Anefficientprivilegemanagementsystemwouldautomaticallyidentifytheseaccountsandbringthemunderamanagementsystemumbrella.InDemoScenario2,pinpointingprivilegeduserdatabreachisdonebygoingthroughthefollowingsteps:1)identifyingprivileged user accounts, 2) identifying sensitive files that could only be accessed using thoseaccounts,and3)monitoringtheaccesstothosesensitivefiles.Forexample,asitisshowninTable2DemoScenario2,Timmy,whoisatemporarymemberofstaffwhohasaccesstoprivilegeduseraccounts,accessesafile,i.e.minutesofcompany’sFebruarymeeting,onthenetworkfilesystemthathedoesnotneedaccessto.

3.1.3 Demo Scenario 3: Endpoint Security ProcessingInanorganisation,endpointsincludedesktopcomputers,mobiledevices(e.g.laptops,smartphonesandtablets),printers,scannersorevenbarcodereaders.Endpointsecuritymanagementisanetworksecurityapproachthatrequiresendpointdevicestofollowthesecuritypolicyofagivenorganisation.Thisneedstobemetbeforeaswellasaftergrantingaccesstonetworkresources.InDemoScenario3,weconsiderendpointsecurityprocessingbymonitoringthestaffactivitiesondesktopcomputersacrosstheorganisation.Forinstance,turningoffordisablingasystem’santi-malwaree.g.popup

Table 2. Six demo scenarios of users

Demo scenarios Staff type Staff name

Application Resource Description

Demoscenario1:Data Theft

Permanent Charlotte bbackup.exe rm:\\e:\\myusb\\backup.zip Charlottebacksupfilestoaremovablediskdrive.

Demoscenario2:Privileged User Data Breach

Temporary Timmy N/A nfs:\\......\\fileshare2\\boardminutes\\minutes-february.txt

Timmyaccessesafileonthenetworkfilesystemthathedoesnotneedtoaccessto.

Demoscenario3:Endpoint Security Processing

Permanent Laura Savservice.exe(avsoftware)

N/A Lauradeactivatestheanti-virussoftwareonhercomputer.

Demoscenario4:Shadow IT Risk

Permanent Rebecca dropbox.exeSkype.exe

c:\\users\\rebecca\\dropbox\\plan2.docc:\\users\\rebecca\\dropbox\\plan1.docnfs:\\....\fileshare1\\engineeringplans\\plan1.docnfs:\\....\fileshare1\\engineeringplans\\plan2.doc

RebeccausesDropboxtoperformunauthorisedbackupstotheCloud.RebeccausesSkypetoperformunapproveduploadsfromnetworkfilesystemtoCloud.

Demoscenario5:Data Security

Third-party

Colin N/A nfs:\\......\fileshare1\\engineeringplans\\plan1.docrm:\\f:\\copyto\\remdrive\\ipdata.txt

Colinaccessesafileonthenetworkfilesystemthatheisnotsupposedtoaccessandcopiesittoaremovablediskdrive.

Demoscenario6:Protect Sensitive Folders

Third-party

Colin N/A nfs:\.....\fileshare1\PatientData\AliceBrownMedicalRecord.txt

Colinaccessessensitiveinformation(AliceBrown’smedicalrecord)onthenetworkfilesystemthatheshouldhavenoneedtoaccessto.

Page 9: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

9

blockers,anti-spyware,anti-spam,host-basedfirewallsandanti-viruses.Forexample,asitisshowninTable2DemoScenario3,Laura,whoisapermanentmemberofstaffworkinginthecompany’ssalesteam,deactivatestheanti-virusonherdesktopcomputer.

3.1.4 Demo Scenario 4: Shadow it RiskShadowITreferstosoftwareorapplicationspurchased,downloaded,installed,usedormanagedoutsideorwithouttheknowledgeofanorganisation’sITdepartment.TheycanbedefinedastheITassetsthatareinvisibletoanorganisation’sITdepartment.ShadowIThasgrownexponentiallyinrecentyears.ThisisdowntothegoodqualityofapplicationsintheCloud,mobiletechnologygrowthandtherapiddevelopmentinSoftwareasaService(SaaS),suchas:Dropbox,CiscoWebEx,GoogleApps,Salesforce,Skype,andMicrosoftOffice365.ASaaS,whichisalsoknownasCloudsoftware, on-demand software or hosted software, is a way of delivering applications over theInternet.AgivenSaaSmayormaynotofferstrongsecurityprotectionse.g.identitymanagement,authentication,accesscontrol,securebackuppractices,datamaskingordataencryption.Therefore,itcanexposeanorganisationand/oritsaffiliatestodatalossriskandsecurity-relatedthreatssuchasinsiderthreats.InDemoScenario4,weconsiderShadowITrisk,whichisimposedbyemployeeswithintheorganisationusingunknown/unauthorisedSaaSe.g.DropboxorSkype.Forexample,asitisshowninTable2DemoScenario4,Rebecca,whoisapermanentmemberofstaffworkinginthecompany’sengineeringteam,installsDropboxtoperformunauthorisedbackupsandSkypetoaccomplishunapproveduploadstotheCloud.

3.1.5 Demo Scenario 5: Data SecurityDataSecurityreferstoprotectivemeasuresappliedtopreventunauthorisedaccessandcorruptiontoresourcessuchasendpointdevices(e.g.computers,printersandtablets),databases,websites,andcomputerfiles.Backups,dataencryption,authenticationanddatamaskingarethecommonlyuseddatasecuritytechniques.Forexample,datamaskingisamethodofprotectingtheactualdatabycreatingastructurallysimilarbutfakeversionofanorganisation’sdataforpurposessuchastrainingorsoftware/algorithmtesting.InDemoScenario5,weconsidermonitoringresourcesthatathird-partymemberofstaffisnotsupposedtoaccessormakeacopyfrome.g.accessingasensitivedocumentorcopyingintellectualpropertyfilestoaremovabledisk.Forexample,asitisshowninTable2DemoScenario5,Colin,whoisathird-partymemberofstaff,accessesandcopiesintellectualpropertydatatoaremovablediskdrivethatheisnotsupposedtodo.

3.1.6 Demo Scenario 6: Protect Sensitive FoldersIngeneral,datacanbecategorisedaseitherpublic,restrictedorprivate.Thepublicdatasuchas:pressreleases,courseinformationorresearchpublicationscanbeavailabletoanyonewithlittleornocontrolstoprotecttheirconfidentiality.Therestricteddatasuchas:creditcardnumbers,passwordsor personal medical information are those for which the unauthorized disclosure, alteration ordestructionofthemcouldcauseasignificantlevelofrisktoanorganisationoritsaffiliates.Theprivatedatasuchas:homeaddress,birthdate,gender,religiousorsexualorientationarethoseforwhichtheunauthorizeddisclosure,alterationordestructionofthemcouldresultinamoderatelevelofrisktoanorganisationoritsaffiliates.InDemoScenario6,wefocusonprotectingfolderswithrestricteddatae.g.foldersthatcarrymedicalrecordsforindividualsmaintainedbytheorganisation.Forexample,asitisshowninTable2DemoScenario6,Colin,whoisathird-partymemberofstaff(i.e.acontractor),accessesrestricteddata(i.e.oneofthestaff’smedicalrecords)whichheshouldhavenoneedtoaccess.

3.2 original DatasetInthissection,weexplainthedatasetthatweusedforourexperiments.Asmentionedbefore,theoriginaldatasethasbeengivenby(ZoneFox,2017)andcoversallthesixscenariosthatwedefined

Page 10: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

10

in the previous section. It includes Charlotte, Rebecca, Laura, Timmy and Colin’s user profilescapturedonfourconsecutivedays.Thedatasetalsocontainstheadministrator’snetworkactivities.Asdiscussedbefore,eachuseriseitherapermanentmemberofstaff(e.g.Charlotte,Rebecca,LauraorAdministrator),atemporarymemberofstaff(e.g.Timmy)orathird-partymemberofstaff(e.g.Colin).Theoriginaldatasetisin.CSVformatandcontains2643linesofrawdatawhichincludessixfeatures:Date-Time,machine_ID,user_ID,application,actionandresource.Eachlineof thedatasetidentifiesanactiondonebyoneoftheusersintheuserpool(i.e.Charlotte,Rebecca,Laura,Timmy,ColinorAdministrator).Forexample,onelineofthedatasetspecifiesthaton2016-02-23at16:30:27(Date-Time),employeeA(user_ID)usedcomputerB(machine_ID)torunbackup.exe(application)andwrite(action)thebackupintopathD(resource).

3.3 Data Pre-ProcessingIn this section, we explain the pre-processing phase employed on our original dataset given by(ZoneFox,2017).Thisphaseisanimportantpartofanymachinelearningprojectsgiventhatittransfersarawororiginaldatasettoanunderstandableandmeaningfulformat.Rawdataisoftenincomplete,noisy,inconsistenceand/orlackingcertainbehaviours,attributesorstylesandislikelytogenerateorcompriseerrorsiffedintactintomachinelearningalgorithms.Datapre-processingisaprovenmethodofresolvingtheseissues.Itcomprisesstagessuchas:datacleaning,dataintegration,datatransformation,datareductionanddatadiscretisationeachcontainingvarioustasks.Forexample,data transformation comprises tasks such as normalisation and aggregation while data cleaningincludesfillinginmissingdata,outlieridentification,outlierremovalandinconsistencyresolution.

Inthispaper,thedatapre-processingphaseincludesfourtasksof:outlieridentificationfromdatacleaningstage,datadecompositionanddataconversionfromdatatransformationstage,anddatabalancingfromdatareductionstageasfollows.

3.4 Data CleaningInanydatascienceproject,datacleaningisacriticalpartofthedatapre-processingphase.Thisincludestaskssuchas:fillinginmissingvalues,identifyingoutliers,smoothingoutnoisydataorcorrectinginconsistentdata.Forexample,forfillinginmissingvaluesalearningalgorithmsuchasBayesordecisiontreecanbeusedtopredictthem.Additionally,domainknowledgeorexpertdecisioncanbeemployedtocorrectinconsistentdata.Inthispaper,weusedoutlieridentificationtaskfromdatacleaningstageasfollows.

3.4.1 Outlier IdentificationIn data science, “outliers” are values that “lie outside” the other values e.g. in the scores of:{3,22,25,23,29,33,85},3and85areoutliergiventhattheyarefarawayfromthemaingroupofdata:{22,25,23,29,33}.Indatamining,outlieridentification,whichisalsoknownasanomalydetection,referstoidentificationofitems,eventsorbehaviourswhichdonotfollowanexpectedpattern.Itisanobservationofthedatathatdeviatesfromotherobservationssomuchthatitawakenssuspicionsthatitwasgeneratedbyadifferentand/orunusualmechanism(Williams,2002).Inthedatapre-processingphase,outlieridentificationisapartofthedatacleaningstageandincludestaskssuchasbinning,clusteringandregression.Inthispaper,outliermeansinsiderthreatandoutlieridentificationreferstodetectionofdigitaltasksperformedbyemployeeswhichtheyarenotsupposedtodo/ortheyshouldhavenoneedtodo.Inordertoperformoutlieridentificationonouroriginaldataset,wesurfedthrougheacheventmanuallyandmarkeditaseither1,representinganoutlierevent,or0,representinganon-outlierevent.Thisisdecidedbyconsideringthesixdemoscenariosexplainedintheprevioussection(i.e.DataTheft,EndpointSecurityProcessing,ShadowITRisk,PrivilegedUserDataBreach,DataSecurity,SensitiveFoldersProtection)alongwiththeindividual’srolewithintheorganisation.Theoriginaldatasetisin.CSVformatandincludes2643linesofuseractivitiesforfiveindividuals:(i.e.Charlotte,Rebecca,Laura,TimmyandColin).Theoutlieridentificationtask

Page 11: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

11

resultedinidentifyingatotalof33outlieractivitiesoutof2643eventswithintheoriginaldataset.AsitisdepictedinTable3,totaloutliereventsof6,5,11,1and10belongtoCharlotte,Rebecca,Colin,LauraandTimmy,allrespectively.

Given that this paper is based on a supervised machine learning approach, the outlieridentificationtaskaddsanextrafeatureof“Outlier”toourdatasetwhichisalsoknownas“label”.Outlieridentificationtaskorlabellingtheentiredatasetisdonetoprovideaplatformtoevaluatetheperformanceofoursupervisedmachinelearningapproach.Italsoassistswithourfutureworkwhichisbasedonanunsupervisedmachinelearningapproach.Additionally,labellingtheentiredatasetwillassistusinimplementingasemi-supervisedapproachforourpossiblefutureworkwhichisanimplementationofasupervisedmethodinadditiontotheunsupervisedmethodthatcouldhaveapotentialtoimprovetheaccuracyrateforinsiderthreatdetection.

3.5 Data TransformationMachinelearningalgorithmswereunlikelytoprocessanydatacorrectlyand/orproduceanyaccurateresultssuchaspredictionsifthedatadoesnotfollowasimilartype.Therefore,weemployeddatatransformationstageonouroriginaldatasetduringthepre-processingphasetotransformthemintoasimilardatatype.Thisincludestwostepsofdatadecompositionanddataconversionasfollow.

3.5.1 Data DecompositionInadataset,decomposingsomevaluesintomultiplepartswillhelpamachinelearningalgorithmincapturingmorespecificrelationships.Forinstance,datadecompositiononafeaturesuchas“date”representedas“Tues;01.04.2017”into:dayofaweekandmonthofayearmayprovidemorerelevantinformation.Infact,datadecompositionistheoppositeofdatareductiongiventhatitwilladdmoredatatotheoriginaldataset.

Inthispaper,weemploydatadecompositiononDate-Time,application,actionandresourcefeaturesinourdatasettocapturemorespecificrelationshipbetweentheindividualuserbehaviourandinsiderthreat.

Forinstance,inouroriginaldataset,Date-Timerepresentedas2016-02-23T16:26:33Zindicatinga user event that has happened on 23th of February 2016 at 16:26:33 hours. This includes twocharacters:T,whichcanbereadasanabbreviationforTime,andZ,whichstandsforzero-timezoneasitisoffsetby0fromtheCoordinatedUniversalTime(UTC).

AfterdatadecompositionDate-Timefeaturebreakdownto:Year(2016),Month(02),Season(Winter),Day (23),WeekDay(Tuesday),WeekPortion (Middleofweek),Hours (16),Minutes(26),Seconds(33)andTimeofDay(Afternoon).Likewise,theapplicationfeatureisdecomposedtoapplicationtype(i.e.systemapplicationoruserapplication)andapplicationID,actionfeatureisdecomposedtoactiontype(i.e.relatedtoprocessesorrelatedtofiles)andactionIDandresourcefeatureisdecomposedtoresourcelocation(i.e.localcomputer,localnetwork,removablememoryorCloud),fileextension(e.g..txt,.zipor.bkc)andfolderdepth(e.g.1,2or3).

Table 3. Outlier identification

Employee Name Total Events 441/2643454?? Outlier Events 33/2643

Charlotte 146 6

Rebecca 75 5

Colin 57 11

Laura 135 1

Timmy 28 10

Page 12: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

12

3.5.2 Data ConversionOften,machinelearningalgorithmsrequirethedatainspecificswaysbeforefeedingthemintothemodel.Thiscanbedoneduringthedataconversiontaskwhichistheconversionofdatafromoneformattoanotherformate.g.fromcategoricaldatatonumericvalues.Categoricaldataarethedatathatcontainlabelvalues,whichisoftenlimitedtoafixedset,ratherthannumericalvalues.Someexamplesinclude:“weekday”variablewiththevaluesof:“Monday”,“Tuesday”and“Wednesday”or“season”variablewiththevaluesof:“spring”,“summer”and“autumn”.Therefore,giventhatmostmachinelearningalgorithmscan’toperateoncategoricaldatadirectlyandrequireallinputandoutputvariablestobenumeric,integerencodingalsoknownasnumericconversionandone-hotencodingcanbeused.Inintegerencoding,eachcategoricaldataisassignedanintegervaluee.g.“Monday”is1,“Tuesday”is2,and“Wednesday”is3.However,inone-hotencoding,abinaryvariableisaddedforeachvaluee.g.inthe“weekday”variableexampleabove“Monday”is001,“Tuesday”is010,and“Wednesday”is011.

Inthispaper,thedataconversiontaskhasbeenemployedon13outof20featuresofourdataset.This includesnumeric conversion for:Date-Time, season,weekday,weekportion, timeof day,machine_ID,user_ID,application_type,application_ID,action_type,action_ID,resource_locationandfileextension.

Forexample,inourmaindataset,theDate-Timefeatureforeacheventcontainsstandarddateandtimeformatincludingbothnumericandtextcharacter.Forinstance,2016-02-23T16:26:33Zrepresentsausereventthathappenedon23rdofFebruary2016at16:26:33hours.Thisincludestwocharacters:TforTime,andZforzero-timezoneasitisoffsetby0fromtheCoordinatedUniversalTime(UTC).ForDate-Timeconversion,wefirstsplitthisfeaturetodateandtime.Thisgivesus:2016-02-23Tand16:26:33Zforourrunningexampleabove.WethenremovedTandZcharactersandformattedthedateto:23/02/2016whiletimestaysthesame:16:26:33.WethenconverteddateandtimetoaUNIXtimestampwhichresultsin1456185600and59193forthedateandtime,respectively.Inthelaststep,wecombinedthemtogetherwhichgivesus:1456244793.UNIXtimestampisthenumberofsecondsbetweenaparticulardateandtheUnixEpochonJanuary1st,1970atUTC.WeruntheUnixtimestampconversionontheDate-Timefeaturefortheentiredataset.

Besidesthis,aswediscussedintheprevioussection,theDate-Timefeaturewasfurthersplitintoyear,month,season,day,weekdayandweekportion(forDatefeature)andhours,minutes,secondsandtimeofday(forTimefeature)duringthedatadecompositiontask.Therefore,weallocatea1-4rangetoseason,1-7toweekday,1-4toweekportionand1-4totimeofdayinthedataconversiontask.However,year,month,day,hours,minutesandsecondsremainunchangedduringthistask.

Furthermore,inouroriginaldataset,eachmachine_IDisacombinationofnumbers,uppercaseandlowercaseletterse.g.4RcZBZz.Wedefined15,000–19,000rangeformachine_IDdataconversionfromwhichanintegervalueisassignedtoeachcomputer.Forexample,16002hasbeenassignedtoacomputerwithanidof4RcZBZz.Likewise,1000–1500rangeisallocatedtotheuser_IDsinourdataset.ThismeansauniquenumericalvalueforAdministrator,Charlotte,Rebecca,Laura,TimmyandColine.g.1301hasbeenassignedasauser_IDtoColin.

Similarly,20–99rangehasbeenassignedtoapplication_ID,200-499toaction_ID,0-4toresource_locationand0-19tofileextension.Moreover,intermsofapplication_typeandaction_type,0representssystemsapplicationandprocessrelatedactionsand1representsuserapplicationandfilerelatedactions,bothrespectively.

Toputitinanutshell,sevenfeaturesofyear,month,day,hours,minutes,secondsandfolderdepth,remainunchangedduringthedataconversiontask.AllthearrangementsfordataconversionhavebeenidentifiedinTable4.

3.6 Data ReductionThisstageofthedatapre-processingphaseincludesstepsasfollows(Datapre-processing,2018).“Reducingthenumberofattributes”:forinstance,removingirrelevantattributesorusingprinciple

Page 13: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

13

componentanalysistosearchforalowerdimensionalspacethatcanbestrepresentthedata.“Reducingthenumberofattributevalues”:forexample,byclusteringwheresimilarvaluesgroupedinclusters.“Reducingthenumberofdatasamples”:forinstance,withinthemajorityclasstoreducethedegreeofimbalancedatadistributionandproduceabalanceddataset.Thisisalsoknownasdatabalancing.Inthispaper,weonlyemploydatabalancingduringthedatareductionstageasfollows.

3.6.1 Data BalancingIn machine learning and data mining, the imbalanced class distribution is a scenario where thenumberofinstancesbelongingtooneclassissignificantlylowerthanthosethatbelongtotheotherclasses.Insuchascenario,aswehaveinthispaper,theclassificationcouldbebiasedandinaccurate.Thishappensduetothefactthatmachinelearningalgorithmsareusuallydesignedtoincreasetheaccuracybyreducingtheerror.Therefore,toachievethis,theydonotconsidertheclassdistribution,classproportion,orbalanceofclasses.Therearevariousapproachesforsolvingclassimbalanceproblems.Thisincludestwoareasof:1)resamplingand2)classifiermodifications.Inthispaper,wefocusonthefirstcategorywherewereassembletheoriginaldatasettoprovidetwobalancedclasses(i.e.maliciousandbenign).Forthis,weuseWeka’sSpreadSubsamplefeature(Weka.ClassSpreadSubsample,2018)wheretheoriginaldatasetmustfirstfitentirelyinthememory.Thisincludestheentireimbalanceddatawhichcontainsmaliciousandbenignevents.Thenweneedtospecifythemaximum“spread”betweentherarestclass(i.e.“malicious”whichislabelledasclass“1”withthetotaleventsof“33”)andthemostcommonclass(i.e.“benign”whichislabelledasclass“0”withthetotaleventsof“2643”).Forinstance,“0”fordistributionSpreadofSpreadSubsamplemeansnomaximumspread,“1”meansuniformdistributionand“10”meansallowatmosta10:1ratiobetweentheclasses.Inthispaper,wechose“1”representinguniformdistributionbetweentwoclasses.Thisallowsustokeepallthemaliciousevents(i.e.33maliciousevents)plustheequalnumberofthebenigneventsselectedrandomly(i.e.33benignevents).Attheend,wewillhave66eventsintotalwithauniformdistributionofmaliciousandbenignevents(i.e.33maliciousand33benignevents).

4. IMPLEMENTATIoN, RESULTS AND ANALySIS

Inthissection,weexplainourcapturedresultsafterrunninganumberofpopularmachinelearningalgorithmswithdifferentparametersonourdatasets.Thisisdoneontwocopiesofourdatasets:balancedandimbalanced.Bothdatasetshavebeenpassedthroughthedatapre-processingphaseincluding:outlier identification,datadecompositionanddata conversion.However, as thenamesays,thebalanceddatasetincludesanextrastepofdatabalancing.WeuseWeka3(Weka,2018)inourexperimentswhichisapopularandpowerfultoolfordataminingandmachinelearning.Ithasacollectionofmachinelearningalgorithmsandcontainstoolssuchasdatapre-processing,classificationandvisualization.Wekaisalsowell-suitedfordevelopingnewmachinelearningschemes.Theresultscapturedfromourmachinelearningschemesareanalysedasfollows.

4.1 Supervised Machine Learning ResultsSupervisedlearningcanbecategorisedintoclassificationproblemswhentheoutputisaclassandregression problem when the output is a real value. Some popular supervised machine learningalgorithmsare:NaiveByes(NB)forregressionandclassificationproblems,LinearRegression(LR)forregressionproblems,RandomForest(RF)forregressionandclassificationproblems,SupportVectorMachines(SVM)forclassificationproblems,andNeuralNetworks(NNs)forregressionandclassificationproblems.Givenourdatasetsfromtheearliersection,theprobleminthispaperisaclassificationproblemaswewantoursupervisedapproachtoexplicitlyidentifyagiveneventeitherasbenign(i.e.itbelongstoclass“0”)ormalicious(i.e.itbelongstoclass“1”).Inthissection,weemployanumberofpopularmachinelearningalgorithmsforinstanceJ48decisiontree,SupportVector

Page 14: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

14

Machine(SVM),NaïveByes(NB),andRandomForests(RF)configuredwithdifferentparametersalongwithseveraltechniques(e.g.singleclassifiersandbalancedvs.imbalanceddatasets).

Itisworthmentioningthat,forbothbalancedandimbalanceddatasets,weraneachexperiment10timeswithseedvalueof1to10andsplitpercentagesof90%fortrainingand10%fortesting.

4.1.1 Result ComparisonsThereareseveralparametersavailableintheJ48classifier(Weka.ClassJ48,2018).However,weconsidereditstwoimportantparameters:ConfidenceFactor,alsoknownasC,andMinimumNumberofObjects,alsoknownasM.TheCisusedforpruningwhereadditionalstepsareaddedtolookatwhatnodesorbranchescanberemovedtomakethetreesmallerandeasiertounderstandwithout

Table 4. Data conversion

Field name Raneg

Date-Time UNIXtimestampconversion

Date year unchanged;e.g.2016

month unchanged;e.g.2(i.e.February)

season spring summer automn winter

1 2 3 4

day unchanged;e.g.23

weekday numaricvalues:1,2,3,4,…,7(e.g.1forMonand7forSun)

weekportion begginingofweek middleofweek

endofweek weekend

1(i.e.forMon)

2(i.e.forTues-Wed)

3(i.e.forThur-Fri)

4(i.e.Sat-Sun)

Time hours unchanged

minutes unchanged

seconds unchanged

timeofday morning afternoon evening night

1 2 3 4

machine_ID 15,000–19,000

user_ID 1,000–1,500

application type 0:Systemapplications 1:Userapplications

ID 20–49;e.g.43fortaskmgr.exe

50–99;e.g.73forbackup4all.exe

action type 0:Relatedtoprocesses 1:Relatedtofiles

ID 200-399 400-499

resource location N/A localdrive

networkdrive removabledrive Cloud

0 1 2 3 4

fileextension N/A .ctm .rls .Ing .txt …. .bkc

0 1 2 3 4 …. 19

folderdepth unchanged;e.g.2forc:\backup\backup.ctm

Page 15: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

15

affectingtheperformancetoomuch.Ingeneral,smallervaluesforCincurmorepruning.TheMidentifiestheminimumnumberofinstancesoreventsperleaf.Thedefaultandconfiguredvalues(alsoknownashyperparameters)for(C,M)are(0.25,2)and(0.5,10),allrespectively.

ThereareseveralparametersavailableintheSVMclassifier(Weka.ClassSMO,2018).However,weconsidereditsimportantparameterof:ComplexitywhichisalsoknownasC(Saarikoski,2011).TheCvaluetellstheSVMoptimisationhowmuchwewanttoavoidmisclassifyingineachtrainingexample.Generally,smallervaluesofCgeneratemoremisclassifiedexamples.ThedefaultvalueforCis1whichweconfiguredto100inourexperiments.

ThereareseveralparametersavailableintheNBclassifier(Weka.ClassNaiveBayes,2018).However,weconsidereditstwoimportantparametersof:KernelEstimator(KE)andSupervisedDiscretisation(SD).TheKEparameteruseskernelestimatorfornumericattributesratherthananormaldistribution.TheSDparameterusessuperviseddiscretizationtoconvertnumericattributestonominalones.Bydefault,theKEandSDvaluesaresettofalseandwechangedthemtotrueatthetimetomonitortheirimpactontheresults.

AddressingtheRQ1andRQ3,inthefirstsetofoursupervisedlearningexperiments,wewanttoidentifytheimpactof:1)balancingthedatasetand2)changingtheCandMvaluesforJ48,CvalueforSVM,andKEandSDvaluesforNBonthemetricssuchas:classificationaccuracy,timetakentobuildthemodel,timetakentotestthemodel,TruePositive(TP)rate,FalsePositive(FP)rate,precision,recall,andf-measure.

Weraneachexperimenttentimeswiththeseedvalueof1to10,measuredtheweightedaverage,andthenrepresenteditasourresult.Foreachexperiment,weconsidered90%splitfortrainingand10%fortesting.Thishasbeendoneforbothbalancedandimbalanceddataset.Additionally,J48(DF)representsJ48withthedefaultvalueof0.25forCand2forM,J48(2)representsJ48withtheconfiguredvalueof0.5forCand10forM,SVM(DF)representsSVMwiththedefaultvalueof1forC,SVM(2)representsSVMwiththeconfiguredvalueof100forC,NB(DF)representsNBwiththedefaultvalueswhereSDandKEarebothsettofalse,NB(SD)representsNBwiththeconfiguredvalueoftrueforSD,andNB(KE)representsNBwiththeconfiguredvalueoftrueforKE.

Theclassificationaccuracies(weightedaverage)onthebalancedandimbalanceddatasetsarecapturedinFigure1andFigure2,respectively.Addressingthepresentedresults,allalgorithmsexceptNB(DF)showabetterperformanceontheimbalanceddataset incomparisonwiththebalanceddataset.NB(DF)algorithmshowsaround1.03%classificationaccuracyboostwhenusingthebalanceddataset.Therefore,answeringRQ1,balancingthedatasetdoesn’timprovetheclassificationaccuracyoverall.Additionally,answeringRQ3,changingparametersforeachclassificationalgorithmhavemoreeffectwhenusingimbalanceddatasetincomparisonwiththebalancedone.Indetailsandusingimbalanceddataset,J48(2)performsbetterthanJ48(DF),SVM(DF)performsbetterthanSVM(2),andNB(SD)performsbestincomparisonwithNB(DF)andNB(KE).However,itonlyhaseffectonthebalanceddatasetwhenweuseNBalgorithms.

WepresentthetimetakentobuildandthetimetakentotestthemodelonthebalancedandimbalanceddatasetsinFigure3andFigure4,respectively.Addressingthecapturedresults,forallthealgorithmsonbothdatasetsexceptNB(KE),thetimetakentobuildthemoduleismoreandsometimessignificantlymorethanthetimetakentotestthemodel.AnsweringRQ1,balancingthedatasetdoessignificantlyimprovethetimetakentobuildthemodelforallsevenalgorithms.ThisisthesamecaseforthetimetakentotestthemodelexceptforJ48(DF)inwhichwehaveanequaltimefortestingonbothdatasets.Additionally,answeringRQ3,changingparametersforeachclassificationalgorithmhaveeffectondatasetsbutitisratherunsteadywhenwecomparethem.Indetail,ontheimbalanceddataset,J48(DF)performsbetterthatJ48(2)andSVM(DF)performsbetterthanSVM(2)intermsoftimetakentobuildandtotestthemodel.However,thisistheoppositeforthebalanceddataset.

InFigure5andFigure6,wepresentthetruepositiveandthefalsepositiveratesonthebalancedandimbalanceddatasets,respectively.Addressingthecapturedresults,forallthealgorithmsandonbothdatasets,thetruepositiverateissignificantlymorethanthefalsepositiverate.Forthebalanced

Page 16: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

16

dataset,thehighesttruepositiverateequallybelongstoJ48(DF),J48(2),SVM(DF),andSVM(2)whilethelowestratebelongstoNB(SD).Wealsorealisedanearly2.27%jumpinthefalsepositiveratewhenwechangedtheclassifiertoNBalgorithms.Fortheimbalanceddataset,thehighesttruepositiveratebelongstoSVM(DF)whilealltheNBsshowpoorperformancesingeneral.However,alltheNBsperformbetterintermsoffalsepositiverateincomparisonwithSVMandJ48.AnsweringtheRQ1,balancingthedatasetdecreasesthetruepositiverateexceptforNB(DF)andimprovesthe falsepositive rateexcept forNBalgorithms.Additionally,answering theRQ3,changing thealgorithm’sparametersimprovethetruepositiveandfalsepositiveratesbutwithstrongerimpactontheimbalanceddatasetthanthebalanceddataset.

Figure7andFigure8representtheprecision,recall,andf-measureforallsevenalgorithmsonthebalancedandimbalanceddatasets,respectively.Ingeneral,thesethreemetricsarehigherontheimbalanceddatasetincomparisonwiththebalanceddataset,exceptforNB(DF)recall.Therefore,answering the RQ1, balancing the dataset does not improve the precision, recall and f-measureoverall.Giventheimbalanceddataset,wenoticedthatchangingtheparameterstoJ48(2),SVM(DF)andNB(SD)doimprovetheprecision,recall,andf-measure.However,thisisthecaseonlyforNB(DF)andNB(KE)onthebalanceddataset.Therefore,answeringtheRQ3,changingthealgorithm’sparametersimprovetheprecision,recall,andf-measurebutwithstrongerimpactontheimbalanceddatasetthanthebalanceddataset.

Figure 1. Classification accuracy (weighted average) using the balanced dataset

Page 17: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

17

Table5presents thesummaryofour results relating toAccuracy,Time toBuild themodel(TB),TimetakentoTestthemodel(TT),TruePositive(TP)rate,FalsePositive(FP)rate,Precision(P),Recall(R),andF-measure(F)withafocusonbalancedandimbalanceddatasets.Inthistable,weidentifyalgorithmswhichproducethebestresultswith“*”intermsofthemetricsabovewhere“B”represents“BalancedDataset”and“U”represents“ImbalancedDataset”.Wescoreeachandcalculate theoverallscorefor the table.Overall,ourexperimentsonthe imbalanceddataset(i.e.originaldataset)showabetterperformanceincomparisonwiththebalancedoneexceptforthetimetakentobuildandthetimetakentotestthemodel.Therefore,answeringRQ1,foreachclassifier,balancingthedatasetdoesn’timprovemetricssuchas:ClassificationAccuracy,TruePositiverate,FalsePositiverate,Precision,Recall,andF-measureingeneral.However,itimprovesthetimetakentobuildandthetimetakentotestthemodel.

Table6and7representthesummaryofourresultsregarding:ClassificationAccuracy,TB,TT,TPrate,FPrate,P,R,andFwithafocusontheimpactofemployingdifferentparametersineachclassifieronthebalancedandimbalanceddatasets,respectively.InTable6andTable7,“e”represents“EqualResults”and“*”represents“TheBestPerformance”.

Addressingtwotables,runningeachclassifierwithdifferentparamentshasstrongerimpactwhenusingtheimbalanceddataset,Table7,comparedwiththeresultsfromthebalanceddataset,Table6.Indetails,asTable6represents,forthebalanceddataset,runningJ48andSVMwithdifferentparametersresultinalmostequaloutputsforClassificationAccuracy,TB,TT,TPrate,FPrate,P,

Figure 2. Classification accuracy (weighted average) using the imbalanced dataset

Page 18: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

18

Figure 3. Time taken to build/test the model using the balanced dataset

Figure 4. Time taken to build/test the model using the imbalanced dataset

Page 19: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

19

Figure 5. True positive/false positive using the balanced dataset

Figure 6. True positive/false positive using the imbalanced dataset

Page 20: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

20

Figure 7. Precision/recall/f-measure using the balanced dataset

Figure 8. Precision/recall/f-measure using the imbalanced dataset

Page 21: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

21

R,andFeach.However,thisisnotthecaseforNBwhereNB(SD)givesthelowestscorecomparedwithNB(DF)andNB(KE).

Moreover,asTable7represents,fortheimbalanceddatasettheimpactofemployingdifferentparametersonCA,TB,TT,TPrate,FPrate,P,R,andFisstronger.Forinstance,J48(2)andSVM(DF)performbetterthanJ48(DF)andSVM(2),bothrespectively.ForNBsetofalgorithms,NB(SD)performsthebestwhileNB(DF)andNB(KE)areequallyworse.

Therefore,answeringRQ3,runningtheclassifiers,i.e.J48,SVMandNB,withdifferentparamentswillaffecttheCA,TB,TT,TPrate,FPrate,P,R,andF,howevertheimpactissignificantlystrongerwhenusingimbalanceddatasetcomparedwiththebalanceddataset.

5. CoNCLUSIoN AND FUTURE WoRK

Inthispaper,weaddressedissuesregardingextremelyimbalanceddatasetswithafocusoninsiderthreatprobleminwhichtheminorityclassbelongstothemaliciousactivitiesandthemajorityclassbelongstothebenignactivities.Forthis,weprovidedacomprehensivereviewandimplementationonthedatapre-processingphasewhichisavitalstepinanydatamining/machinelearningprojects.Thisincludesapopularbalancingtechniqueknownasspreadsubsample.Additionally,wewidelyexplainedsixdemosetupsforthegeneraldatabreachscenarioincluding:datatheft,privilegeduserdatabreach,endpointsecurityprocessing,shadowITrisk,datasecurity,andsensitivefolderbreach.Wealsoinvestigatedseveralparametersforourchosenclassifiersandstudiedtheimpactofconfiguringeachclassifierwiththeseparamentsandcomparedtheirperformancewiththedefaultsetups.Forourexperiments,weidentifiedeightperformancemetricsincluding:ClassificationAccuracy(CA),TimetakentoBuildthemodel(TB),TimetakentoTestthemodel(TT),TruePositive(TP)rate,FalsePositive(FP)rate,Precision(P),Recall(R),andF-measure(F)alongwithafewpopularmachinelearningalgorithms(i.e.J48decisiontree,SVM,NB,andRF).

We then raised three researchquestions to answer in this study:1) the impactofbalancingthedataset during the data pre-processing phase (i.e. using spread subsample technique) on theperformancemetricsmentionedabove,2)theimportantparametersforthechosenclassifiers(i.e.J48decisiontree,SVM,NB,andRF)and3)theimpactofconfiguringourchosenclassifierswiththeidentifiedparamentsintermsofperformancemetricsincomparisonwiththedefaultsetupsforthebalancedandimbalanceddatasets.Answeringourresearchquestions,werealisedthatbalancingthedatasetdidnotimproveCA,TPrate,FPrate,P,R,andFingeneralbutitimprovedthetimetakentobuildthemodelandthetimetakentotestthemodel.Additionally,werealisedthatrunningtheclassifierswithdifferentparametersaffectedtheperformancemetrics(i.e.CA,TB,TT,TP,FP,P,R,andF)howevertheimpactwassignificantlystrongerontheimbalanceddatasetratherthanthebalanceddataset.

Forourfuturework,wewillinvestigateallthepossibleandpopularbalancingtechniquesandwillimplementthemonourextremelyimbalancedinsiderthreatdataset.Wewillthenanalyseandcomparetheirimpactsintermsofperformancemetricsmentionedabove(i.e.CA,TB,TT,TP,FP,P,R,andF).

ACKNoWLEDGMENT

TheauthorswouldwishtoacknowledgethesupportoftheEdinburghNapierUniversityandTheCyberAcademyforfundingthisworkand(ZoneFox,2017)forprovidingthedemoscenarios.

Page 22: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

22

Table 5. Supervised machine learning results; Classification Accuracy (AC), Time taken to Build/Test the model (TB, TT), True Positive (TP), False Positive (FP), Precision (P), Recall (R), and F-measure (F)

Balanced and Imbalanced datasets

Classification Accuracy

TB TT TP FP Precision Recall F-measure

B U B U B U B U

J48(DF) * * * *

*

* * *

J48(2) * * * *

*

* * *

SVM(DF) * * * *

*

* * *

SVM(2) * * * *

*

* * *

NB(DF) * * * *

*

* * *

NB(SD) * * * *

*

* * *

NB(KE) * * * *

*

* * *

Total score 1 6 14 0 5 9 1 20

Table 6. Supervised machine learning results; Accuracy, Time taken to Build/Test the model (TB, TT), True Positive (TP), False Positive (FP), Precision (P), Recall (R), and F-measure (F)

Balanced dataset Accuracy TB TT TP FP Precision Recall F-measure Total score

J48(DF)J48(2)

e * e e e e e 1

e * e e e e e 1

SVM(DF)SVM(2)

e e e e e e e 0

e * e e e e e e 1

NB(DF)NB(SD)NB(KE)

e * * * * * 5

* * 2

e * * * * * 5

Page 23: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

23

Table 7. Supervised machine learning results; Accuracy, Time taken to Build/Test the model (TB, TT), True Positive (TP), False Positive (FP), Precision (P), Recall (R), and F-measure (F)

Imbalanced dataset Accuracy TB TT TP FP Precision Recall F-measure Total score

J48(DF)J48(2)

* * * 3

* * * * * 5

SVM(DF)SVM(2)

* * * * * * * 7

* 1

NB(DF)NB(SD)NB(KE)

* 1

* * * e * * 5

* e 1

Page 24: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

24

REFERENCES

Böse,B.,Avasarala,B.,Tirthapura,S.,Chung,Y.Y.,&Steiner,D. (2017).Detecting insider threatsusingradish:Asystemforreal-timeanomalydetectioninheterogeneousdatastreams.IEEE Systems Journal,11(2),471–482.doi:10.1109/JSYST.2016.2558507

Bradley,N.,Alvarez,M.,Kuhn,J.,&McMillen,D.(2015).IBM2015CyberSecurityIntelligenceIndex.

Buczak, A. L., & Guven, E. (2015). A survey of data mining and machine learning methods for cybersecurity intrusiondetection. IEEE Communications Surveys and Tutorials,18(2),1153–1176.doi:10.1109/COMST.2015.2494502

Cole,E.(2017).Defending Against the Wrong Enemy: 2017 SANS Insider Threat Survey.SANS.

Datapre-processing.(n.d.).Retrievedfromhttp://www.cs.ccsu.edu/~markov/ccsu_courses/datamining-3.html

Enterprise,V.(2017).2017Databreachinvestigationsreport.Zonefox.Retrievedfromhttps://zonefox.com

Galar,M.,Fernandez,A.,Barrenechea,E.,Bustince,H.,&Herrera,F.(2011).Areviewonensemblesfortheclassimbalanceproblem:Bagging-,boosting-,andhybrid-basedapproaches.IEEE Transactions on Systems, Man and Cybernetics. Part C, Applications and Reviews,42(4),463–484.doi:10.1109/TSMCC.2011.2161285

Gavai, G., Sricharan, K., Gunning, D., Hanley, J., Singhal, M., & Rolleston, R. (2015). Supervised andunsupervisedmethodstodetectinsiderthreatfromenterprisesocialandonlineactivitydata.JoWUA,6(4),47–63.

Gheyas,I.A.,&Abdallah,A.E.(2016).Detectionandpredictionofinsiderthreatstocybersecurity:Asystematicliteraturereviewandmeta-analysis.Big Data Analytics,1(1),6.doi:10.1186/s41044-016-0006-0

Greenberg,S.(1988).Usingunix:Collectedtracesof168users.

Hamed,T.,Ernst,J.B.,&Kremer,S.C.(2018).Asurveyandtaxonomyondataandpre-processingtechniquesofintrusiondetectionsystems.InComputer and Network Security Essentials(pp.113–134).Cham:Springer.doi:10.1007/978-3-319-58424-9_7

Hanifah,F.S.,Wijayanto,H.,&Kurnia,A.(2015).Smotebaggingalgorithmforimbalanceddatasetinlogisticregressionanalysis(case:Creditofbankx).Applied Mathematical Sciences,9(138),6857–6865.doi:10.12988/ams.2015.58562

Hashem,Y.,&Takabi,H.,GhasemiGol,M.,&Dantu,R.(2016).InsidetheMindoftheInsider:TowardsInsiderThreatDetectionUsingPsychophysiologicalSignals.Journal of Internet Services and Information Security,6(1),20–36.

Hassabis,D.,&Silver,D.(2017).Alphagozero:Learningfromscratch.deepMind.

Insiders,C.(2018).Insiderthreat-2018report.CATechnologies.Retrievedfromhttps://www.cert.org/insider-threat/tools/

Legg,P.A.,Buckley,O.,Goldsmith,M.,&Creese,S.(2015).Automatedinsiderthreatdetectionsystemusinguserandrole-basedprofileassessment.IEEE Systems Journal,11(2),503–512.doi:10.1109/JSYST.2015.2438442

Lindauer,B.,Glasser,J.,Rosen,M.,&Wallnau,K.C.(2014).GeneratingTestDataforInsiderThreatDetectors.JoWUA,5(2),80-94.

Moradpoor,N.,Brown,M.,&Russell,G.(2017,October).Insiderthreatdetectionusingprincipalcomponentanalysisandself-organisingmap.InProceedings of the 10th International Conference on Security of Information and Networks(pp.274-279).ACM.doi:10.1145/3136825.3136859

Parveen,P.,Evans,J.,Thuraisingham,B.,Hamlen,K.W.,&Khan,L.(2011,October).Insiderthreatdetectionusingstreamminingandgraphmining.InProceedings of the 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third International Conference on Social Computing(pp.1102-1110).IEEE.doi:10.1109/PASSAT/SocialCom.2011.211

Parveen,P.,&Thuraisingham,B.(2012,June).Unsupervisedincrementalsequencelearningforinsiderthreatdetection.InProceedings of the2012 IEEE International Conference on Intelligence and Security Informatics(pp.141-143).IEEEPress.doi:10.1109/ISI.2012.6284271

Page 25: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

25

Pavlov,D.Y.,Gorodilov,A.,&Brunk,C.A.(2010,October).BagBoo:ascalablehybridbagging-the-boostingmodel.InProceedings of the 19th ACM international conference on Information and knowledge management(pp.1897-1900).ACM.

PonemonInstitute.(2016).2016 Cost of Insider Threats: Benchmark Study of Organizations in the United States.

Saarikoski,H.(2011),ExplanationofSMOParameters?Retrievedfromhttps://list.waikato.ac.nz/pipermail/wekalist/2010-December/050570.html

Silver,D.,Huang,A.,Maddison,C.J.,Guez,A.,Sifre,L.,VanDenDriessche,G.,...Dieleman,S.(2016).MasteringthegameofGowithdeepneuralnetworksandtreesearch.Nature,529(7587),484.

Singh,A.,&Patel,S.(2014).ApplyingmodifiedK-nearestneighbortodetectinsiderthreatincollaborativeinformationsystems.Ijirset.Com,3(6),14146–14151.

Torrey,L.&Shavlik,J.(2009).TransferLearning.InHandbookofResearchonMachineLearning.AcademicPress.

Tuor,A.,Kaplan,S.,Hutchinson,B.,Nichols,N.,&Robinson,S.(2017,March).Deeplearningforunsupervisedinsiderthreatdetectioninstructuredcybersecuritydatastreams.InWorkshops at the Thirty-First AAAI Conference on Artificial Intelligence.AAAIPress.

Veeramachaneni,K.,Arnaldo,I.,Korrapati,V.,Bassias,C.,&Li,K.(2016,April).AI^2:trainingabigdatamachinetodefend.InProceedings of the 2016 IEEE 2nd International Conference on Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing (HPSC), and IEEE International Conference on Intelligent Data and Security (IDS)(pp.49-54).IEEEPress.

Weka.ClassJ48.(n.d.).Retrievedfromhttp://weka.sourceforge.net/doc.dev/weka/classifiers/trees/J48.html

Weka. Class NaiveBayes. (n.d.). Retrieved from http://weka.sourceforge.net/doc.dev/weka/classifiers/bayes/NaiveBayes.html

Weka. Class SMO. (n.d.). Retrieved from http://weka.sourceforge.net/doc.stable/weka/classifiers/functions/SMO.html

Weka.ClassSpreadSubsample.RetrievedApril01,2018,fromhttp://weka.sourceforge.net/doc.dev/weka/filters/supervised/instance/SpreadSubsample.html

Weka.(n.d.).Retrievedfromhttps://www.cs.waikato.ac.nz/ml/weka/

Williams,G.,Baxter,R.,He,H.,Hawkins,S.,&Gu,L.(2002,December).AcomparativestudyofRNNforoutlierdetectionindatamining.InProceedings of the 2002 IEEE International Conference on Data Mining, 2002(pp.709-712).IEEEPress.doi:10.1109/ICDM.2002.1184035

Page 26: Insider Threat Detection Using Supervised Machine Learning … · 2020. 3. 10. · 1 Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced

International Journal of Cyber Warfare and TerrorismVolume 10 • Issue 2 • April-June 2020

26

Naghmeh Moradpoor (PhD, MSc, BSc, CEH, FHEA, MBCS) is a lecturer for cybersecurity and networks, the program leader for MSc Advanced Security and Cybercrime, a core member of the Centre for Distributed Computing, Networking and Cybersecurity, and an active member of The Cyber Academy in the School of Computing (SoC) at Edinburgh Napier University. She is a Certified Ethical Hacker (CEH) and an active researcher since 2009. Naghmeh has a research background in machine learning for cybersecurity as well as next generation broadband access networks. This includes Industrial Control System (ICS) and Critical Infrastructure Protection (CIP) security analysis and countermeasures, detection and classification of various cybersecurity attacks for computer networks, data analytics and decision support for cybersecurity, intrusion detection and prevention systems (IDS/IPS) for Internet of Things (IoT), cyberbullying detection and prevention, and forensic similarity hashes. She holds a PhD degree (2013) in Computer Systems and Networks from Ulster University and MSc in Computer Systems and Networks (2009) from Greenwich University, UK. Naghmeh was awarded Outstanding Woman in Cyber at the Scottish Cyber Awards 2017. These Awards are a highly prestigious and flagship event and the first of its kind in Scotland where the best of the best in cybersecurity are being recognised and awarded for their passion, dedication and hard work raising standards in both industry and academia. She has actively participated in leading and organising many cybersecurity events. This includes: Post Graduate Cyber Security (PGCS’16) symposium sponsored by SICSA, Poster Competition for Women in Cyber Security (PCWiCS’17) sponsored by SICSA, Poster Competition in Cyber Security (PCiCS’18) sponsored by SICSA, The First International Conference on Cyber Security and Digital Forensics (CSDF’18) sponsored by SICSA NEXUS, and The First/Second International Workshop on Securing IoT Networks (SITN’17 & ‘18). Naghmeh has been an invited speaker for Women in Science Festival - Dundee (2015), Critical Infrastructure Protection in SEE-IT Summit (Cybersecurity track) - Serbia (2018) sponsored by SICSA and SICSA NEXUS, and Gender Equality & Women in Cybersecurity in SEE-IT Summit - Serbia (2018) sponsored by SICSA and SICSA NEXUS. She has been invited to serve as a program Chair, technical program committee member and a session Chair of major international conferences and an editorial board as well as a reviewer for prestigious journals since 2013. Some of these conferences and journals include: 7th to 11th international conference on Security of Information and Networks (SIN), Special Session on Industrial Automation and Process System’s Security (ISIE-IAPSS’17) in conjunction with IEEE ISIE’17, The First/ Second International Workshop on Secure Smart Societies in Next Generation Networks (SECSOC’17 & ‘18) in conjunction with IEEE iThings’17 & ‘18. Journal of Information Security and Applications (Elsevier), Journal of Computers & Security (Elsevier), Journal of Cyber Security Technology, IEEE Access Journal, and Journal of Optical Switching and Networking (Elsevier). In addition, Naghmeh has been involved in supervising Honours, summer interns, MSc, and PhD students since she achieved her PhD in 2013. Naghmeh is a Director of Studies (DoS) for two collaborative PhDs. The first one is a PhD research work on ICS & CIP security analysis and countermeasures using machine learning with a focus on clean water supply systems. It is an ongoing and successful collaborative research work between Edinburgh Napier University (School of Computing) and Edinburgh Napier University (School of Engineering & Built Environment). Secondly, Naghmeh acts as a DoS for a PhD project on cyberbullying detection using victim’s sentiment analysis in social media. It is also a current collaborative and successful research work between Edinburgh Napier University (School of Computing) and King Abdulaziz University in the Kingdom of Saudi Arabia. Naghmeh and her students (Honours, MSc and PhD) have produced significant and ongoing joint research publications.