DOI: 10.4018/IJCWT.2020040101
International Journal of Cyber Warfare and Terrorism, Volume 10 • Issue 2 • April-June 2020
Copyright © 2020, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
Insider Threat Detection Using Supervised Machine Learning Algorithms on an Extremely Imbalanced Dataset

Naghmeh Moradpoor Sheykhkanloo, Edinburgh Napier University, Edinburgh, UK
Adam Hall, Edinburgh Napier University, Edinburgh, UK
ABSTRACT
An insider threat can take on many forms and fall under different categories, including the malicious insider, the careless/unaware/uneducated/naïve employee, and the third-party contractor. Machine learning techniques have been studied in the published literature as a promising solution for such threats. However, they can be biased and/or inaccurate when the associated dataset is hugely imbalanced. Therefore, this article addresses insider threat detection on an extremely imbalanced dataset, which includes employing a popular balancing technique known as spread subsample. The results show that although balancing the dataset using this technique did not improve performance metrics, it did improve the time taken to build the model and the time taken to test the model. Additionally, the authors realised that running the chosen classifiers with parameters other than the default ones has an impact on both balanced and imbalanced scenarios, but the impact is significantly stronger when using the imbalanced dataset.
KEYWORDS

Data Pre-Processing, Imbalanced Dataset, Insider Threat, Spread Subsample, Supervised Machine Learning
1. INTRODUCTION
Insider attacks present a considerable issue in the cyber-threat landscape, with 40% of organisations labelling the vector as the most damaging attack faced (Cole, 2017) and (Moradpoor, 2017). In 2016, the containment and remediation of reported insider threats cost affected organisations 4 million dollars on average (Ponemon Institute, 2016). In addition, insider threats are extremely common among cyber-incidents; in 2015, 55% of cyber-attacks were insider threat cases (Bradley, 2015). Despite the high cost and frequent occurrence of insider threat attacks, detection and mitigation remain a problem. In 2018, 90% of companies were regarded as vulnerable (Insiders, 2018). A further 38% of companies acknowledge that their insider threat detection and prevention capabilities are not adequate (Cole, 2017). This disparity demonstrates a significant gap between the current advancements in insider threat detection and the requirements of businesses. Given the availability of computational resources, it is feasible to use Machine Learning (ML) techniques to solve problems of larger complexity than has previously been possible. A strong precedent of this can be observed in recent history with the growth of the field of Big Data. This is also exemplified by the historic achievement of Google DeepMind (Hassabis, 2017), creating a machine learning algorithm which masters the immensely complex board game Go (Silver, 2016). Most organisations have the resources to keep logs of employee interactions with technology. By harnessing the data produced through logging, this information could be digested
into a format upon which predictions regarding insider threat cases could be made. Having said this, a data-driven approach to insider threat mitigation is not a new idea; this is a field experiencing an increasing rate of publication. However, vanguard attempts still report more effective models than later cases where machine learning has been applied (Gheyas, 2016).
In machine learning/data mining projects, an imbalanced dataset is a dataset in which the number of observations belonging to one class is considerably lower than those belonging to the other class/classes. A predictive model employing conventional machine learning algorithms could be biased and inaccurate when employed on such datasets. This is purely because machine learning algorithms are designed to improve accuracy by reducing the error in the network. Therefore, they do not consider the class distribution, class proportion, or balance of the classes in their classification process. A predictive machine learning model being biased or inaccurate is particularly problematic in scenarios where the minority class represents the malicious activities and anomaly detection is extremely crucial. This includes scenarios such as: occasional fraudulent transactions in banks, irregular insider threats, rare disease identification, natural disasters such as earthquakes, and periodic malicious activities on critical infrastructures (e.g. infrequent attacks on nuclear power plants or water supply systems in a city). Given the importance of these scenarios, an inaccurate classification by a predictive machine learning model could cost thousands of lives or impose huge costs on individuals and/or organisations. There are several techniques to solve such class imbalance problems using various sampling/non-sampling mechanisms, e.g. oversampling, undersampling and SMOTE, as well as ensemble methods and cost-based techniques. However, the importance of an imbalanced dataset has not been clearly and adequately investigated in the literature, particularly for machine learning-based solutions for insider threat detection.
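To illustrate why plain accuracy misleads on such data, consider a toy sketch (illustrative only; the 1.3% minority ratio mirrors the dataset used later in this paper): a degenerate "classifier" that always predicts the benign majority class scores very high accuracy while detecting no malicious events at all.

```python
# A degenerate classifier that always predicts the majority (benign) class.
# On a dataset with 1.3% malicious events it still scores 98.7% accuracy,
# yet its recall on the malicious class is zero.
labels = ["malicious"] * 13 + ["benign"] * 987  # 1.3% minority class

predictions = ["benign"] * len(labels)  # always predict the majority class

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
malicious_found = sum(
    p == "malicious" for p, y in zip(predictions, labels) if y == "malicious"
)

print(round(accuracy, 3))  # 0.987 -- looks excellent
print(malicious_found)     # 0 -- but every insider threat event is missed
```

This is exactly why the detection-oriented metrics used later (TP rate, FP rate, precision, recall, F-measure) matter alongside classification accuracy.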
Therefore, in this paper, our focus is on an extremely imbalanced dataset of insider threats where the number of events belonging to the malicious class is considerably lower than those belonging to the benign class. We use spread subsample (Weka.ClassSpreadSubsample, 2018) as a popular balancing technique. The filter allows you to specify the maximum spread between the rarest class and the most common one. For example, given an imbalanced dataset, you may indicate that there should be at most a 3:1 difference in class frequency. For this, the original dataset first fits in memory, then a random subsample of the dataset is produced given the identified maximum spread between the two classes. In this paper, we specify the maximum "spread" between the rarest class (i.e. malicious events) and the most common class (i.e. benign events) as "1", representing a uniform distribution between the two classes. This allows us to keep all the malicious events plus an equal number of randomly selected benign events, which results in a uniform distribution of malicious and benign events.
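The spread subsample step described above can be sketched in plain Python (a minimal re-implementation for illustration only; the real experiments use Weka's SpreadSubsample filter, and the function name and record format here are our own assumptions):

```python
import random

def spread_subsample(records, label_of, max_spread=1.0, seed=42):
    """Randomly undersample the larger classes so that the ratio between
    the most common and the rarest class is at most max_spread.
    With max_spread=1 the result is a uniform (1:1) class distribution."""
    by_class = {}
    for rec in records:
        by_class.setdefault(label_of(rec), []).append(rec)
    rarest = min(len(group) for group in by_class.values())
    cap = int(rarest * max_spread)  # maximum size allowed per class
    rng = random.Random(seed)
    sampled = []
    for group in by_class.values():
        if len(group) > cap:
            sampled.extend(rng.sample(group, cap))  # random subset of majority
        else:
            sampled.extend(group)  # minority classes are kept in full
    return sampled

# 5 malicious vs 200 benign events; max_spread=1 keeps all 5 malicious
# events plus 5 randomly chosen benign events.
events = [("malicious", i) for i in range(5)] + [("benign", i) for i in range(200)]
balanced = spread_subsample(events, label_of=lambda e: e[0], max_spread=1)
print(len(balanced))  # 10
```

Note that, as in the paper, all minority-class (malicious) events survive the filter; only the benign majority is randomly thinned.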
In this paper, we raise the following specific research questions:

RQ1: Does balancing the dataset during the pre-processing phase improve metrics such as: Classification Accuracy (CA), Time taken to Build the model (TB), Time taken to Test the model (TT), True Positive (TP) rate, False Positive (FP) rate, Precision (P), Recall (R), and F-measure (F) in comparison with the same metrics on an imbalanced dataset?

RQ2: What are the important parameters for each classifier whose configuration could have an impact on the classification results?

RQ3: Does changing these parameters to different values improve metrics such as: Classification Accuracy (CA), Time taken to Build the model (TB), Time taken to Test the model (TT), True Positive (TP) rate, False Positive (FP) rate, Precision (P), Recall (R), and F-measure (F) in comparison with running the classifiers with the default parameters?
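The detection-oriented metrics named in these research questions all derive from the confusion matrix. As a quick reference, a minimal sketch (our own helper for illustration, not part of the paper's Weka workflow) computing them from true and predicted label sequences:

```python
def binary_metrics(y_true, y_pred, positive="malicious"):
    """Compute CA, TP rate (recall), FP rate, precision and F-measure
    for the chosen positive class from two label sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    tp_rate = tp / (tp + fn) if tp + fn else 0.0  # a.k.a. recall
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * tp_rate / (precision + tp_rate)
                 if precision + tp_rate else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"CA": accuracy, "TP rate": tp_rate, "FP rate": fp_rate,
            "P": precision, "R": tp_rate, "F": f_measure}

y_true = ["malicious", "malicious", "benign", "benign", "benign"]
y_pred = ["malicious", "benign", "benign", "malicious", "benign"]
m = binary_metrics(y_true, y_pred)
print(m["P"], m["R"])  # 0.5 0.5
```

TB and TT, by contrast, are simple wall-clock measurements of the training and testing phases rather than confusion-matrix quantities.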
Additionally, this paper provides a comprehensive explanation and investigation of the data pre-processing stage, which is a crucial part of any data mining/machine learning project. For this, a clear step-by-step description is provided by the authors. Furthermore, we provide six comprehensive
data breach scenarios with full justifications and explanations, which include: data theft, privileged user data breach, endpoint security, shadow IT risk, data security, and sensitive folders protection.
The remainder of this paper is organised as follows. In Section 2, we review the related work based on our research questions, followed by our data analysis in Section 3. This is trailed by implementation, results and analysis in Section 4, conclusion and future work in Section 5, and acknowledgment as well as references.
2. RELATED WORK
Insider threat problems have been studied widely in the existing literature. This covers extensive categories such as: traitor and masquerader, real-time and non-real-time, host-based, network-based and hybrid (host-based plus network-based), user profiling, use of different datasets, as well as machine learning-based solutions.
Given that this paper focuses on insider threat detection using supervised machine learning algorithms on an imbalanced dataset, in this section we address the existing work where supervised, unsupervised, and semi-supervised approaches have been employed. For completeness, we first briefly explain four popular categories of Machine Learning (ML) techniques: supervised, unsupervised, semi-supervised and transfer learning, as follows.
Supervised ML techniques, such as Nearest Neighbour methods (e.g. k-NN), Neural Networks (NNs) and Support Vector Machines (SVMs), build a data classification model from labelled training data where for every single input (e.g. x1, x2) there is a corresponding target (e.g. y1, y2). It is a classification problem if the targets are represented as classes and, alternatively, a regression problem if the targets are continuous. As the name suggests, there is a strong element of "supervision" in supervised learning. Lengthy training time, high training costs and the requirement for large amounts of well-balanced data are some of the drawbacks of this methodology.
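As a concrete illustration of the supervised setting, a minimal k-nearest-neighbour classifier (a toy sketch in plain Python, not the Weka implementation used later in this paper) learns directly from labelled (input, target) pairs:

```python
def knn_predict(train, x, k=1):
    """Classify point x by majority vote among its k nearest labelled
    training points (squared Euclidean distance, 2-D inputs)."""
    dist = lambda a, b: (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    neighbours = sorted(train, key=lambda pair: dist(pair[0], x))[:k]
    votes = [label for _, label in neighbours]
    return max(set(votes), key=votes.count)

# Labelled training data: each input (x1, x2) has a corresponding target.
train = [((0.0, 0.0), "benign"), ((0.1, 0.2), "benign"),
         ((5.0, 5.0), "malicious"), ((5.2, 4.8), "malicious")]

print(knn_predict(train, (0.2, 0.1)))  # benign
print(knn_predict(train, (4.9, 5.1)))  # malicious
```

Note how the whole labelled training set must be stored and scanned: with large, well-balanced training sets this lazy approach becomes one source of the training/testing cost mentioned above.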
In unsupervised ML techniques, such as Gaussian Mixture models, Self-Organising Maps (SOMs) and Graph-Based Anomaly Detection (GBAD), the data is unlabelled. This means that for each single input (e.g. x1, x2) there is no target output nor reward from its environment. The goal of unsupervised learning is to find hidden structure or relations among unlabelled data for clustering or compression purposes. However, unlike supervised learning, since there is no element of "supervision" and there is no label, unsupervised learning cannot provide an evaluation of the action. Additionally, neither pre-determining the number of classes nor getting very specific about the definition of the classes is possible.
Semi-supervised learning techniques, such as Self-Training, Generative Models, Graph-Based Algorithms and Semi-Supervised Support Vector Machines (S3VMs), fall between supervised and unsupervised learning by making use of both labelled data (typically a small amount) and unlabelled data (typically a large amount) for training. Therefore, the goal is to overcome one problem from each category: not having enough labelled data in supervised learning and having no classification in unsupervised learning. Hence, given that generating labelled data is often costly as it needs a lot of resources such as manpower and computation while unlabelled data is generally not, semi-supervised learning sounds like a powerful approach, particularly for big data. However, it suffers from some limitations. For instance, when the data patterns change in a semi-supervised approach, old assumptions about labelled data may corrupt the new unlabelled data.
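Self-training, the simplest of these techniques, can be sketched as follows (an illustrative toy in plain Python, assuming a 1-D nearest-centroid classifier; not drawn from the paper's experiments): a model fitted on the small labelled set repeatedly pseudo-labels the unlabelled point it is most confident about and retrains on the grown set.

```python
def centroid_classify(labelled, x):
    """Predict by nearest class centroid; return (label, margin), where a
    larger margin to the runner-up class means higher confidence."""
    cents = {}
    for v, lab in labelled:
        cents.setdefault(lab, []).append(v)
    cents = {lab: sum(vs) / len(vs) for lab, vs in cents.items()}
    ranked = sorted(cents, key=lambda lab: abs(cents[lab] - x))
    best = ranked[0]
    margin = abs(cents[ranked[1]] - x) - abs(cents[best] - x)
    return best, margin

def self_train(labelled, unlabelled, rounds=3):
    """Each round, pseudo-label the single unlabelled point the current
    model is most confident about and add it to the training set."""
    labelled, unlabelled = list(labelled), list(unlabelled)
    for _ in range(min(rounds, len(unlabelled))):
        scored = [(centroid_classify(labelled, x), x) for x in unlabelled]
        (lab, _), x = max(scored, key=lambda item: item[0][1])
        labelled.append((x, lab))
        unlabelled.remove(x)
    return labelled

seed = [(0.0, "benign"), (10.0, "malicious")]  # small labelled set
pool = [0.5, 1.0, 9.5, 5.2]                    # larger unlabelled set
grown = self_train(seed, pool)
print(len(grown))  # 5: the 2 seed labels plus 3 pseudo-labels
```

The limitation noted above is visible here: if the data pattern drifts, the centroids fitted from old labelled points will confidently mislabel the new unlabelled points.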
Transfer learning is a machine learning technique with an extra source of knowledge in addition to the traditional training data. Some work in transfer learning includes extending popular classification and probability distribution systems such as artificial neural networks, Bayesian networks and Markov Logic Networks. Three common measures by which transfer learning might improve learning comprise: 1) performance improvement in the target task with transfer learning in addition to the traditional learning from data, compared to the non-transfer learning scenario, 2) total time to fully learn the target task with and without the transfer learning process, and 3) initial performance in
the target task learning using only the transferred knowledge, compared with the scenario where only traditional training data is used (Torrey, 2009).
However, avoiding negative transfer (where applying transfer methods decreases performance), automating task mapping to translate between task representations, transferring between more diverse tasks, and performing transfer in more complex testbeds are the challenges facing transfer learning techniques.
A summary of the recent work on insider threat detection using machine learning techniques is presented in Table 1. We classify them based on different metrics such as ML algorithms, e.g. Supervised (S) (Singh, 2014) and (Gavai, 2015), Unsupervised (U) (Parveen, 2011), (Tuor, 2017), (Böse, 2017) and (Parveen, 2012), or Semi-supervised (Se) (Gavai, 2015) and (Veeramachaneni, 2016). The addressed work in Table 1 is also classified based on the source of data, e.g. user logs or available insider threat datasets, setup environment, experimental results and performance metrics, e.g. False Positive (FP) and False Negative (FN) rates. The specifications related to our work in this paper are also presented in Table 1.
The authors in (Singh, 2014) employed a supervised Modified k-Nearest Neighbour (k-NN) with the Community Anomaly Detection System (CADS) and metaCADS on non-real-time user access logs. CADS and metaCADS are frameworks that do not depend on any prior knowledge, such as user roles or access control mechanisms, to detect anomalies. While CADS includes the two steps of pattern extraction and anomaly detection, metaCADS contains two more steps of network construction assignment and complex category interface. They claimed that the Modified k-NN beats k-NN. However, their presented results do not show this and are rather unclear. No performance metrics, e.g. FP or FN, are presented either. Additionally, some examples of user access logs, such as the user–subject relationship and the subject–category relationship, would have made their work clearer.
The authors in (Hashem, 2016) used a supervised SVM on a combination of real-time human biological signal patterns, such as ElectroEncephaloGraphy (EEG) and ElectroCardioGram (ECG), with ten volunteers and three scenarios that included malicious and benign activities. They captured 86% average detection accuracy with EEG, which increased by 5% after adding ECG signals. However, a justification for not including ElectroMyoGraphy (EMG) signals in addition to EEG and ECG signals has not been given. Additionally, their results from Principal Component Analysis (PCA) have not been clearly detailed. Given that their work relies on users wearing the headset and sensors all the time to measure the signals continuously, the health risk of wearing them is also not clear. Additionally, not only is wearing a headset and sensors constantly inconvenient for a user, but an insider is less likely to wear them effectively knowing that his/her activities are going to be monitored continuously.
The authors in (Tuor, 2017) employed unsupervised Deep Neural Networks (DNNs) and Recurrent Neural Networks (RNNs) on the Computer Emergency Response Team (CERT) Insider Threat Dataset v6.2 (CERT) with TensorFlow. Given that their work operates on raw system logs from the CERT insider threat dataset, they considered it equivalent to raw streaming data, thus calling their work online and real-time. They captured Cumulative Recall (CR-k) for daily budgets of 400 and 1000, which showed that DNNs and RNNs outperform Principal Component Analysis (PCA), SVM and Isolation Forest (IF). Given that their work models only normal behaviour and uses anomaly as an indicator of potential malicious activities, the FP rate has not been clearly addressed. The CERT dataset is based on the CERT model, which provides advantages such as sufficient study of insiders' motivation and the psychological or behavioural aspects of their crime or offense in the field of insider threat. However, CERT has some drawbacks. For example, there is no insider threat classification for the raw system logs in this dataset. The CERT model is also complex, even after the latest improvements. This means that to use this model a group of experts is always needed to perform prediction.
The authors in (Parveen, 2012) used unsupervised incremental sequence learning combined with compression learning for insider threat detection. They worked on 168 real benign trace files of users at the University of Calgary in addition to their 25 malicious artificial files. The trace files are Unix C shell logs collected from four groups of people and categorised based on their programming skills.
Table 1. Related work on insider threat detection using machine learning techniques

Author, year | ML: S/U/Se | ML algorithms | Source of data | Environment, configurations, framework | Experimental results | Captured performance (e.g. FP and FN) | Real time?
(Singh, 2014) | S | Modified k-Nearest Neighbour (k-NN) | User access logs | Community Anomaly Detection System (CADS) & metaCADS frameworks | Modified k-NN beats k-NN | N/A | No
(Hashem, 2016) | S | Support Vector Machine (SVM) | Ten volunteers equipped with an Emotiv EPOC headset and OpenBCI sensors | Electroencephalography (EEG) and electrocardiogram (ECG) signals | 91% average detection accuracy | Presented: Accuracy, Precision, Recall, F-measure, and Error rate | Yes
(Parveen, 2011) | U | Ensemble Graph-Based Anomaly Detection (E-GBAD) with stream mining and graph mining | 1998 Lincoln Laboratory Intrusion Detection Dataset | System logs from the dataset, each specified by a token with eight attributes | The E-GBAD approach is more effective than the traditional single-model approach | 0% FN and a lower FP rate than single-model | No
(Tuor, 2017) | U | Deep Neural Networks (DNNs) & Recurrent Neural Networks (RNNs) | Computer Emergency Response Team (CERT) Insider Threat Dataset v6.2 | Used TensorFlow; tuned the data based on random hyper-parameter search, e.g. a learning rate of 0.01 for both DNN and RNN | DNNs and RNNs outperform Principal Component Analysis (PCA), SVM and Isolation Forest (IF) | Presented: Cumulative Recall (CR-k) for daily budgets of 400 and 1000 | Yes
(Parveen, 2012) | U | Compression and incremental learning | User trace files of the Unix C shell at the University of Calgary in addition to their own files | 168 real benign files collected from four groups of people in addition to their 25 artificial malicious files | Their approach performed well with a limited number of FPs compared to the static approach | Presented: Average FP and average TP rates | No
(Bose, 2017) | U | Unsupervised k-NN and k-dimensional (k-d) tree | Generated by the DARPA ADAMS program | Real-time Anomaly Detection In Streaming Heterogeneity (RADISH) system | The k-d tree is much faster than k-NN; RADISH performance and accuracy presented | N/A | Yes
(Veeramachaneni, 2016) | Se | Matrix Decomposition, Neural Networks, and Joint Probability | A three-month real log dataset totalling 3.6 billion lines generated by an enterprise platform | Components: an analytics platform, an unsupervised rare-event modeller, a feedback platform, and a supervised learning module | The system learns to defend against unseen attacks | Presented: TP and FP rates; the TP rate improved and FP rates reduced as time progressed | Nearly real time
(Gavai, 2015) | Se | Unsupervised Modified Isolation Forest + supervised Random Forests | A real-world dataset named 'Vegas' for benign activities in addition to artificially injected insider threat events | An unsupervised and a supervised approach: to identify abnormality and to label quitters, respectively | Unsupervised: ROC score of 0.77; supervised: accuracy of 73.4% | Presented: ROC, Recall and Precision | No
This work, 2019 | S | Supervised machine learning algorithms: J48, SVM, Naïve Bayes (NB), and Random Forest (RF) | Six demo scenarios obtained from (ZoneFox, 2017) | Captured eight features for five users over four consecutive days to construct five user profiles | Balancing the dataset does not improve performance metrics; the impact of using different parameters for classifiers is significantly stronger on the imbalanced dataset | Presented: Accuracy, Time taken to Build the model (TB), Time taken to Test the model (TT), True Positive (TP) rate, False Positive (FP) rate, Precision (P), Recall (R), and F-measure (F) | No
They captured FP and TP average rates which show that their work performed well compared to the static approach. However, the malicious anomalous data is artificial and crafted by the authors. Additionally, this public dataset (Greenberg, 1988) is now dated.
The authors in (Veeramachaneni, 2016) proposed a nearly real-time supervised and unsupervised system with an analyst-in-the-loop against unseen attacks. The unsupervised part of their system learns a model that can identify extreme and rare events in data. The rare events are then presented to the system analyst, who labels them as either malicious or benign. In the end, the labelled data is provided to the supervised learning part of their system, which produces a model that can predict whether there will be attacks in the following days. In the unsupervised part of their model, they employed Matrix Decomposition, Neural Networks and Joint Probability to detect rare events, each of which generates a score indicating how far a certain event is from the others. They used a total of 3.6 billion real log lines produced within 3 months and showed that the TP rate improves by almost three times and the FP rate reduces by five times as time progresses.
The authors in (Gavai, 2015) employed unsupervised Modified Isolation Forest (IF) and supervised Random Forest algorithms to discover insider threats based on employees' social and online activity data. Their approach includes generating features based on email content, work practice and online activity, which are then used in the unsupervised part of the proposal to identify suspicious behaviour. These features are also used in the supervised part of their proposal, in conjunction with quitting labels, to develop a classifier. They used a real-world dataset named Vegas for benign activities in addition to their own artificially injected insider threat events. The events include patterns of email communication, web browsing, email frequency, and file and machine access. They obtained a ROC score of 0.77 for the unsupervised part of their work and a classification accuracy of 73.4% for the supervised part, showing that their approach is successful in detecting insider threat activity as well as quitting activity. However, the malicious anomalous data is artificial and crafted by the authors.
The work presented in this paper differs from the related works described above, as we focus on employing machine learning techniques on an extremely imbalanced dataset in which the malicious class is in the minority and the benign class is in the majority (i.e. the malicious events are less than 1.3% of the total dataset). Given that the importance of the class imbalance problem has not been addressed in the existing literature for the insider threat, our methodology for exploring the impact of data balancing is derived from (Galar, 2011), (Hanifah, 2015) & (Pavlov, 2010), where different balancing techniques have been discussed, including undersampling, oversampling and hybrid-based approaches, from which we employ a popular undersampling technique called spread subsample. Our work in this paper is built upon our previous work (Moradpoor, 2017), where we used a Self-Organising Map (SOM) in conjunction with Principal Component Analysis (PCA) for insider threat detection on the same imbalanced dataset that we use in this paper. However, in our previous work, the dataset was unlabelled. Given that it is less likely to have a balanced dataset in a real-world scenario, knowing how to deal with an imbalanced dataset is not only crucial but also more realistic.
For the classifiers, we consider the J48 decision tree, Support Vector Machine (SVM), Naïve Bayes (NB), and Random Forest (RF) algorithms, and for performance metrics we study Classification Accuracy (CA), Time taken to Build the model (TB), Time taken to Test the model (TT), True Positive (TP) rate, False Positive (FP) rate, Precision (P), Recall (R), and F-measure (F). These performance metrics are mainly derived from (Buczak, 2015), where a comprehensive survey of data mining and machine learning techniques for cyber security intrusion detection has been fully discussed.
Additionally, we provide a comprehensive review and explanation of the data pre-processing phase. This phase, which is missing from the existing work on ML-based insider threat detection, e.g. from (Singh, 2014), (Parveen, 2011), (Tuor, 2017), (Bose, 2017), (Hashem, 2016), (Gavai, 2015), (Parveen, 2012) and (Veeramachaneni, 2016), is an extremely important part of any machine learning project, given that it transforms a raw or original dataset into an understandable and meaningful format. This part of our work is encouraged by (Hamed, 2018), where a taxonomy of data pre-processing techniques for intrusion detection systems has been discussed. In this paper, the data pre-processing
phase includes outlier identification from the data cleaning stage, data decomposition and data conversion from the data transformation stage, and data balancing from the data reduction stage. The comprehensive explanation of the data pre-processing phase on an extremely imbalanced dataset in this paper can be used as a guide and can be employed in any data mining/machine learning project.
Furthermore, unlike (Singh, 2014), (Parveen, 2011), (Tuor, 2017), (Bose, 2017), (Hashem, 2016), (Gavai, 2015), (Parveen, 2012) and (Veeramachaneni, 2016), this paper contains six comprehensive demo setups which cover all the serious data breach scenarios: data theft, privileged user data breach, endpoint security breach, shadow IT risk, data security, and sensitive folders breach. Besides, to make the scenarios more realistic, it includes different arrangements for three groups of staff within an organisation: permanent staff (x3 scenarios), temporary staff (x1 scenario) and third-party staff (x2 scenarios). Our methodology for defining, justifying, and developing the six scenarios above is inspired by (Lindauer, 2014), where generating test data for insider threat detectors has been described. The work in (Legg, 2015) also inspired our work in terms of using user and role-based profiles for automated insider threat detection.
3. ANALYSING THE DATA
In this paper, six demo scenarios have been identified which help to shape the dataset used in our experiments. They have been developed by (ZoneFox, 2017), a market leader in User Behaviour Analytics, to ensure familiarisation with the insider threats in an organisation. In this section, we discuss: the six demo scenarios, the original dataset that we used for our experiments, and the data pre-processing stage (outlier identification, data decomposition and data conversion).
3.1 Demo Scenarios

Our six demo scenarios include different setups for three groups of staff: permanent, temporary and third party. This comprises three scenarios for permanent staff (Data Theft, Endpoint Security Processing and Shadow IT Risk), one scenario for temporary staff (Privileged User Data Breach) and two scenarios for a third-party member of staff (Data Security and Protect Sensitive Folders). To clarify, a permanent member of staff is a long-term employee who works, for instance, in the company's engineering or sales team, e.g. Charlotte, Rebecca and Laura. A temporary member of staff is a short-term employee who has been employed for the busy period, e.g. Timmy. A third-party member of staff is an individual who has been employed to work with one of the client systems, e.g. Colin. The names are all fictional and have been used for the sake of making our demo scenarios clear. All six scenarios are presented in Table 2 and defined as follows.
3.1.1 Demo Scenario 1: Data Theft

Data theft from an organisation commonly includes stealing sensitive information related to its staff, clients or business, e.g. usernames, passwords, credit card information, medical records or business secrets. In Demo Scenario 1, we focus on the data theft threat from employees within a given organisation by following the Insider Threat Kill Chain (ITKC). The ITKC identifies human resources as the greatest risk to organisations and discusses how members of staff within organisations can work together to help avert these risks before they become a problem. To achieve this, we consider four groups of people within the organisation: 1) people who intend to leave their job and have already handed in their notice, 2) people who look around file servers repeatedly, 3) people who download backup software onto their desktop computers and 4) people who copy zip files to their removable devices. For example, as shown in Table 2, Demo Scenario 1, Charlotte, who is a permanent member of staff working in the company's engineering team, backs up files to a removable disk drive.
3.1.2 Demo Scenario 2: Privileged User Data Breach

Intruders need privileged access, for instance an admin's username and password, to carry out malicious activities and proceed to the next phase of an attack. Mischievous activities include installing malicious software (e.g. malware, ransomware or a backdoor), stealing sensitive information (e.g. usernames and passwords), or even disabling hardware and/or vital software on a victim's computer (e.g. anti-malware or anti-spyware).
That is why privileged user accounts are in such high demand by intruders. In fact, there can be many shared privileged accounts within an organisation with no management knowledge of where all those accounts reside or who has access to them. In the Verizon Data Breach Investigations Report for 2016 (Enterprise, 2017), 63% of all breaches were due to using weak or common passwords, while 53% of them were down to the misuse of privileged accounts. Privileged accounts include: local admin accounts, privileged user accounts, domain admin accounts, emergency accounts, service accounts and application accounts. An efficient privilege management system would automatically identify these accounts and bring them under a management system umbrella. In Demo Scenario 2, pinpointing a privileged user data breach is done by going through the following steps: 1) identifying privileged user accounts, 2) identifying sensitive files that can only be accessed using those accounts, and 3) monitoring the access to those sensitive files. For example, as shown in Table 2, Demo Scenario 2, Timmy, who is a temporary member of staff with access to privileged user accounts, accesses a file (i.e. the minutes of the company's February meeting) on the network file system that he does not need access to.
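The three steps above reduce to a simple monitoring rule, which can be sketched in plain Python (an illustrative toy; the account names, file paths and event format are our own assumptions, not taken from the dataset):

```python
# Steps 1 and 2: the sensitive files identified up front and, for each,
# the accounts actually authorised to touch them.
authorised = {"nfs://fileshare2/board-minutes.txt": {"admin"}}

def flag_breaches(events):
    """Step 3: flag every access to a monitored sensitive file made by
    an account that is not authorised for that file."""
    return [(user, path) for user, path in events
            if path in authorised and user not in authorised[path]]

events = [("admin", "nfs://fileshare2/board-minutes.txt"),  # legitimate
          ("timmy", "nfs://fileshare2/board-minutes.txt"),  # breach
          ("timmy", "nfs://fileshare1/readme.txt")]         # unmonitored
print(flag_breaches(events))
# [('timmy', 'nfs://fileshare2/board-minutes.txt')]
```

In practice a privilege management system maintains the authorised-access map automatically; the machine learning approach in this paper instead learns such access patterns from logged events.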
3.1.3 Demo Scenario 3: Endpoint Security Processing

In an organisation, endpoints include desktop computers, mobile devices (e.g. laptops, smartphones and tablets), printers, scanners or even barcode readers. Endpoint security management is a network security approach that requires endpoint devices to follow the security policy of a given organisation. This needs to be met before as well as after granting access to network resources. In Demo Scenario 3, we consider endpoint security processing by monitoring the staff activities on desktop computers across the organisation, for instance, turning off or disabling a system's anti-malware, e.g. popup
Table 2. Six demo scenarios of users

Demo scenario | Staff type | Staff name | Application | Resource | Description
Demo scenario 1: Data Theft | Permanent | Charlotte | bbackup.exe | rm:\\e:\\myusb\\backup.zip | Charlotte backs up files to a removable disk drive.
Demo scenario 2: Privileged User Data Breach | Temporary | Timmy | N/A | nfs:\\......\\fileshare2\\boardminutes\\minutes-february.txt | Timmy accesses a file on the network file system that he does not need access to.
Demo scenario 3: Endpoint Security Processing | Permanent | Laura | Savservice.exe (AV software) | N/A | Laura deactivates the anti-virus software on her computer.
Demo scenario 4: Shadow IT Risk | Permanent | Rebecca | dropbox.exe, Skype.exe | c:\\users\\rebecca\\dropbox\\plan2.doc; c:\\users\\rebecca\\dropbox\\plan1.doc; nfs:\\....\fileshare1\\engineeringplans\\plan1.doc; nfs:\\....\fileshare1\\engineeringplans\\plan2.doc | Rebecca uses Dropbox to perform unauthorised backups to the Cloud and Skype to perform unapproved uploads from the network file system to the Cloud.
Demo scenario 5: Data Security | Third-party | Colin | N/A | nfs:\\......\fileshare1\\engineeringplans\\plan1.doc; rm:\\f:\\copyto\\remdrive\\ipdata.txt | Colin accesses a file on the network file system that he is not supposed to access and copies it to a removable disk drive.
Demo scenario 6: Protect Sensitive Folders | Third-party | Colin | N/A | nfs:\.....\fileshare1\PatientData\AliceBrownMedicalRecord.txt | Colin accesses sensitive information (Alice Brown's medical record) on the network file system that he should have no need to access.
blockers, anti-spyware, anti-spam, host-based firewalls and anti-viruses. For example, as shown in Table 2, Demo Scenario 3, Laura, who is a permanent member of staff working in the company's sales team, deactivates the anti-virus on her desktop computer.
3.1.4 Demo Scenario 4: Shadow IT Risk

Shadow IT refers to software or applications purchased, downloaded, installed, used or managed outside of, or without the knowledge of, an organisation's IT department. They can be defined as the IT assets that are invisible to an organisation's IT department. Shadow IT has grown exponentially in recent years. This is down to the good quality of applications in the Cloud, mobile technology growth and the rapid development of Software as a Service (SaaS), such as Dropbox, Cisco WebEx, Google Apps, Salesforce, Skype, and Microsoft Office 365. SaaS, which is also known as Cloud software, on-demand software or hosted software, is a way of delivering applications over the Internet. A given SaaS may or may not offer strong security protections, e.g. identity management, authentication, access control, secure backup practices, data masking or data encryption. Therefore, it can expose an organisation and/or its affiliates to data loss risk and security-related threats such as insider threats. In Demo Scenario 4, we consider the Shadow IT risk imposed by employees within the organisation using unknown/unauthorised SaaS, e.g. Dropbox or Skype. For example, as shown in Table 2, Demo Scenario 4, Rebecca, who is a permanent member of staff working in the company's engineering team, installs Dropbox to perform unauthorised backups and Skype to accomplish unapproved uploads to the Cloud.
3.1.5 Demo Scenario 5: Data Security

Data security refers to protective measures applied to prevent unauthorised access to, and corruption of, resources such as endpoint devices (e.g. computers, printers and tablets), databases, websites, and computer files. Backups, data encryption, authentication and data masking are the commonly used data security techniques. For example, data masking is a method of protecting the actual data by creating a structurally similar but fake version of an organisation's data for purposes such as training or software/algorithm testing. In Demo Scenario 5, we consider monitoring resources that a third-party member of staff is not supposed to access or copy, e.g. accessing a sensitive document or copying intellectual property files to a removable disk. For example, as shown in Table 2, Demo Scenario 5, Colin, who is a third-party member of staff, accesses and copies intellectual property data to a removable disk drive, which he is not supposed to do.
3.1.6 Demo Scenario 6: Protect Sensitive Folders

In general, data can be categorised as either public, restricted or private. Public data, such as press releases, course information or research publications, can be available to anyone with little or no controls to protect its confidentiality. Restricted data, such as credit card numbers, passwords or personal medical information, are those for which unauthorised disclosure, alteration or destruction could cause a significant level of risk to an organisation or its affiliates. Private data, such as home address, birth date, gender, or religious or sexual orientation, are those for which unauthorised disclosure, alteration or destruction could result in a moderate level of risk to an organisation or its affiliates. In Demo Scenario 6, we focus on protecting folders with restricted data, e.g. folders that carry medical records for individuals maintained by the organisation. For example, as shown in Table 2, Demo Scenario 6, Colin, who is a third-party member of staff (i.e. a contractor), accesses restricted data (i.e. one of the staff's medical records) which he should have no need to access.
3.2 Original Dataset
In this section, we explain the dataset that we used for our experiments. As mentioned before, the original dataset has been given by (ZoneFox, 2017) and covers all the six scenarios that we defined
in the previous section. It includes Charlotte, Rebecca, Laura, Timmy and Colin's user profiles captured on four consecutive days. The dataset also contains the administrator's network activities. As discussed before, each user is either a permanent member of staff (e.g. Charlotte, Rebecca, Laura or Administrator), a temporary member of staff (e.g. Timmy) or a third-party member of staff (e.g. Colin). The original dataset is in .CSV format and contains 2643 lines of raw data which include six features: Date-Time, machine_ID, user_ID, application, action and resource. Each line of the dataset identifies an action done by one of the users in the user pool (i.e. Charlotte, Rebecca, Laura, Timmy, Colin or Administrator). For example, one line of the dataset specifies that on 2016-02-23 at 16:30:27 (Date-Time), employee A (user_ID) used computer B (machine_ID) to run backup.exe (application) and write (action) the backup into path D (resource).
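As an illustration, a raw line in this six-feature layout can be read straight into a feature dictionary. This is a minimal sketch: the row below is hypothetical (the placeholder values employee A, computer B and path D from the example above), not a record from the actual ZoneFox dataset.

```python
import csv
import io

# Hypothetical event in the six-feature layout; the values are placeholders,
# not records from the actual dataset.
raw = "2016-02-23T16:30:27Z,computerB,employeeA,backup.exe,write,pathD"

FIELDS = ["Date-Time", "machine_ID", "user_ID", "application", "action", "resource"]
event = dict(zip(FIELDS, next(csv.reader(io.StringIO(raw)))))

print(event["user_ID"], "ran", event["application"])  # employeeA ran backup.exe
```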
3.3 Data Pre-Processing
In this section, we explain the pre-processing phase employed on our original dataset given by (ZoneFox, 2017). This phase is an important part of any machine learning project given that it transfers a raw or original dataset into an understandable and meaningful format. Raw data is often incomplete, noisy, inconsistent and/or lacking certain behaviours, attributes or styles, and is likely to generate or comprise errors if fed intact into machine learning algorithms. Data pre-processing is a proven method of resolving these issues. It comprises stages such as data cleaning, data integration, data transformation, data reduction and data discretisation, each containing various tasks. For example, data transformation comprises tasks such as normalisation and aggregation, while data cleaning includes filling in missing data, outlier identification, outlier removal and inconsistency resolution.
In this paper, the data pre-processing phase includes four tasks: outlier identification from the data cleaning stage, data decomposition and data conversion from the data transformation stage, and data balancing from the data reduction stage, as follows.
3.4 Data Cleaning
In any data science project, data cleaning is a critical part of the data pre-processing phase. This includes tasks such as filling in missing values, identifying outliers, smoothing out noisy data or correcting inconsistent data. For example, for filling in missing values, a learning algorithm such as Bayes or decision tree can be used to predict them. Additionally, domain knowledge or expert decision can be employed to correct inconsistent data. In this paper, we used the outlier identification task from the data cleaning stage as follows.
3.4.1 Outlier Identification
In data science, "outliers" are values that "lie outside" the other values, e.g. in the scores {3, 22, 25, 23, 29, 33, 85}, 3 and 85 are outliers given that they are far away from the main group of data: {22, 25, 23, 29, 33}. In data mining, outlier identification, which is also known as anomaly detection, refers to the identification of items, events or behaviours which do not follow an expected pattern. It is an observation of the data that deviates from other observations so much that it awakens suspicions that it was generated by a different and/or unusual mechanism (Williams, 2002). In the data pre-processing phase, outlier identification is a part of the data cleaning stage and includes tasks such as binning, clustering and regression. In this paper, outlier means insider threat, and outlier identification refers to detection of digital tasks performed by employees which they are not supposed to do or should have no need to do. In order to perform outlier identification on our original dataset, we went through each event manually and marked it as either 1, representing an outlier event, or 0, representing a non-outlier event. This was decided by considering the six demo scenarios explained in the previous section (i.e. Data Theft, Endpoint Security Processing, Shadow IT Risk, Privileged User Data Breach, Data Security, Sensitive Folders Protection) along with the individual's role within the organisation. The original dataset is in .CSV format and includes 2643 lines of user activities for five individuals (i.e. Charlotte, Rebecca, Laura, Timmy and Colin). The outlier identification task
resulted in identifying a total of 33 outlier activities out of 2643 events within the original dataset. As depicted in Table 3, total outlier events of 6, 5, 11, 1 and 10 belong to Charlotte, Rebecca, Colin, Laura and Timmy, respectively.
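In our case the outlier labels were assigned manually, but the numeric intuition behind "lying outside" the main group can be illustrated on the score example given earlier using Tukey's 1.5x IQR fences. This is only an illustration of the concept, not the labelling method used in this paper:

```python
from statistics import median

def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (Tukey's fences)."""
    s = sorted(values)
    half = len(s) // 2
    q1 = median(s[:half])    # median of the lower half (middle value excluded for odd n)
    q3 = median(s[-half:])   # median of the upper half
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

print(iqr_outliers([3, 22, 25, 23, 29, 33, 85]))  # [3, 85]
```

On the score example, the fences are [5.5, 49.5], so 3 and 85 are flagged, matching the intuition described above.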
Given that this paper is based on a supervised machine learning approach, the outlier identification task adds an extra feature, "Outlier", to our dataset, which is also known as the "label". The outlier identification task, or labelling the entire dataset, is done to provide a platform to evaluate the performance of our supervised machine learning approach. It also assists with our future work, which is based on an unsupervised machine learning approach. Additionally, labelling the entire dataset will assist us in implementing a semi-supervised approach in possible future work, i.e. an implementation of a supervised method in addition to the unsupervised method, which could have the potential to improve the accuracy rate for insider threat detection.
3.5 Data Transformation
Machine learning algorithms are unlikely to process any data correctly and/or produce any accurate results, such as predictions, if the data does not follow a similar type. Therefore, we employed the data transformation stage on our original dataset during the pre-processing phase to transform it into a similar data type. This includes two steps, data decomposition and data conversion, as follows.
3.5.1 Data Decomposition
In a dataset, decomposing some values into multiple parts will help a machine learning algorithm in capturing more specific relationships. For instance, data decomposition on a feature such as "date", represented as "Tues;01.04.2017", into day of a week and month of a year may provide more relevant information. In fact, data decomposition is the opposite of data reduction given that it will add more data to the original dataset.
In this paper, we employ data decomposition on the Date-Time, application, action and resource features in our dataset to capture more specific relationships between individual user behaviour and insider threat.
For instance, in our original dataset, Date-Time represented as 2016-02-23T16:26:33Z indicates a user event that happened on the 23rd of February 2016 at 16:26:33 hours. This includes two characters: T, which can be read as an abbreviation for Time, and Z, which stands for zero-time zone as it is offset by 0 from the Coordinated Universal Time (UTC).
After data decomposition, the Date-Time feature breaks down to: Year (2016), Month (02), Season (Winter), Day (23), WeekDay (Tuesday), WeekPortion (Middle of week), Hours (16), Minutes (26), Seconds (33) and TimeOfDay (Afternoon). Likewise, the application feature is decomposed into application type (i.e. system application or user application) and application ID, the action feature is decomposed into action type (i.e. related to processes or related to files) and action ID, and the resource feature is decomposed into resource location (i.e. local computer, local network, removable memory or Cloud), file extension (e.g. .txt, .zip or .bkc) and folder depth (e.g. 1, 2 or 3).
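The Date-Time decomposition described above can be sketched as follows. Note that the season and time-of-day boundaries in this sketch are illustrative assumptions (the paper does not state its exact cut-offs), chosen so that the running example reproduces the values listed above.

```python
from datetime import datetime

# Season and time-of-day boundaries are illustrative assumptions,
# not cut-offs taken from the paper.
SEASONS = {12: "Winter", 1: "Winter", 2: "Winter",
           3: "Spring", 4: "Spring", 5: "Spring",
           6: "Summer", 7: "Summer", 8: "Summer",
           9: "Autumn", 10: "Autumn", 11: "Autumn"}
PORTIONS = ["Beginning of week", "Middle of week", "Middle of week",
            "End of week", "End of week", "Weekend", "Weekend"]

def decompose(ts):
    """Break an ISO 8601 Date-Time string into the ten derived fields."""
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")  # strips the T and Z markers
    tod = ("Night", "Morning", "Afternoon", "Evening")[
        0 if dt.hour < 6 else 1 if dt.hour < 12 else 2 if dt.hour < 18 else 3]
    return {"Year": dt.year, "Month": dt.month, "Season": SEASONS[dt.month],
            "Day": dt.day, "WeekDay": dt.strftime("%A"),
            "WeekPortion": PORTIONS[dt.weekday()],
            "Hours": dt.hour, "Minutes": dt.minute, "Seconds": dt.second,
            "TimeOfDay": tod}

print(decompose("2016-02-23T16:26:33Z"))
```

For the running example, this yields Tuesday, Winter, Middle of week and Afternoon, as in the decomposition above.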
Table 3. Outlier identification
Employee Name | Total Events (441/2643) | Outlier Events (33/2643)
Charlotte | 146 | 6
Rebecca | 75 | 5
Colin | 57 | 11
Laura | 135 | 1
Timmy | 28 | 10
3.5.2 Data Conversion
Often, machine learning algorithms require the data in specific formats before it is fed into the model. This can be done during the data conversion task, which is the conversion of data from one format to another, e.g. from categorical data to numeric values. Categorical data are data that contain label values, often limited to a fixed set, rather than numerical values. Some examples include a "weekday" variable with the values "Monday", "Tuesday" and "Wednesday", or a "season" variable with the values "spring", "summer" and "autumn". Given that most machine learning algorithms can't operate on categorical data directly and require all input and output variables to be numeric, integer encoding (also known as numeric conversion) and one-hot encoding can be used. In integer encoding, each categorical value is assigned an integer value, e.g. "Monday" is 1, "Tuesday" is 2, and "Wednesday" is 3. In one-hot encoding, a binary variable is added for each value, so that exactly one bit is set per encoding, e.g. in the "weekday" variable example above, "Monday" is 100, "Tuesday" is 010, and "Wednesday" is 001.
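The two encodings on the "weekday" example can be sketched as:

```python
days = ["Monday", "Tuesday", "Wednesday"]

# Integer encoding: each category gets a small integer (1, 2, 3, ...).
int_code = {d: i + 1 for i, d in enumerate(days)}

# One-hot encoding: one binary indicator per category, exactly one bit set.
one_hot = {d: [1 if j == i else 0 for j in range(len(days))]
           for i, d in enumerate(days)}

print(int_code["Tuesday"])  # 2
print(one_hot["Tuesday"])   # [0, 1, 0]
```

Integer encoding is compact but imposes an artificial ordering on the categories; one-hot encoding avoids that at the cost of one extra variable per category.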
In this paper, the data conversion task has been employed on 13 out of 20 features of our dataset. This includes numeric conversion for: Date-Time, season, weekday, week portion, time of day, machine_ID, user_ID, application_type, application_ID, action_type, action_ID, resource_location and file extension.
For example, in our main dataset, the Date-Time feature for each event contains a standard date and time format including both numeric and text characters. For instance, 2016-02-23T16:26:33Z represents a user event that happened on the 23rd of February 2016 at 16:26:33 hours. This includes two characters: T for Time, and Z for zero-time zone as it is offset by 0 from the Coordinated Universal Time (UTC). For Date-Time conversion, we first split this feature into date and time. This gives us 2016-02-23T and 16:26:33Z for our running example above. We then removed the T and Z characters and formatted the date to 23/02/2016, while the time stays the same: 16:26:33. We then converted the date and time to a UNIX timestamp, which results in 1456185600 and 59193 for the date and time, respectively. In the last step, we combined them together, which gives us 1456244793. A UNIX timestamp is the number of seconds between a particular date and the Unix Epoch on January 1st, 1970 at UTC. We ran the Unix timestamp conversion on the Date-Time feature for the entire dataset.
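The two-step conversion above (the date at midnight UTC, plus the seconds elapsed within the day) can be reproduced with the standard library:

```python
from datetime import datetime, timezone

def to_unix(date_time):
    """Convert a Date-Time string to (date part, time part, combined),
    mirroring the two-step conversion described above."""
    dt = datetime.strptime(date_time, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    date_part = int(dt.replace(hour=0, minute=0, second=0).timestamp())  # midnight UTC
    time_part = dt.hour * 3600 + dt.minute * 60 + dt.second              # seconds into the day
    return date_part, time_part, date_part + time_part

print(to_unix("2016-02-23T16:26:33Z"))  # (1456185600, 59193, 1456244793)
```

For the running example, this reproduces the 1456185600, 59193 and 1456244793 values given above.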
Besides this, as we discussed in the previous section, the Date-Time feature was further split into year, month, season, day, weekday and week portion (for the Date feature) and hours, minutes, seconds and time of day (for the Time feature) during the data decomposition task. Therefore, we allocate a 1-4 range to season, 1-7 to weekday, 1-4 to week portion and 1-4 to time of day in the data conversion task. However, year, month, day, hours, minutes and seconds remain unchanged during this task.
Furthermore, in our original dataset, each machine_ID is a combination of numbers, uppercase and lowercase letters, e.g. 4RcZBZz. We defined the 15,000-19,000 range for machine_ID data conversion, from which an integer value is assigned to each computer. For example, 16002 has been assigned to a computer with an id of 4RcZBZz. Likewise, the 1,000-1,500 range is allocated to the user_IDs in our dataset. This means a unique numerical value for Administrator, Charlotte, Rebecca, Laura, Timmy and Colin, e.g. 1301 has been assigned as a user_ID to Colin.
Similarly, the 20-99 range has been assigned to application_ID, 200-499 to action_ID, 0-4 to resource_location and 0-19 to file extension. Moreover, in terms of application_type and action_type, 0 represents system applications and process-related actions, and 1 represents user applications and file-related actions, respectively.
To put it in a nutshell, seven features (year, month, day, hours, minutes, seconds and folder depth) remain unchanged during the data conversion task. All the arrangements for data conversion are identified in Table 4.
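A range-based mapping like the ones above can be implemented with a small codebook that hands out integers from a reserved range. This is a sketch only: the code a value receives here follows first-seen order, whereas the paper's actual assignments (e.g. 16002 for 4RcZBZz, 1301 for Colin) come from its own mapping.

```python
import itertools

def build_codebook(values, start, stop):
    """Assign each distinct categorical value a unique integer from the
    reserved range [start, stop). Assignment is by first-seen order, which
    is an arbitrary illustrative choice, not the paper's actual mapping."""
    counter = itertools.count(start)
    book = {}
    for v in values:
        if v not in book:
            code = next(counter)
            if code >= stop:
                raise ValueError("reserved range exhausted")
            book[v] = code
    return book

# user_IDs drawn from the 1,000-1,500 range reserved above.
users = build_codebook(["Administrator", "Charlotte", "Rebecca",
                        "Laura", "Timmy", "Colin"], 1000, 1500)
print(users["Colin"])
```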
3.6 Data Reduction
This stage of the data pre-processing phase includes the following steps (Data pre-processing, 2018). "Reducing the number of attributes": for instance, removing irrelevant attributes or using principal
component analysis to search for a lower-dimensional space that can best represent the data. "Reducing the number of attribute values": for example, by clustering, where similar values are grouped in clusters. "Reducing the number of data samples": for instance, within the majority class, to reduce the degree of imbalanced data distribution and produce a balanced dataset. This is also known as data balancing. In this paper, we only employ data balancing during the data reduction stage, as follows.
3.6.1 Data Balancing
In machine learning and data mining, an imbalanced class distribution is a scenario where the number of instances belonging to one class is significantly lower than those that belong to the other classes. In such a scenario, as we have in this paper, the classification could be biased and inaccurate. This happens due to the fact that machine learning algorithms are usually designed to increase accuracy by reducing error. Therefore, to achieve this, they do not consider the class distribution, class proportion, or balance of classes. There are various approaches for solving class imbalance problems. These fall into two areas: 1) resampling and 2) classifier modifications. In this paper, we focus on the first category, where we resample the original dataset to provide two balanced classes (i.e. malicious and benign). For this, we use Weka's SpreadSubsample filter (Weka. Class SpreadSubsample, 2018), where the original dataset must first fit entirely in the memory. This includes the entire imbalanced data, which contains malicious and benign events. Then we need to specify the maximum "spread" between the rarest class (i.e. "malicious", labelled as class "1", with 33 events) and the most common class (i.e. "benign", labelled as class "0", with the remaining 2610 of the 2643 events). For instance, a distributionSpread of "0" in SpreadSubsample means no maximum spread, "1" means a uniform distribution, and "10" means allowing at most a 10:1 ratio between the classes. In this paper, we chose "1", representing a uniform distribution between the two classes. This allows us to keep all the malicious events (i.e. 33 malicious events) plus an equal number of benign events selected randomly (i.e. 33 benign events). At the end, we have 66 events in total with a uniform distribution of malicious and benign events (i.e. 33 malicious and 33 benign events).
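The effect of a uniform (1:1) spread can be sketched as a plain random undersampling of the majority class. This mimics the behaviour of Weka's SpreadSubsample with distributionSpread = 1 as described above; it is not Weka's implementation itself.

```python
import random

def uniform_subsample(events, labels, seed=1):
    """Undersample the majority (benign, 0) class to match the minority
    (malicious, 1) class, giving a 1:1 balanced dataset."""
    rng = random.Random(seed)
    malicious = [e for e, y in zip(events, labels) if y == 1]
    benign = [e for e, y in zip(events, labels) if y == 0]
    kept_benign = rng.sample(benign, len(malicious))  # random benign subset
    balanced = [(e, 1) for e in malicious] + [(e, 0) for e in kept_benign]
    rng.shuffle(balanced)
    return balanced

# 33 malicious + 2610 benign events, as in the original dataset.
events = list(range(2643))
labels = [1] * 33 + [0] * 2610
balanced = uniform_subsample(events, labels)
print(len(balanced))  # 66
```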
4. IMPLEMENTATION, RESULTS AND ANALYSIS
In this section, we explain our captured results after running a number of popular machine learning algorithms with different parameters on our datasets. This is done on two copies of our datasets: balanced and imbalanced. Both datasets have been passed through the data pre-processing phase, including outlier identification, data decomposition and data conversion. However, as the name says, the balanced dataset includes an extra step of data balancing. We use Weka 3 (Weka, 2018) in our experiments, which is a popular and powerful tool for data mining and machine learning. It has a collection of machine learning algorithms and contains tools such as data pre-processing, classification and visualisation. Weka is also well-suited for developing new machine learning schemes. The results captured from our machine learning schemes are analysed as follows.
4.1 Supervised Machine Learning Results
Supervised learning can be categorised into classification problems, when the output is a class, and regression problems, when the output is a real value. Some popular supervised machine learning algorithms are: Naive Bayes (NB) for regression and classification problems, Linear Regression (LR) for regression problems, Random Forest (RF) for regression and classification problems, Support Vector Machines (SVM) for classification problems, and Neural Networks (NNs) for regression and classification problems. Given our datasets from the earlier section, the problem in this paper is a classification problem, as we want our supervised approach to explicitly identify a given event either as benign (i.e. it belongs to class "0") or malicious (i.e. it belongs to class "1"). In this section, we employ a number of popular machine learning algorithms, for instance J48 decision tree, Support Vector
Machine (SVM), Naïve Bayes (NB), and Random Forests (RF), configured with different parameters, along with several techniques (e.g. single classifiers and balanced vs. imbalanced datasets).
It is worth mentioning that, for both balanced and imbalanced datasets, we ran each experiment 10 times with seed values of 1 to 10 and split percentages of 90% for training and 10% for testing.
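The repeated seeded splits can be sketched as follows (a minimal illustration of the protocol, with the classifier training step omitted):

```python
import random

def split_90_10(events, seed):
    """One 90%/10% train/test split, shuffled with the given seed."""
    shuffled = events[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.9)
    return shuffled[:cut], shuffled[cut:]

events = list(range(66))    # e.g. the 66-event balanced dataset
for seed in range(1, 11):   # seed values 1 to 10, as in the experiments
    train, test = split_90_10(events, seed)
    # ...train and evaluate a classifier here, then average over the ten runs

print(len(train), len(test))  # 59 7
```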
4.1.1 Result Comparisons
There are several parameters available in the J48 classifier (Weka. Class J48, 2018). However, we considered its two important parameters: Confidence Factor, also known as C, and Minimum Number of Objects, also known as M. The C is used for pruning, where additional steps are added to look at what nodes or branches can be removed to make the tree smaller and easier to understand without
Table 4. Data conversion
Field name | Range
Date-Time | UNIX timestamp conversion
Date: year | unchanged; e.g. 2016
Date: month | unchanged; e.g. 2 (i.e. February)
Date: season | 1 = spring, 2 = summer, 3 = autumn, 4 = winter
Date: day | unchanged; e.g. 23
Date: weekday | numeric values 1, 2, 3, ..., 7 (e.g. 1 for Mon and 7 for Sun)
Date: week portion | 1 = beginning of week (Mon), 2 = middle of week (Tues-Wed), 3 = end of week (Thur-Fri), 4 = weekend (Sat-Sun)
Time: hours | unchanged
Time: minutes | unchanged
Time: seconds | unchanged
Time: time of day | 1 = morning, 2 = afternoon, 3 = evening, 4 = night
machine_ID | 15,000-19,000
user_ID | 1,000-1,500
application: type | 0 = system applications, 1 = user applications
application: ID | 20-49 for system applications (e.g. 43 for taskmgr.exe); 50-99 for user applications (e.g. 73 for backup4all.exe)
action: type | 0 = related to processes, 1 = related to files
action: ID | 200-399 for process-related actions; 400-499 for file-related actions
resource location | 0 = N/A, 1 = local drive, 2 = network drive, 3 = removable drive, 4 = Cloud
file extension | 0 = N/A, 1 = .ctm, 2 = .rls, 3 = .Ing, 4 = .txt, ..., 19 = .bkc
folder depth | unchanged; e.g. 2 for c:\backup\backup.ctm
affecting the performance too much. In general, smaller values for C incur more pruning. The M identifies the minimum number of instances or events per leaf. The default and configured values (also known as hyperparameters) for (C, M) are (0.25, 2) and (0.5, 10), respectively.
There are several parameters available in the SVM classifier (Weka. Class SMO, 2018). However, we considered its important parameter, Complexity, which is also known as C (Saarikoski, 2011). The C value tells the SVM optimisation how much we want to avoid misclassifying each training example. Generally, smaller values of C generate more misclassified examples. The default value for C is 1, which we configured to 100 in our experiments.
There are several parameters available in the NB classifier (Weka. Class NaiveBayes, 2018). However, we considered its two important parameters: Kernel Estimator (KE) and Supervised Discretisation (SD). The KE parameter uses a kernel estimator for numeric attributes rather than a normal distribution. The SD parameter uses supervised discretisation to convert numeric attributes to nominal ones. By default, the KE and SD values are set to false, and we changed them to true, one at a time, to monitor their impact on the results.
Addressing RQ1 and RQ3, in the first set of our supervised learning experiments, we want to identify the impact of: 1) balancing the dataset and 2) changing the C and M values for J48, the C value for SVM, and the KE and SD values for NB, on metrics such as classification accuracy, time taken to build the model, time taken to test the model, True Positive (TP) rate, False Positive (FP) rate, precision, recall, and f-measure.
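The per-class metrics listed above follow directly from the confusion counts. The counts in the sketch below are illustrative only, not results from the paper:

```python
def metrics(tp, fp, fn, tn):
    """Standard per-class metrics from confusion-matrix counts."""
    tp_rate = tp / (tp + fn)            # true positive rate, also the recall
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return tp_rate, fp_rate, precision, f_measure, accuracy

# Illustrative counts only, not results reported in the paper.
print(metrics(30, 10, 3, 2600))
```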
We ran each experiment ten times with seed values of 1 to 10, measured the weighted average, and then represented it as our result. For each experiment, we considered a 90% split for training and 10% for testing. This has been done for both the balanced and imbalanced datasets. Additionally, J48(DF) represents J48 with the default values of 0.25 for C and 2 for M, J48(2) represents J48 with the configured values of 0.5 for C and 10 for M, SVM(DF) represents SVM with the default value of 1 for C, SVM(2) represents SVM with the configured value of 100 for C, NB(DF) represents NB with the default values where SD and KE are both set to false, NB(SD) represents NB with the configured value of true for SD, and NB(KE) represents NB with the configured value of true for KE.
The classification accuracies (weighted average) on the balanced and imbalanced datasets are captured in Figure 1 and Figure 2, respectively. Addressing the presented results, all algorithms except NB(DF) show a better performance on the imbalanced dataset in comparison with the balanced dataset. The NB(DF) algorithm shows around a 1.03% classification accuracy boost when using the balanced dataset. Therefore, answering RQ1, balancing the dataset doesn't improve the classification accuracy overall. Additionally, answering RQ3, changing parameters for each classification algorithm has more effect when using the imbalanced dataset in comparison with the balanced one. In detail, using the imbalanced dataset, J48(2) performs better than J48(DF), SVM(DF) performs better than SVM(2), and NB(SD) performs best in comparison with NB(DF) and NB(KE). However, changing parameters only has an effect on the balanced dataset when we use the NB algorithms.
We present the time taken to build and the time taken to test the model on the balanced and imbalanced datasets in Figure 3 and Figure 4, respectively. Addressing the captured results, for all the algorithms on both datasets except NB(KE), the time taken to build the model is more, and sometimes significantly more, than the time taken to test the model. Answering RQ1, balancing the dataset does significantly improve the time taken to build the model for all seven algorithms. This is the same case for the time taken to test the model, except for J48(DF), for which we have an equal testing time on both datasets. Additionally, answering RQ3, changing parameters for each classification algorithm has an effect on both datasets, but it is rather unsteady when we compare them. In detail, on the imbalanced dataset, J48(DF) performs better than J48(2) and SVM(DF) performs better than SVM(2) in terms of time taken to build and to test the model. However, this is the opposite for the balanced dataset.
In Figure 5 and Figure 6, we present the true positive and the false positive rates on the balanced and imbalanced datasets, respectively. Addressing the captured results, for all the algorithms and on both datasets, the true positive rate is significantly higher than the false positive rate. For the balanced
dataset, the highest true positive rate equally belongs to J48(DF), J48(2), SVM(DF), and SVM(2), while the lowest rate belongs to NB(SD). We also realised a nearly 2.27% jump in the false positive rate when we changed the classifier to the NB algorithms. For the imbalanced dataset, the highest true positive rate belongs to SVM(DF), while all the NBs show poor performance in general. However, all the NBs perform better in terms of false positive rate in comparison with SVM and J48. Answering RQ1, balancing the dataset decreases the true positive rate except for NB(DF), and improves the false positive rate except for the NB algorithms. Additionally, answering RQ3, changing the algorithms' parameters improves the true positive and false positive rates, but with a stronger impact on the imbalanced dataset than the balanced dataset.
Figure 7 and Figure 8 represent the precision, recall, and f-measure for all seven algorithms on the balanced and imbalanced datasets, respectively. In general, these three metrics are higher on the imbalanced dataset in comparison with the balanced dataset, except for NB(DF) recall. Therefore, answering RQ1, balancing the dataset does not improve the precision, recall and f-measure overall. Given the imbalanced dataset, we noticed that changing the parameters to J48(2), SVM(DF) and NB(SD) does improve the precision, recall, and f-measure. However, this is the case only for NB(DF) and NB(KE) on the balanced dataset. Therefore, answering RQ3, changing the algorithms' parameters improves the precision, recall, and f-measure, but with a stronger impact on the imbalanced dataset than the balanced dataset.
Figure 1. Classification accuracy (weighted average) using the balanced dataset
Table 5 presents the summary of our results relating to Accuracy, Time taken to Build the model (TB), Time taken to Test the model (TT), True Positive (TP) rate, False Positive (FP) rate, Precision (P), Recall (R), and F-measure (F), with a focus on the balanced and imbalanced datasets. In this table, we identify the algorithms which produce the best results with "*" in terms of the metrics above, where "B" represents "Balanced Dataset" and "U" represents "Imbalanced Dataset". We score each and calculate the overall score for the table. Overall, our experiments on the imbalanced dataset (i.e. the original dataset) show a better performance in comparison with the balanced one, except for the time taken to build and the time taken to test the model. Therefore, answering RQ1, for each classifier, balancing the dataset doesn't improve metrics such as Classification Accuracy, True Positive rate, False Positive rate, Precision, Recall, and F-measure in general. However, it improves the time taken to build and the time taken to test the model.
Tables 6 and 7 represent the summary of our results regarding Classification Accuracy, TB, TT, TP rate, FP rate, P, R, and F, with a focus on the impact of employing different parameters in each classifier on the balanced and imbalanced datasets, respectively. In Table 6 and Table 7, "e" represents "Equal Results" and "*" represents "The Best Performance".
Addressing the two tables, running each classifier with different parameters has a stronger impact when using the imbalanced dataset, Table 7, compared with the results from the balanced dataset, Table 6. In detail, as Table 6 represents, for the balanced dataset, running J48 and SVM with different parameters results in almost equal outputs for Classification Accuracy, TB, TT, TP rate, FP rate, P,
Figure 2. Classification accuracy (weighted average) using the imbalanced dataset
Figure 3. Time taken to build/test the model using the balanced dataset
Figure 4. Time taken to build/test the model using the imbalanced dataset
Figure 5. True positive/false positive using the balanced dataset
Figure 6. True positive/false positive using the imbalanced dataset
Figure 7. Precision/recall/f-measure using the balanced dataset
Figure 8. Precision/recall/f-measure using the imbalanced dataset
R, and F each. However, this is not the case for NB, where NB(SD) gives the lowest score compared with NB(DF) and NB(KE).
Moreover, as Table 7 represents, for the imbalanced dataset the impact of employing different parameters on CA, TB, TT, TP rate, FP rate, P, R, and F is stronger. For instance, J48(2) and SVM(DF) perform better than J48(DF) and SVM(2), respectively. For the NB set of algorithms, NB(SD) performs the best, while NB(DF) and NB(KE) perform equally poorly.
Therefore, answering RQ3, running the classifiers, i.e. J48, SVM and NB, with different parameters will affect the CA, TB, TT, TP rate, FP rate, P, R, and F; however, the impact is significantly stronger when using the imbalanced dataset compared with the balanced dataset.
5. CONCLUSION AND FUTURE WORK
In this paper, we addressed issues regarding extremely imbalanced datasets with a focus on the insider threat problem, in which the minority class belongs to the malicious activities and the majority class belongs to the benign activities. For this, we provided a comprehensive review and implementation of the data pre-processing phase, which is a vital step in any data mining/machine learning project. This includes a popular balancing technique known as spread subsample. Additionally, we widely explained six demo setups for the general data breach scenario, including: data theft, privileged user data breach, endpoint security processing, shadow IT risk, data security, and sensitive folder breach. We also investigated several parameters for our chosen classifiers, studied the impact of configuring each classifier with these parameters, and compared their performance with the default setups. For our experiments, we identified eight performance metrics, including: Classification Accuracy (CA), Time taken to Build the model (TB), Time taken to Test the model (TT), True Positive (TP) rate, False Positive (FP) rate, Precision (P), Recall (R), and F-measure (F), along with a few popular machine learning algorithms (i.e. J48 decision tree, SVM, NB, and RF).
We then raised three research questions to answer in this study: 1) the impact of balancing the dataset during the data pre-processing phase (i.e. using the spread subsample technique) on the performance metrics mentioned above, 2) the important parameters for the chosen classifiers (i.e. J48 decision tree, SVM, NB, and RF), and 3) the impact of configuring our chosen classifiers with the identified parameters in terms of performance metrics in comparison with the default setups for the balanced and imbalanced datasets. Answering our research questions, we realised that balancing the dataset did not improve CA, TP rate, FP rate, P, R, and F in general, but it did improve the time taken to build the model and the time taken to test the model. Additionally, we realised that running the classifiers with different parameters affected the performance metrics (i.e. CA, TB, TT, TP, FP, P, R, and F); however, the impact was significantly stronger on the imbalanced dataset than on the balanced dataset.
For our future work, we will investigate all the possible and popular balancing techniques and will implement them on our extremely imbalanced insider threat dataset. We will then analyse and compare their impacts in terms of the performance metrics mentioned above (i.e. CA, TB, TT, TP, FP, P, R, and F).
ACKNOWLEDGMENT
The authors wish to acknowledge the support of Edinburgh Napier University and The Cyber Academy for funding this work, and (ZoneFox, 2017) for providing the demo scenarios.
Table 5. Supervised machine learning results; Classification Accuracy (CA), Time taken to Build/Test the model (TB, TT), True Positive (TP), False Positive (FP), Precision (P), Recall (R), and F-measure (F)
In this table, "*" marks, for each of the seven classifiers and each metric, whether the balanced (B) or imbalanced (U) dataset gave the better result. Summed over the seven classifiers, the total scores are: Classification Accuracy, B = 1 vs. U = 6; TB and TT, B = 14 vs. U = 0; TP and FP rates, B = 5 vs. U = 9; Precision, Recall and F-measure, B = 1 vs. U = 20.
Table 6. Supervised machine learning results; Accuracy, Time taken to Build/Test the model (TB, TT), True Positive (TP), False Positive (FP), Precision (P), Recall (R), and F-measure (F)
Balanced dataset: "e" denotes equal results and "*" the best performance across Accuracy, TB, TT, TP, FP, Precision, Recall and F-measure. Total scores: J48(DF) = 1, J48(2) = 1, SVM(DF) = 0, SVM(2) = 1, NB(DF) = 5, NB(SD) = 2, NB(KE) = 5.
Table 7. Supervised machine learning results; Accuracy, Time taken to Build/Test the model (TB, TT), True Positive (TP), False Positive (FP), Precision (P), Recall (R), and F-measure (F)
Imbalanced dataset: "e" denotes equal results and "*" the best performance across Accuracy, TB, TT, TP, FP, Precision, Recall and F-measure. Total scores: J48(DF) = 3, J48(2) = 5, SVM(DF) = 7, SVM(2) = 1, NB(DF) = 1, NB(SD) = 5, NB(KE) = 1.
REFERENCES
Böse, B., Avasarala, B., Tirthapura, S., Chung, Y. Y., & Steiner, D. (2017). Detecting insider threats using radish: A system for real-time anomaly detection in heterogeneous data streams. IEEE Systems Journal, 11(2), 471-482. doi:10.1109/JSYST.2016.2558507

Bradley, N., Alvarez, M., Kuhn, J., & McMillen, D. (2015). IBM 2015 Cyber Security Intelligence Index.

Buczak, A. L., & Guven, E. (2015). A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Communications Surveys and Tutorials, 18(2), 1153-1176. doi:10.1109/COMST.2015.2494502

Cole, E. (2017). Defending Against the Wrong Enemy: 2017 SANS Insider Threat Survey. SANS.

Data pre-processing. (n.d.). Retrieved from http://www.cs.ccsu.edu/~markov/ccsu_courses/datamining-3.html

Enterprise, V. (2017). 2017 Data breach investigations report. Zonefox. Retrieved from https://zonefox.com

Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., & Herrera, F. (2011). A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man and Cybernetics. Part C, Applications and Reviews, 42(4), 463-484. doi:10.1109/TSMCC.2011.2161285

Gavai, G., Sricharan, K., Gunning, D., Hanley, J., Singhal, M., & Rolleston, R. (2015). Supervised and unsupervised methods to detect insider threat from enterprise social and online activity data. JoWUA, 6(4), 47-63.

Gheyas, I. A., & Abdallah, A. E. (2016). Detection and prediction of insider threats to cyber security: A systematic literature review and meta-analysis. Big Data Analytics, 1(1), 6. doi:10.1186/s41044-016-0006-0

Greenberg, S. (1988). Using unix: Collected traces of 168 users.

Hamed, T., Ernst, J. B., & Kremer, S. C. (2018). A survey and taxonomy on data and pre-processing techniques of intrusion detection systems. In Computer and Network Security Essentials (pp. 113-134). Cham: Springer. doi:10.1007/978-3-319-58424-9_7

Hanifah, F. S., Wijayanto, H., & Kurnia, A. (2015). Smote bagging algorithm for imbalanced dataset in logistic regression analysis (case: Credit of bank x). Applied Mathematical Sciences, 9(138), 6857-6865. doi:10.12988/ams.2015.58562

Hashem, Y., Takabi, H., GhasemiGol, M., & Dantu, R. (2016). Inside the Mind of the Insider: Towards Insider Threat Detection Using Psychophysiological Signals. Journal of Internet Services and Information Security, 6(1), 20-36.

Hassabis, D., & Silver, D. (2017). AlphaGo Zero: Learning from scratch. DeepMind.

Insiders, C. (2018). Insider threat - 2018 report. CA Technologies. Retrieved from https://www.cert.org/insider-threat/tools/

Legg, P. A., Buckley, O., Goldsmith, M., & Creese, S. (2015). Automated insider threat detection system using user and role-based profile assessment. IEEE Systems Journal, 11(2), 503-512. doi:10.1109/JSYST.2015.2438442

Lindauer, B., Glasser, J., Rosen, M., & Wallnau, K. C. (2014). Generating Test Data for Insider Threat Detectors. JoWUA, 5(2), 80-94.

Moradpoor, N., Brown, M., & Russell, G. (2017, October). Insider threat detection using principal component analysis and self-organising map. In Proceedings of the 10th International Conference on Security of Information and Networks (pp. 274-279). ACM. doi:10.1145/3136825.3136859

Parveen, P., Evans, J., Thuraisingham, B., Hamlen, K. W., & Khan, L. (2011, October). Insider threat detection using stream mining and graph mining. In Proceedings of the 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third International Conference on Social Computing (pp. 1102-1110). IEEE. doi:10.1109/PASSAT/SocialCom.2011.211

Parveen, P., & Thuraisingham, B. (2012, June). Unsupervised incremental sequence learning for insider threat detection. In Proceedings of the 2012 IEEE International Conference on Intelligence and Security Informatics (pp. 141-143). IEEE Press. doi:10.1109/ISI.2012.6284271
Pavlov, D. Y., Gorodilov, A., & Brunk, C. A. (2010, October). BagBoo: A scalable hybrid bagging-the-boosting model. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (pp. 1897–1900). ACM.
Ponemon Institute. (2016). 2016 Cost of Insider Threats: Benchmark Study of Organizations in the United States.
Saarikoski, H. (2011). Explanation of SMO Parameters? Retrieved from https://list.waikato.ac.nz/pipermail/wekalist/2010-December/050570.html
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... Dieleman, S. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484.
Singh, A., & Patel, S. (2014). Applying modified K-nearest neighbor to detect insider threat in collaborative information systems. Ijirset.Com, 3(6), 14146–14151.
Torrey, L., & Shavlik, J. (2009). Transfer Learning. In Handbook of Research on Machine Learning. Academic Press.
Tuor, A., Kaplan, S., Hutchinson, B., Nichols, N., & Robinson, S. (2017, March). Deep learning for unsupervised insider threat detection in structured cybersecurity data streams. In Workshops at the Thirty-First AAAI Conference on Artificial Intelligence. AAAI Press.
Veeramachaneni, K., Arnaldo, I., Korrapati, V., Bassias, C., & Li, K. (2016, April). AI^2: Training a big data machine to defend. In Proceedings of the 2016 IEEE 2nd International Conference on Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing (HPSC), and IEEE International Conference on Intelligent Data and Security (IDS) (pp. 49–54). IEEE Press.
Weka. Class J48. (n.d.). Retrieved from http://weka.sourceforge.net/doc.dev/weka/classifiers/trees/J48.html
Weka. Class NaiveBayes. (n.d.). Retrieved from http://weka.sourceforge.net/doc.dev/weka/classifiers/bayes/NaiveBayes.html
Weka. Class SMO. (n.d.). Retrieved from http://weka.sourceforge.net/doc.stable/weka/classifiers/functions/SMO.html
Weka. Class SpreadSubsample. (n.d.). Retrieved April 01, 2018, from http://weka.sourceforge.net/doc.dev/weka/filters/supervised/instance/SpreadSubsample.html
Weka. (n.d.). Retrieved from https://www.cs.waikato.ac.nz/ml/weka/
Williams, G., Baxter, R., He, H., Hawkins, S., & Gu, L. (2002, December). A comparative study of RNN for outlier detection in data mining. In Proceedings of the 2002 IEEE International Conference on Data Mining (pp. 709–712). IEEE Press. doi:10.1109/ICDM.2002.1184035
Naghmeh Moradpoor (PhD, MSc, BSc, CEH, FHEA, MBCS) is a lecturer in cybersecurity and networks, the programme leader for the MSc Advanced Security and Cybercrime, a core member of the Centre for Distributed Computing, Networking and Cybersecurity, and an active member of The Cyber Academy in the School of Computing (SoC) at Edinburgh Napier University. She is a Certified Ethical Hacker (CEH) and has been an active researcher since 2009. Naghmeh has a research background in machine learning for cybersecurity as well as next-generation broadband access networks. This includes Industrial Control System (ICS) and Critical Infrastructure Protection (CIP) security analysis and countermeasures, detection and classification of various cybersecurity attacks on computer networks, data analytics and decision support for cybersecurity, intrusion detection and prevention systems (IDS/IPS) for the Internet of Things (IoT), cyberbullying detection and prevention, and forensic similarity hashes. She holds a PhD (2013) in Computer Systems and Networks from Ulster University and an MSc in Computer Systems and Networks (2009) from Greenwich University, UK. Naghmeh was awarded Outstanding Woman in Cyber at the Scottish Cyber Awards 2017. These awards are a highly prestigious flagship event, the first of their kind in Scotland, where the best in cybersecurity are recognised for their passion, dedication, and hard work in raising standards in both industry and academia. She has actively participated in leading and organising many cybersecurity events.
This includes: the Post Graduate Cyber Security (PGCS’16) symposium sponsored by SICSA, the Poster Competition for Women in Cyber Security (PCWiCS’17) sponsored by SICSA, the Poster Competition in Cyber Security (PCiCS’18) sponsored by SICSA, The First International Conference on Cyber Security and Digital Forensics (CSDF’18) sponsored by SICSA NEXUS, and The First/Second International Workshop on Securing IoT Networks (SITN’17 & ‘18). Naghmeh has been an invited speaker for the Women in Science Festival - Dundee (2015), Critical Infrastructure Protection in the SEE-IT Summit (Cybersecurity track) - Serbia (2018) sponsored by SICSA and SICSA NEXUS, and Gender Equality & Women in Cybersecurity in the SEE-IT Summit - Serbia (2018) sponsored by SICSA and SICSA NEXUS. She has been invited to serve as a program chair, technical program committee member, and session chair of major international conferences, and as an editorial board member as well as a reviewer for prestigious journals since 2013. Some of these conferences and journals include: the 7th to 11th International Conference on Security of Information and Networks (SIN), the Special Session on Industrial Automation and Process System’s Security (ISIE-IAPSS’17) in conjunction with IEEE ISIE’17, The First/Second International Workshop on Secure Smart Societies in Next Generation Networks (SECSOC’17 & ‘18) in conjunction with IEEE iThings’17 & ‘18, Journal of Information Security and Applications (Elsevier), Journal of Computers & Security (Elsevier), Journal of Cyber Security Technology, IEEE Access Journal, and Journal of Optical Switching and Networking (Elsevier). In addition, Naghmeh has been involved in supervising Honours, summer intern, MSc, and PhD students since she achieved her PhD in 2013. Naghmeh is a Director of Studies (DoS) for two collaborative PhDs. The first is a PhD research work on ICS & CIP security analysis and countermeasures using machine learning, with a focus on clean water supply systems.
It is an ongoing and successful collaborative research work between Edinburgh Napier University (School of Computing) and Edinburgh Napier University (School of Engineering & Built Environment). Secondly, Naghmeh acts as a DoS for a PhD project on cyberbullying detection using victim’s sentiment analysis in social media. It is also a current collaborative and successful research work between Edinburgh Napier University (School of Computing) and King Abdulaziz University in the Kingdom of Saudi Arabia. Naghmeh and her students (Honours, MSc and PhD) have produced significant and ongoing joint research publications.