



A Dynamic Rule Creation Based Anomaly Detection Method for Identifying Security Breaches in Log Records

Jakub Breier∗¹ and Jana Branišová†²

¹ Physical Analysis and Cryptographic Engineering, Nanyang Technological University, Singapore
² Faculty of Informatics and Information Technologies, Slovak University of Technology, Bratislava, Slovakia

∗ [email protected]   † [email protected]

Abstract

Evidence of security breaches can be found in log files, created by various network devices in order to provide information about their operation. The huge amount of data contained within these files usually makes manual analysis infeasible; therefore, it is necessary to utilize automatic methods capable of revealing potential attacks.

In this paper we propose a method for anomaly detection in log files, based on data mining techniques for dynamic rule creation. To support parallel processing, we employ the Apache Hadoop framework, which provides distributed storage and distributed processing of data. Outcomes of our testing show the potential to discover new types of breaches, with plausible error rates below 10%. Also, rule generation and anomaly detection speeds are competitive with currently used algorithms, such as FP-Growth and Apriori.

1 Introduction

Information systems and computer networks provide information about their state and operation in the form of log records. These records are composed of log entries containing information related to a specific event, which can be related to security [6]. Potential security breaches can be revealed by analyzing log files and looking for anomalies that occurred at a certain time during the device operation. Organizations today utilize complex information systems producing large amounts of log messages, making it infeasible to analyze them manually. Automated methods, if implemented correctly, can help us to focus on the important records, reveal malicious code, non-privileged system access, resource misuse and policy breaches, and also identify the source of these activities [13].

As stated in NIST SP 800-92 [6], organizations should establish processes for log management, incorporate these processes into their policies, and clearly define the goals and requirements for log management.

There are various types of log messages; the commonly inspected ones usually originate from operating systems, network devices, web servers, and various applications using the network. The SANS Institute published a report stating the six most critical report categories that should be collected and analyzed in order to identify possible breaches¹:

1. Authentication and Authorization Reports

2. Systems and Data Change Reports

3. Network Activity Reports

4. Resource Access Reports

5. Malware Activity Reports

6. Failure and Critical Error Reports

In our work we focus on the third category, but the method can be easily extended to the other categories as well. Common types of security software capable of capturing these logs are antimalware software, IDS/IPS, remote access software, web proxies, vulnerability management software, authentication servers, routers, firewalls, and network quarantine servers [6]. To test our method we use a data set from an IDS and a set of Snort logs created by analyzing this set. We use anomaly detection, with the help of data mining techniques.

Our approach utilizes a dynamic rule creation method for detecting possible breaches. Human intervention is minimized by detecting anomalies as deviations from normal system behavior, and it is also possible to discover new types of breaches. The problem of large data sets is effectively handled by the Apache Hadoop framework, which we employ through its MapReduce method. The Hadoop-based algorithm processes data faster than an algorithm using a tree-based structure, and the framework makes it easy to add computing nodes for parallel log analysis.

The rest of the paper is organized as follows. Related work is stated in Section 2. Section 3 describes possibilities of anomaly detection, providing an overview of current methods in this field. Section 4 presents the design of our solution, followed by Section 5, where we state the results of the testing. Finally, Section 6 concludes this work.

¹http://www.sans.edu/research/security-laboratory/article/sixtoplogcategories

2 Related Work

There are several works proposing the usage of data mining methods in the log file analysis process or in detecting security threats in general.

One of the first approaches utilizing data mining techniques for intrusion detection was proposed by Lee and Stolfo in 1998 [7]. They implemented two data mining algorithms, the association rules algorithm and the frequent episodes algorithm, and showed that by analyzing audit data it is possible to discover patterns of intrusions.

Schultz et al. [9] proposed a method for detecting malicious executables using data mining algorithms. They used several standard data mining techniques in order to detect previously undetectable malicious executables, and found that some methods, like Multi-Label Naive Bayes Classification, can achieve very good results in comparison to standard signature-based methods.

Grace, Maheswari and Nagamalai [5] used data mining methods for analyzing web log files, for the purpose of getting more information about users. In their work they described log file formats, types and contents, and provided an overview of web usage mining processes.

Frei and Rennhard [3] used a different approach to search for anomalies in log files. They created the Histogram Matrix, a log file visualization technique that helps security administrators to spot anomalies. Their approach works on every textual log file. It is based on the fact that the human brain is efficient in detecting patterns when inspecting images, so the log file is visualized in a form in which it is possible to observe deviations from normal behavior.

Fu et al. [4] proposed a technique for anomaly detection in unstructured system logs that does not require any application-specific knowledge. They also included a method to extract log keys from free-text messages. Their false positive rate was around 13% using Hadoop and around 24% using SILK.

Makanju, Zincir-Heywood and Milios [8] proposed a hybrid log alert detection scheme, using both anomaly and signature-based detection methods.

3 Detection Methods

According to Siddiqui [10], there are three main detection methods used for monitoring malicious activities: scanning, activity monitoring and integrity checking. Scanning is the most widely used detection method, based on searching for pre-defined strings in files. An advanced version of scanning is heuristic scanning, which searches for unusual commands or instructions in a program. Activity monitoring simply monitors a file execution and observes its behavior; usually, APIs, system calls and hybrid data sources are monitored. Finally, integrity checking creates a cryptographic checksum for chosen files and periodically checks for integrity changes.
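To make the integrity-checking idea concrete, here is a minimal sketch in Java (the class and method names are ours, not from the paper; it assumes Java 17+ for HexFormat):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.util.HexFormat;

    // Minimal integrity-check sketch: compute a SHA-256 checksum for a chosen
    // file once, store it, and later compare to detect modifications.
    public final class IntegrityCheck {

        // Hex-encoded SHA-256 digest of the file contents.
        public static String checksum(Path file) throws Exception {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                                         .digest(Files.readAllBytes(file));
            return HexFormat.of().formatHex(digest);
        }

        // True if the file still matches the previously stored checksum.
        public static boolean unchanged(Path file, String storedChecksum) throws Exception {
            return checksum(file).equals(storedChecksum);
        }
    }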

According to OWASP², there are four main types of information in every log record: when, where, who and what. It is possible to further process the log data and identify anomalies or correlations among different device logs only if all four attributes are present.

Data mining is a relatively new approach for detecting malicious actions in a system. It uses statistical and machine learning algorithms on a set of features derived from standard and non-standard behavior. It consists of two phases: data collection and application of a detection method to the collected data. These two phases sometimes overlap, in the sense that the selection of a particular detection method can affect the data collection.

Data can be analyzed either statically or dynamically. Dynamic analysis observes a program or a system during execution; it is therefore precise but time consuming. Static analysis uses reverse-engineering techniques. It determines behavior by observing program structure, functionality or types of operation. This technique is faster and does not need as much computational power as dynamic analysis, but we get only an approximation of reality. We can also use hybrid analysis: first, static analysis is used, and if it does not achieve correct results, dynamic analysis is applied as well.

3.1 Detection Strategies

Every detection method includes a record analysis based on selected criteria. There are three categories of detection strategies:

• Anomaly detection - first, a profile of a system functioning properly according to its specification is created. Then the method compares gathered data against this profile in order to check whether non-standard behavior can be labeled as an attack [12].

• Misuse detection - when using this method, a system profile containing samples of various malicious programs is created. Then, an algorithm tries to find a match with this profile. It can complement the anomaly detection method, but it is unable to recognize new or unknown attack types.

• Hybrid detection - a combination of the last two strategies. There is no need to create a profile; this detection strategy creates data classifiers based on data from both safe and unsafe sources.

2https://www.owasp.org


3.2 Knowledge Discovery in Databases

Data mining is the analysis step of Knowledge Discovery in Databases (KDD) [2]. The main goal of KDD is to discover information in large data sets. KDD consists of the following stages:

• Data cleaning

• Data integration - combines multiple data sources

• Data transformation - data is transformed or consolidated into a form appropriate for data mining

• Data mining - applies data analysis and discovery algorithms thatproduce patterns over the data

• Pattern evaluation - identifies important patterns representing a certain level of knowledge

• Knowledge interpretation - uses visualization and presentation tech-niques for knowledge representation

3.3 Data Mining

Data mining is a process of (semi-)automatic knowledge extraction. It is possible to apply this process to data in various forms, e.g. quantitative, textual or multimedia. The main advantage of data mining is a fast way of gathering information. When considering the logging problem, data mining is applied after standard log analysis in order to provide an outcome of better quality. In this work we use two main data mining algorithm types:

• Classification is a process of mapping data to previously defined categories. These categories have certain properties, so that by examining the data it can be identified, with some degree of accuracy, to which category it belongs. The goal of text classification is to find an approximation of an unknown function Φ : D × C → {true, false}, where D is a set of text documents and C = {c1, c2, . . . , c|C|} is a set of predefined categories. The function Φ holds true for <di, cj> if the document di belongs to category cj. A function D × C → {true, false} which approximates Φ is known as a classifier. (A short worked example follows this list.)

There are several algorithms that can be used for this purpose. Tavallaee et al. [11] made a comparative analysis of several algorithms for log file data mining. According to their results, decision tree algorithms and the Multi-Layer Perceptron method, which builds a model of an artificial neural network, gained a success rate of 90%. Naive Bayesian a posteriori classification had a success rate of 80%, and the Support Vector Machine method achieved 60%.


• Clustering is a method that maps data into groups. These groups are not defined at the beginning, therefore we do not know their properties. A group, called a cluster, is defined by the data it contains. In the end, instances belonging to one cluster are very similar. This data mining algorithm can therefore help to reveal the natural structure of the analyzed data.

If there are elements that do not belong to any cluster, we can usually observe significant differences in comparison with data belonging to some cluster. These elements can reveal non-standard system behavior, an anomaly; such elements are called outliers. Thanks to the capability of this algorithm to discover outliers, we can detect new or modified attack behavior that cannot be recognized by standard detection methods.
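As a short worked example of the classification formalism above (the documents and categories are hypothetical): let D = {d1, d2} and C = {normal, attack}. A classifier Φ' with Φ'(d1, normal) = true and Φ'(d1, attack) = false assigns document d1 to the category normal; Φ' approximates Φ well if such assignments agree with the true category membership.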

3.4 Evaluation of Data Mining Algorithms

In order to determine how well the result corresponds to reality, it is necessary to evaluate the implemented data mining algorithm. For each type of algorithm, a different evaluation method is used, because the properties of the algorithms differ greatly. For example, one algorithm can be very efficient in processing numerical data sets, while another can be useful in sorting text documents into classes. This part provides an overview of the evaluation of the two previously described data mining algorithm types. These methods can also be used for comparing particular algorithms when the same data set is used:

• Classification evaluation: evaluation of these algorithms can be done in several ways. Usually, an approach based on measuring true positives, false positives, true negatives, and false negatives is used. A receiver operating characteristic (ROC) can be used for this purpose, which is a curve created by plotting the true positive rate against the false positive rate. The ROC for the ideal case and for a random guess is depicted in Figure 1.

• Clustering evaluation: for clustering algorithms, there are two main evaluation techniques. Internal evaluation means that the result is evaluated based on the clustered data itself; examples of internal evaluation methods are the Davies-Bouldin index and the Dunn index. On the other hand, external evaluation is based on data that was not used for clustering, like class labels and external benchmarks; examples of such methods are the Rand measure, F-measure, and the Jaccard index.


Figure 1: Receiver operating characteristic (ROC) curve, plotting the true positive rate against the false positive rate for the ideal case and for a random guess.

3.5 Parallel Processing

A huge amount of log data is generated every day, and it is presumed that this amount will grow over time. Therefore, it is necessary to improve the log analysis process and make it more effective. We have chosen the Apache Hadoop³ technology for this purpose, with the MapReduce [1] programming model. MapReduce is used for processing large data sets by means of two functions. The Map function processes the data and generates a list in key-value form. The Reduce function can then be used for joining all the values with the same key. The Hadoop architecture is based on distributing the data to every node in the system. The resulting model is simple, because MapReduce handles the parallelism, so the user does not have to take care of load balancing, network performance or fault tolerance. The Hadoop Distributed File System can then effectively store and use the data on multiple nodes.

³http://hadoop.apache.org/
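As a minimal sketch of this model, the following hypothetical Hadoop job counts log lines per source IP (the whitespace-separated field position is our assumption for illustration, not the format of our log files):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: each log line produces a (sourceIp, 1) key-value pair.
    class IpCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\\s+");
            if (fields.length > 2) {
                ctx.write(new Text(fields[2]), ONE); // assumption: third field is the source IP
            }
        }
    }

    // Reduce: all values with the same key (IP) are joined into one sum.
    class IpCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text ip, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(ip, new IntWritable(sum));
        }
    }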

3.6 Static and Dynamic Rules

There are two general approaches to rule creation. Static rule-based correlation is based on static rules, defined by a user before the analysis. The other approach is dynamic rule creation, which uses algorithmic data mining techniques to define rules. We will briefly describe both approaches in this subsection.

3.6.1 Static Rule-Based Correlation

First, we need to create a scenario of the situation we want to simulate. If, for example, there is a process running on one device and another process running on a different device at the same time, and the combination of both constitutes a security breach, we have to design rules that model this scenario. On top of these rules we have to implement correlation rules deciding whether the situation we are dealing with is an attack or not. Rules can contain many parameters, e.g. time frame, pattern repetition, service type, or port. An algorithm then checks the data from log files and finds the attack scenarios.

The main advantage of such an approach is the ability to discover attacks by analyzing correlations and therefore to reveal breaches which would otherwise be hard to detect. There are specific languages enabling the creation of rules, and also tools that provide effective and easy-to-use rule creation. For companies, it is easier to buy packages of predefined rules than to employ many system analysts capable of creating organization-specific rules.

The main disadvantage of this approach is its high price, especially for maintenance. Modelling each attack scenario is a non-trivial task; there are many possibilities for executing the same attack type, and sometimes these are non-deterministic. Also, attacks are evolving and new types are being created every day. That means it is necessary to create new rules over time, and even then there is still a chance that some undiscovered, unspecified attacks will happen without being noticed. There are companies providing Intrusion Detection Systems and Security Information and Event Management systems, together with periodic maintenance.

3.6.2 Dynamic Rule Creation

This approach has been utilized for anomaly detection for less than twenty years. Generated rules are usually in an if-then form. First, an algorithm creates patterns that can be further processed into a set of rules specifying which action should be taken.
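For illustration, a generated rule might take a form like the following (the attributes and thresholds here are hypothetical, not taken from an actual rule set): if service = telnet and session time > 01:00:00 and destination port = 23, then label the transaction as a potential breach.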

Methods based on dynamic rule creation can remove the necessity of continually updating breach patterns by searching for potential breaches that are not yet known. For example, in [9] the authors implemented a data mining algorithm based on a cross-validation technique, achieving a breach detection rate more than twice as high as current static pattern based methods.

The main disadvantage is the complexity of analyzing high-dimensional data, such as log files. There are algorithms capable of handling such data, but usually their space and computational complexity is high. Therefore, it is necessary to reduce data dimensions as much as possible.

Table 1: Binary Transformation Example.

  Original records:
    Session Time | Type of Service
    00:00:02     | telnet
    00:00:04     | http
    00:00:05     | telnet

  Binary form:
    ST1 ST2 ST3 | ToS1 ToS2
     1   0   0  |  1    0
     0   1   0  |  0    1
     0   0   1  |  1    0

  Legend: ST1 = 00:00:02, ST2 = 00:00:04, ST3 = 00:00:05 (Session Time);
  ToS1 = telnet, ToS2 = http (Type of Service).

4 Design

The main idea behind the design of our solution is to minimize false positives and false negatives and to make the anomaly identification process faster.

The steps of the algorithm are as follows. First, a training phase is performed and rules are made from the training data set. The outcome of this phase is an anomaly profile that will be used to detect anomalies in network device log files. For creating rules, a log file is divided into blocks instead of rows. A block is identified by the starting time, session duration and type of service; we will use the term 'transaction' for a particular block. This approach allows us to create a rule based on several log files from different devices or systems, so that one transaction can contain information from various sources. For creating uniform data sets, which can be processed by different algorithms, each transaction is transformed into a binary string form. For spatial recognition of a log record within a transaction, each record in the original log file is given a new attribute, the transaction ID. The creation of transactions and rules is depicted in Fig. 2.

Since the detection program must be able to process various log file formats, we decided to use configuration files which help to determine the position of each attribute.

4.1 Data Transformation

Data transformation includes the creation of a new data set that contains binary data only. The advantage of such a data representation is the ability to process it with various algorithms for association rule creation. An example of such a transformation is depicted in Table 1.
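A minimal sketch of this transformation in Java, using the category dictionaries of Table 1 (the class and method names are ours):

    import java.util.List;

    // One-hot encoding of a (session time, type of service) record, as in Table 1.
    public final class BinaryTransform {
        private static final List<String> SESSION_TIMES =
                List.of("00:00:02", "00:00:04", "00:00:05"); // ST1..ST3
        private static final List<String> SERVICES =
                List.of("telnet", "http");                   // ToS1..ToS2

        public static String encode(String sessionTime, String service) {
            StringBuilder bits = new StringBuilder();
            for (String st : SESSION_TIMES) bits.append(st.equals(sessionTime) ? '1' : '0');
            for (String s : SERVICES) bits.append(s.equals(service) ? '1' : '0');
            return bits.toString();
        }

        public static void main(String[] args) {
            System.out.println(encode("00:00:02", "telnet")); // prints 10010, the first row of Table 1
        }
    }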


Figure 2: Transaction and Rule Creation. (Flowchart: load configuration; load a file from the training data set; store records belonging to the same transaction; once all records are processed, transform each transaction and create a rule if it satisfies the anomaly profile conditions.)

To avoid the problem of a large number of dimensions caused by the binary representation of log records, we propose a data reduction. This reduction is achieved by sorting values into categories and using an interval representative instead of a scalar or time value. The binary string contains the value 1 for values which are present in the record and the value 0 otherwise. However, some values, such as IP addresses or ports, cannot be reduced.
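A sketch of such a reduction for the 'Session Time' attribute, using the four intervals that appear later in Table 3 (the method name and the use of java.time are ours):

    import java.time.LocalTime;

    // Maps an exact session time to one of four interval categories,
    // reducing 884 distinct values to 4 (cf. Table 3 in Section 5.1).
    public final class SessionTimeIntervals {
        public static int category(LocalTime t) {
            if (t.equals(LocalTime.of(0, 0, 1))) return 0;         // exactly 00:00:01
            if (t.equals(LocalTime.of(0, 0, 2))) return 1;         // exactly 00:00:02
            if (t.compareTo(LocalTime.of(1, 0, 0)) <= 0) return 2; // (00:00:02, 01:00:00]
            return 3;                                              // (01:00:00, 18:00:00)
        }
    }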

4.2 Transaction and Rule Creation

The transaction and rule creation algorithm works as follows. It loads records from log files line by line and stores them in the same block if they were created in the same time division, within the same session, and if the IP addresses and ports are identical. If records can be identified as related, a transaction is created. Then it is decided whether this transaction fulfills the conditions for inclusion in the anomaly profile. If yes, a new rule is created; if not, this transaction is ignored in the further rule creation process.
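A sketch of the grouping step (Java 16+ records; the field names are illustrative, since the real attribute positions come from the configuration file):

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Records sharing the same time division, session, IP addresses and ports
    // end up in the same block; the map key serves as the transaction ID.
    public final class TransactionBuilder {
        public record LogRecord(String timeDivision, String session,
                                String srcIp, String dstIp, int srcPort, int dstPort) {
            String key() {
                return String.join("|", timeDivision, session, srcIp, dstIp,
                        String.valueOf(srcPort), String.valueOf(dstPort));
            }
        }

        public static Map<String, List<LogRecord>> group(Iterable<LogRecord> records) {
            Map<String, List<LogRecord>> transactions = new LinkedHashMap<>();
            for (LogRecord r : records) {
                transactions.computeIfAbsent(r.key(), k -> new ArrayList<>()).add(r);
            }
            return transactions;
        }
    }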

A rule contains attributes in a binary form that are defined in the configuration file. It always contains some basic attributes related to time and session parameters, and also the ID of the device from which the particular log record originates.

The anomaly finding algorithm first loads the set of previously created rules from the database. Then it sequentially processes the log files intended for analysis and creates transactions from these files. Each transaction is then compared with the set of rules, and if it is identified as an anomaly, it is stored for further observation.
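A sketch of the matching step, assuming transactions and rules are both represented as the binary strings of Section 4.1 and that the rule set forms the anomaly profile (so a transaction matching a rule is flagged):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    // Flags transactions that match a rule from the anomaly profile and
    // stores them for further observation.
    public final class AnomalyFinder {
        public static List<String> findAnomalies(Iterable<String> transactions, Set<String> rules) {
            List<String> anomalies = new ArrayList<>();
            for (String t : transactions) {
                if (rules.contains(t)) anomalies.add(t);
            }
            return anomalies;
        }
    }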

4.3 Processing of Large Data Sets

As stated in Section 3.5, we decided to use the Hadoop technology with the MapReduce programming model to process large data sets. Hadoop enables us to easily add processing nodes on independent device types. After the program starts, the JobTracker and TaskTracker processes are started, which are responsible for job coordination and for the execution of the Map and Reduce methods. JobTracker is a coordinator process; there exists only one instance of this process, and it manages the TaskTracker processes, which are instantiated on every node. The MapReduce model is depicted in Fig. 3; in our case, however, only one Reduce method instance is used. First, a file is loaded from the Hadoop Distributed File System (HDFS) and divided into several parts. Each Map method accepts data from a particular part line by line. It processes the lines and stores them until a rule is created (if all the conditions are met). The rule is then further processed by the Reduce method, which identifies redundant rules; if a rule is unique, it is written into the HDFS.
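A sketch of the deduplication step in the Reduce method (assuming the Map side emits each created rule as a binary-string key; the class name is ours):

    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Identical rules arrive under the same key after the shuffle phase,
    // so writing the key once per group removes all redundant copies.
    class RuleDedupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text rule, Iterable<NullWritable> occurrences, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(rule, NullWritable.get()); // each unique rule is written to HDFS once
        }
    }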

Figure 3: MapReduce Algorithm. (Input from HDFS is split into parts; map tasks process the splits; intermediate results are sorted/merged and copied to reduce tasks, whose output is written back to HDFS.)

To allow nodes to access the same files at the same time, but without loading them onto each node separately, the Hadoop library for a distributed cache is used⁴.

⁴https://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/filecache/DistributedCache.html
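A hedged sketch of the job setup (Job.addCacheFile is the newer front end to the DistributedCache API referenced in the footnote; the file path is hypothetical):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public final class CacheSetup {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "log-rule-generation");
            // Registers the shared rule file once; every node then reads it from
            // the distributed cache instead of loading its own copy.
            job.addCacheFile(new URI("hdfs:///cache/rules.txt")); // hypothetical path
        }
    }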


5 Testing

Our anomaly detection method was implemented in the Java programming language. For testing, the following setup was used:

• CPU: Intel Core i7-4500U (1.8-3.0 GHz, 4MB cache, 2 cores, 4 threads)

• RAM: 8 GB

• Operating system: Ubuntu 12.04

For testing purposes, two data sets were used: the 1998 DARPA Intrusion Detection Evaluation Set⁵, which was created by monitoring a system for two weeks, and Snort logs, created by analyzing the DARPA data set⁶. The Snort logs contain information on whether an attack was performed or not. Based on that, we were able to determine whether our anomaly detection method was able to successfully identify an intrusion.

Testing was performed on a set of 442 181 log records, made by merging the DARPA and Snort data sets. We split this data set into ten subsets for cross-validation purposes. In cross-validation, the set is divided into subsets of similar sizes, and each subset is used as many times as there are subsets: in each test, one subset is used as the test set and the other subsets are used as training sets. For our validation, we split the main data set in such a way that each subset contained log records from every day on which monitoring was performed.
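A sketch of this split (the record representation and method names are ours): records are dealt round-robin into ten folds separately for each monitored day, so every fold contains records from every day.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Splits records, grouped by day, into k cross-validation folds such that
    // each fold receives records from every day.
    public final class FoldSplitter {
        public static List<List<String>> split(Map<String, List<String>> recordsByDay, int k) {
            List<List<String>> folds = new ArrayList<>();
            for (int i = 0; i < k; i++) folds.add(new ArrayList<>());
            for (List<String> day : recordsByDay.values()) {
                for (int i = 0; i < day.size(); i++) {
                    folds.get(i % k).add(day.get(i)); // round-robin within each day
                }
            }
            return folds;
        }
    }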

5.1 Data Transformation

After merging the data sets, it was necessary to determine how many unique values are present in each table column. These values are stated in Table 2.

As we have already stated, high-dimensional data increases the memory requirements of the anomaly detection algorithm. It is possible to reduce some of the attribute values in such a way that anomalies can still be detected on the reduced set. As an example, consider the 'Session Time' attribute and the process of reducing its values into intervals. These intervals are stated in Table 3.

Since the majority of records have a session time value of 00:00:01, it was decided to take this value as a standalone interval. The same holds for the value 00:00:02. Two other intervals cover longer sessions, but since not many values are present in each of these intervals, it was possible to make the reduction. After the reduction, the range of values in this case therefore changed from 884 to 4, which enables significantly faster data processing.

⁵http://www.ll.mit.edu/mission/communications/cyber/CSTcorpora/ideval/data/
⁶https://www.snort.org


Table 2: Occurrence of Unique Values in Merged Data Set.

  Attribute        | Unique Values | Min             | Max
  Date             | 10            | 07/20/1998      | 07/31/1998
  Time             | 63 299        | 00:00:00        | 23:59:59
  Session Time     | 884           | 00:00:01        | 17:50:36
  Service          | 4664          | n/a             | n/a
  Source Port      | 38 637        | -               | 65 404
  Destination Port | 7887          | -               | 33 442
  Source IP        | 1 565         | 000.000.000.000 | 209.154.098.104
  Destination IP   | 2 640         | 012.005.231.198 | 212.053.065.245
  Attack Occurred  | 2             | 0               | 1
  Attack Type      | 47            | n/a             | n/a
  Alert            | 61            | n/a             | n/a

Table 3: Intervals with the Highest Numbers of Occurrences.

  Session Time | 00:00:01 | 00:00:02 | (00:00:02, 01:00:00] | (01:00:00, 18:00:00)
  Occurrences  | 795 421  | 13 873   | 7 987                | 759

5.2 True Positive and False Positive Rates

After applying the generated rules and observing anomalies among transactions, we have to check how many anomalies were identified correctly. It is desirable to identify all the real anomalies while not labeling normal communication as anomalous. Therefore, we checked our results using the true positive rate (TPR) and false positive rate (FPR) formulas to get the outcomes of our algorithms implemented in Java and Hadoop. These formulas are stated in Equations 1 and 2, where FP = false positives, FN = false negatives, TP = true positives, TN = true negatives.

TPR = TP / (TP + FN)    (1)

FPR = FP / (FP + TN)    (2)
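These formulas (together with the error rate of Equation 3 in Section 5.3) translate directly into code; a minimal helper class (the class and method names are ours):

    // Confusion-matrix rates used in the evaluation (Equations 1-3).
    public final class Rates {
        public static double tpr(long tp, long fn) { return (double) tp / (tp + fn); }
        public static double fpr(long fp, long tn) { return (double) fp / (fp + tn); }
        public static double errorRate(long tp, long fp, long fn, long tn) {
            return (double) (fp + fn) / (tp + fp + fn + tn);
        }
    }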

Table 4 shows the TPR and FPR for the Java and Hadoop implementations after cross-validation over the ten testing sets. The results show that the Hadoop MapReduce technology helps to reduce false positives by 1% and increases true positives by 3%.


Table 4: True Positive and False Positive Rates for Java and Hadoop Implementations.

           Java                   Hadoop
           TPR       FPR          TPR       FPR
  Set 1    0.846989  0.083069     0.855534  0.073358
  Set 2    0.840526  0.072856     0.843952  0.06012
  Set 3    0.616683  0.080336     0.843952  0.060124
  Set 4    0.788934  0.068232     0.843952  0.060124
  Set 5    0.866599  0.084820     0.869818  0.073270
  Set 6    0.861434  0.080436     0.863636  0.072067
  Set 7    0.834294  0.076534     0.838545  0.063471
  Set 8    0.856624  0.074708     0.859672  0.063386
  Set 9    0.840459  0.073108     0.841219  0.058621
  Set 10   0.677185  0.078954     0.675759  0.068844

  Overall  0.802972  0.077305     0.833604  0.065339

5.3 Error Rate

The overall error rate of the algorithm can be determined by Equation 3.

Error rate = (FP + FN) / (TP + FP + FN + TN)    (3)

The anomaly detection algorithm was implemented both in Java and in Hadoop. Table 5 shows the values for both implementations; the Hadoop implementation has an error rate around 1% lower than the Java implementation. The table results are also depicted in Figure 4.

5.4 Processing Speed

An important factor in anomaly detection is the speed of both rule generation and data processing. We compared our algorithm with two other anomaly detection algorithms, Apriori and FP-Growth. The Apriori algorithm serves as a base for several rule-creation methods; its disadvantage is that it needs to process the data set several times. The FP-Growth algorithm uses tree-based storage for intermediate values. We used the Weka libraries⁷ for the implementations of these algorithms. The testing results are stated in Table 6. We can see that the Hadoop implementation was the fastest among the tested algorithms. We can therefore conclude that parallelization brings very good results in terms of speed to the rule generation process.

The speed of data set processing for anomaly detection is stated in Table 7. We compared the standard implementation in Java with the implementation in Hadoop.

⁷http://www.cs.waikato.ac.nz/ml/weka/


Table 5: Error Rate for Each Subset Using Java and Hadoop Implementations.

           Java      Hadoop
  Set 1    0.095     0.087
  Set 2    0.087     0.077
  Set 3    0.144     0.142
  Set 4    0.091     0.083
  Set 5    0.093     0.083
  Set 6    0.090     0.083
  Set 7    0.090     0.080
  Set 8    0.086     0.077
  Set 9    0.087     0.077
  Set 10   0.123     0.121

  Overall  0.098576  0.091465

Figure 4: Error Rate for Different Subsets Using Java and Hadoop Implementations.

Tests were performed on three data sets of sizes 10, 50, and 500 GB. As we can see, Hadoop can speed up this process more than ten times, even when using a single node. The Hadoop configuration was set to pseudo-distributed operation, which allowed it to run on a single node.


Table 6: Comparison of Rule Generation Speed.

  Algorithm | Java Implementation | Hadoop Implementation | Apriori | FP-Growth
  Time (s)  | 163.1               | 15.6                  | 226     | 93

Table 7: Comparison of Anomaly Detection Speed.

  Data Size                       | 10 GB  | 50 GB   | 500 GB
  Number of Records (in millions) | 84     | 423     | 851
  Java Implementation Time (s)    | 32 622 | 164 050 | 330 152
  Hadoop Implementation Time (s)  | 3 164  | 13 531  | 29 042

It is, of course, possible to add more nodes in order to improve throughput. We tested a 10 GB data set on a three-node cluster; one node was configured as master+slave, the other two nodes as slaves only. The running time was 2 040 s, which gives approximately 1.55 times better throughput than using a single node.

6 Conclusion

In this work we have analyzed data mining methods for anomaly detection and proposed an approach for discovering security breaches in log records. This approach generates rules dynamically from certain patterns in sample files and is able to learn new types of attacks, while minimizing the need for human intervention.

Our implementation utilizes the Apache Hadoop framework for distributed storage and distributed processing of data, allowing computations to run in parallel on several nodes and therefore speeding up the whole process. Compared to our plain Java implementation, the single-node Hadoop implementation performs more than ten times faster. Another optimization was the transformation of data into binary form, making it more efficient to analyze particular transactions.

For future work, we would like to investigate possibilities of identifying correlations among several network devices with automated methods.

Acknowledgement. This work was supported by VEGA grant 1/0722/12, “Security in distributed computer systems and mobile computer networks.”


References

[1] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM, 51(1):107–113, January 2008.

[2] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, pages 1–34. American Association for Artificial Intelligence, Menlo Park, CA, USA, 1996.

[3] A. Frei and M. Rennhard. Histogram Matrix: Log File Visualization for Anomaly Detection. In Availability, Reliability and Security, 2008 (ARES 08), Third International Conference on, pages 610–617, March 2008.

[4] Q. Fu, J.-G. Lou, Y. Wang, and J. Li. Execution Anomaly Detection in Distributed Systems through Unstructured Log Analysis. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, ICDM '09, pages 149–158, Washington, DC, USA, 2009. IEEE Computer Society.

[5] L. K. J. Grace, V. Maheswari, and D. Nagamalai. Web Log Data Analysis and Mining. In Natarajan Meghanathan, Brajesh Kumar Kaushik, and Dhinaharan Nagamalai, editors, Advanced Computing, volume 133 of Communications in Computer and Information Science, pages 459–469. Springer Berlin Heidelberg, 2011.

[6] Karen Kent and Murugiah P. Souppaya. SP 800-92: Guide to Computer Security Log Management. Technical report, NIST, Gaithersburg, MD, United States, 2006.

[7] Wenke Lee and Salvatore J. Stolfo. Data Mining Approaches for Intrusion Detection. In USENIX Security Symposium, 1998.

[8] A. Makanju, A. N. Zincir-Heywood, and E. E. Milios. Investigating Event Log Analysis with Minimum Apriori Information. In Integrated Network Management (IM 2013), 2013 IFIP/IEEE International Symposium on, pages 962–968, May 2013.

[9] M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo. Data Mining Methods for Detection of New Malicious Executables. In Security and Privacy, 2001 (S&P 2001), Proceedings of the 2001 IEEE Symposium on, pages 38–49, 2001.

[10] M. A. Siddiqui. Data Mining Methods for Malware Detection. ProQuest, 2011.


[11] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani. A Detailed Analysis of the KDD CUP 99 Data Set. In Proceedings of the Second IEEE International Conference on Computational Intelligence for Security and Defense Applications, CISDA '09, pages 53–58, Piscataway, NJ, USA, 2009. IEEE Press.

[12] R. Venkatesan. A Survey on Intrusion Detection using Data MiningTechniques. International Journal of Computers & Distributed Systems,2(1), 2012.

[13] R. Winding, T. Wright, and M. Chapple. System Anomaly Detection:Mining Firewall Logs. In Securecomm and Workshops, 2006, pages 1–5,Aug 2006.
