
Politecnico di Milano
Dipartimento di Elettronica e Informazione

DOTTORATO DI RICERCA IN INGEGNERIA DELL'INFORMAZIONE

Integrated Detection of Anomalous Behavior of Computer Infrastructures

Doctoral Dissertation of:
Federico Maggi

Advisor:
Prof. Stefano Zanero

Tutor:
Prof. Letizia Tanca

Supervisor of the Doctoral Program:
Prof. Patrizio Colaneri

— XXII

Politecnico di Milano
Dipartimento di Elettronica e Informazione
Piazza Leonardo da Vinci 32, I-20133 — Milano


Preface

This thesis embraces all the efforts that I put in during the last three years as a PhD student at Politecnico di Milano. I have been working under the supervision of Prof. S. Zanero and Prof. G. Serazzi, who is also the leader of the research group I am part of. In this time frame I had the wonderful opportunity of being "initiated" to research, which radically changed the way I look at things: I found my natural "thinking outside the box" attitude — that was probably well-hidden under a thick layer of lack-of-opportunities; I took part in very interesting joint works — among which the year I spent at the Computer Security Laboratory at UC Santa Barbara comes first — and I discovered the Zen of my life.

My research is all about computers and every other technology possibly related to them. Clearly, the way I look at computers has changed a bit since I was seven. Still, I can remember myself, typing on that Commodore in front of a tube TV screen, trying to get that d—n routine written in BASIC to work. I was just playing, obviously, but when I recently found a picture of me in front of that screen...it all became clear.

    **** COMMODORE 64 BASIC V2 ****
    64K RAM SYSTEM  38128 BASIC BYTES FREE

READY.

BENVENUTO NEL COMPUTER DI FEDERICO.

SCRIVI LA PAROLA D’ORDINE:

That says WELCOME TO FEDERICO'S COMPUTER. TYPE IN THE PASSWORD. So, although my attempt at writing a program to authenticate myself was a little bit naive — my knowledge being limited to a print instruction up to that point, of course — I thought "maybe I am not in the wrong place, and the fact that my research is still about security is a good sign"!

Many years later, this work comes to life. There is a humongous amount of people that, directly or indirectly, have contributed to my research and, in particular, to this work. Since my first step into the lab, I will not, ever, be thankful enough to Stefano who, despite my skepticism, convinced me to submit that application for the PhD program. For trusting me since the very first moment I am thankful to Prof. G. Serazzi as well, who has been always supportive. For hosting and supporting my research abroad I thank Prof. G. Vigna, Prof. C. Kruegel, and Prof. R. Kemmerer. Also, I wish to thank Prof. M. Matteucci for the great collaboration, Prof. I. Epifani for her insightful suggestions, and Prof. H. Bos for the detailed review and the constructive comments.

On the colleagues' side of these acknowledgments I put all the fellows of Room , Guido, the crew of the seclab and, in particular, Wil, with whom I shared all the pain of paper writing between Sept ’ and Jun ’.

On the friends' side of this list, Lorenzo and Simona go first, for being our family.

I have tried to translate into simple words the infinite gratitude I have and will always have to Valentina and my parents for being the fixed point in my life. Obviously, I failed.

Federico Maggi
Milano, September


Abstract

This dissertation details our research on anomaly detection techniques, which are central to several classic security-related tasks such as network monitoring, but also have broader applications such as program behavior characterization or malware classification. In particular, we worked on anomaly detection from three different perspectives, with the common goal of recognizing anomalous activity on computer infrastructures. In fact, a computer system has several weak spots that must be protected to prevent attackers from taking advantage of them. We focused on protecting the operating system, central to any computer, to prevent malicious code from subverting its normal activity. Secondly, we concentrated on protecting web applications, which can be considered the modern, shared operating systems; because of their immense popularity, they have indeed become the most targeted entry point to violate a system. Last, we experimented with novel techniques with the aim of identifying related events (e.g., alerts reported by intrusion detection systems) to build new and more compact knowledge to detect malicious activity on large-scale systems.

Our contributions regarding host-based protection systems focus on characterizing a process' behavior through the system calls invoked into the kernel. In particular, we engineered and carefully tested different versions of a multi-model detection system using both stochastic and deterministic models to capture the features of the system calls during normal operation of the operating system. Besides demonstrating the effectiveness of our approaches, we confirmed that the use of finite-state, deterministic models allows detecting deviations from the process' control flow with the highest accuracy; moreover, our contribution combines this effectiveness with advanced models for the system calls' arguments, resulting in a significantly decreased number of false alarms.

Our contributions regarding web-based protection systems focus on advanced training procedures to enable learning systems to perform well even in presence of changes in the web application source code — particularly frequent in the Web 2.0 era. We also addressed data scarcity issues, which are a real problem when deploying an anomaly detector to protect a new, never-used-before application. Both these issues dramatically decrease the detection capabilities of an intrusion detection system, but they can be effectively mitigated by adopting the techniques we propose.

Last, we investigated the use of different stochastic and fuzzy models to perform automatic alert correlation, which is a post-processing step to intrusion detection. We proposed a fuzzy model that formally defines the errors that inevitably occur if time-based alert aggregation (i.e., two alerts are considered correlated if they are close in time) is used. This model allows accounting for measurement errors and avoiding false correlations due, for instance, to delays or incorrect parameter settings. In addition, we defined a model that describes the alert generation as a stochastic process, and we experimented with non-parametric statistical tests to define robust, zero-configuration correlation systems.

The aforementioned tools have been tested over different datasets — which are thoroughly documented in this document — and lead to interesting results.


Sommario

This thesis describes in detail our research on anomaly detection techniques. Such techniques are fundamental for solving classic security-related problems, such as the monitoring of a network, but they also have broader applications, such as the analysis of the behavior of a process in a system or the classification of malware. In particular, our work concentrates on three different perspectives, with the common goal of detecting suspicious activity in a computer system. Indeed, a computer system has several weak spots that must be protected to prevent an attacker from taking advantage of them. We concentrated on the protection of the operating system, present in any computer, to prevent a program from altering its operation. Secondly, we concentrated on the protection of web applications, which can be considered the modern, global operating system; in fact, their immense popularity has made them the preferred target for violating a system. Finally, we experimented with new techniques to identify relations between events (e.g., alerts reported by intrusion detection systems), with the goal of building new knowledge to detect suspicious activity on large-scale systems.

Regarding host-based anomaly detection systems, we focused on characterizing the behavior of processes based on the stream of system calls invoked in the kernel. In particular, we engineered and carefully evaluated several versions of a multi-model anomaly detection system that uses both stochastic and deterministic models to capture the characteristics of the system calls during normal operation of the operating system. Besides demonstrating the effectiveness of our approaches, we confirmed that the use of deterministic finite-state models makes it possible to detect with extreme accuracy when a process deviates significantly from its normal control flow; moreover, the approach we propose combines this effectiveness with advanced stochastic models of the system call arguments to significantly decrease the number of false alarms.

Regarding the protection of web applications, we focused on advanced training procedures. The goal is to allow systems based on unsupervised learning to work correctly even in presence of changes in the code of the web applications — a phenomenon particularly frequent in the Web 2.0 era. We also tackled the problems caused by the scarcity of training data, a more than realistic obstacle especially if the application to be protected has never been used before. Both problems result in a dramatic decrease of the detection capabilities of the tools, but they can be effectively mitigated by adopting the techniques we propose.

Finally, we investigated the use of different models, both stochastic and fuzzy, for automatic alert correlation, the phase that follows intrusion detection. We proposed a fuzzy model that formally defines the errors that inevitably occur when correlation algorithms based on distance in time are adopted (i.e., two alerts are considered correlated if they were reported more or less at the same instant of time). This model makes it possible to also account for measurement errors and to avoid incorrect decisions in case of propagation delays. Moreover, we defined a model that describes alert generation as a stochastic process, and we experimented with non-parametric tests to define correlation criteria that are robust and require no configuration.

Contents

Introduction
    Todays' Security Threats
        The Role of Intrusion Detection
    Original Contributions
        Host-based Anomaly Detection
        Web-based Anomaly Detection
        Alert Correlation
    Document Structure

Detecting Malicious Activity
    Intrusion Detection
        Evaluation
        Alert Correlation
        Taxonomic Dimensions
    Relevant Anomaly Detection Techniques
        Network-based techniques
        Host-based techniques
        Web-based techniques
    Relevant Alert Correlation Techniques
    Evaluation Issues and Challenges
        Regularities in audit data of IDEVAL
        The base-rate fallacy
    Concluding Remarks

Host-based Anomaly Detection
    Preliminaries
    Malicious System Calls Detection
        Analysis of SyscallAnomaly
        Improving SyscallAnomaly
        Capturing process behavior
        Prototype implementation
        Experimental Results
    Mixing Deterministic and Stochastic Models
        Enhanced Detection Models
        Experimental Results
    Forensics Use of Anomaly Detection Techniques
        Problem statement
        Experimental Results
    Concluding Remarks

Anomaly Detection of Web-based Attacks
    Preliminaries
        Anomaly Detectors of Web-based Attacks
        A Comprehensive Detection System to Mitigate Web-based Attacks
        Evaluation Data
    Training With Scarce Data
        Non-uniformly distributed training data
        Exploiting global knowledge
        Experimental Results
    Addressing Changes in Web Applications
        Web Application Concept drift
        Addressing concept drift
        Experimental Results
    Concluding Remarks

Network and Host Alert Correlation
    Fuzzy Models and Measures for Alert Fusion
        Time-based alert correlation
    Mitigating Evaluation Issues
        A common alert representation
        Proposed Evaluation Metrics
        Experimental Results
    Using Non-parametric Statistical Tests
        The Granger Causality Test
        Modeling alerts as stochastic processes
    Concluding Remarks

Conclusions

Bibliography

Index

List of Acronyms

List of Figures

List of Tables

Introduction

Network connected devices such as personal computers, mobile phones, or gaming consoles are nowadays enjoying immense popularity. In parallel, the Web and the humongous amount of services it offers have certainly become the most ubiquitous tools of all times. Facebook counts more than millions of active users, of which millions are using it on mobile devices; not to mention that more than billion photos are uploaded to the site each month [Facebook, ]. And this is just one, popular website. One year ago, Google estimated that the approximate number of unique Uniform Resource Locators (URLs) is trillion [Alpert and Hajaj, ], while YouTube has stocked more than million videos as of March , with ,, views just on the most popular video as of January [Singer, ]. And people from all over the world inundate the Web with more than million tweets per day. Not only has the Web 2.0 become predominant; in fact, thinking that on December the Internet was made of one site and that today it counts more than million sites is just astonishing [Zakon, ].

The Internet and the Web are huge [Miniwatts Marketing Grp., ]. The relevant fact, however, is that they both became the most advanced workplace. Almost every industry connected its own network to the Internet and relies on these infrastructures for a vast majority of transactions; most of the time, monetary transactions. As an example, every year Google loses approximately millions of US Dollars in ignored ads because of the "I'm feeling lucky" button. The scary part is that, during their daily work activities, people typically pay little or no attention to the risks that derive from exchanging any kind of information over such a complex, interconnected infrastructure. This is demonstrated by the effectiveness of social engineering [Mitnick, ] scams carried over the Internet or the phone [Granger, ]. Recall that of the phishing is related to finance. Now, compare this landscape to what the most famous security quote states.

"The only truly secure computer is one buried in concrete, with the power turned off and the network cable cut."

— Anonymous

In fact, the Internet is all but a safe place [Ofer Shezaf and Jeremiah Grossman and Robert Auger, ], with more than , known data breaches between and [Clearinghouse, ] and an estimate of ,, records stolen by intruders. One may wonder why the advance of research in computer security and the increased awareness of governments and public institutions are still not capable of avoiding such incidents. Besides the fact that the aforementioned numbers would be orders of magnitude higher in absence of countermeasures, today's security issues are, basically, caused by the combination of two phenomena: the high amount of software vulnerabilities and the effectiveness of today's exploitation strategy.

software flaws — (un)surprisingly, software is affected by vulnerabilities. Incidentally, tools that have to do with the Web, namely browsers and third-party extensions, and web applications, are the most vulnerable ones. For instance, in , Secunia reported around security vulnerabilities for Mozilla Firefox, and for Internet Explorer's ActiveX [Secunia, ]. Office suites and e-mail clients, which are certainly the must-have tools installed on every workstation, hold the second position [The SANS Institute, ].


massification of attacks — in parallel to the explosion of the Web 2.0, attackers and the underground economy have quickly learned that a sweep of exploits run against every reachable host has more chances to find a vulnerable target and, thus, is much more profitable compared to a single effort to break into a high-value, well-protected machine.

These circumstances have initiated a vicious circle that provides the attackers with a very large pool of vulnerable targets. Vulnerable client hosts are compromised to ensure virtually unlimited bandwidth and computational resources to attackers, while server-side applications are violated to host malicious code used to infect client visitors. And so forth. An old fashioned attacker would have violated a single site using all the resources available, stolen data and sold it to the underground market. Instead, a modern attacker adopts a "vampire" approach and exploits client-side software vulnerabilities to take (remote) control of millions of hosts. In the past, the diffusion of malicious code such as viruses was sustained by the sharing of infected, cracked software through floppy or compact disks; nowadays, the Web offers unlimited, public storage to attackers, who deploy their exploits on compromised websites.

Thus, not only has the type of vulnerabilities changed, putting virtually every interconnected device at risk; the exploitation strategy has also created new types of threats that take advantage of classic malicious code patterns, but in a new, extensive, and tremendously effective way.

Todays' Security Threats

Every year, new threats are discovered and attackers take advantage of them until effective countermeasures are found. Then, new threats are discovered, and so forth. Symantec quantifies the amount of new malicious code threats to be ,, as of [Turner et al., ], , one year earlier, and only , in . Thus, countermeasures must advance at least at the same growth rate. In addition:

[...] the current threat landscape — such as the increasing complexity and sophistication of attacks, the evolution of attackers and attack patterns, and malicious activities being pushed to emerging countries — show not just the benefits of, but also the need for increased cooperation among security companies, governments, academics, and other organizations and individuals to combat these changes [Turner et al., ].

Figure: Illustration taken from [Holz, ], ©IEEE. Authorized license limited to Politecnico di Milano.

Todays' underground economy runs a very proficient market: everyone can buy credit card information for as low as $.–$, full identities for just $.–$, or rent a scam hosting solution for $–$ per week plus $–$ for the design [Turner et al., ].

The main underlying technology actually employs a classic type of software called bot (jargon for robot), which is not malicious per se, but is used to remotely control a network of compromised hosts, called a botnet [Holz, ]. Remote commands can be of any type and typically include launching an attack, starting a phishing or spam campaign, or even updating to the latest version of the bot software by downloading the binary code from a host controlled by the attackers (usually called bot master) [Stone-Gross et al., ]. The exchange good has now become the botnet infrastructure itself, rather than the data that can be stolen or the spam that can be sent. These are mere outputs of today's most popular service offered for rent by the underground economy.

The Role of Intrusion Detection

The aforementioned, dramatic big picture may lead to think that malicious software will eventually proliferate at every host of the Internet and that no effective remediation exists. However, a more careful analysis reveals that, despite the complexity of this scenario, the problems that must be solved by a security infrastructure can be decomposed into relatively simple tasks that, surprisingly, may already have a solution. Let us look at an example.

Example .. This is how a sample exploitation can be structured:

injection — a malicious request is sent to the vulnerable web application with the goal of corrupting all the responses sent to legitimate clients from that moment on. For instance, more than one release of the popular WordPress blog application is vulnerable to injection attacks that allow an attacker to permanently include arbitrary content in the pages. Typically, such arbitrary content is malicious code (e.g., JavaScript, VBScript, ActionScript, ActiveX) that, every time a legitimate user requests the infected page, executes on the client host.

infection — Assuming that the compromised site is frequently accessed — this might be the realistic case of the WordPress-powered ZDNet news blog (http://wordpress.org/showcase/zdnet/) — a significant amount of clients visit it. Due to the high popularity of vulnerable browsers and plug-ins (see, e.g., http://secunia.com/advisories/), the client may run Internet Explorer — which is the most popular — or an outdated release of Firefox on Windows. This creates the perfect circumstances for the malicious page to successfully execute. In the best case, it may download a virus or a generic malware from a website under control of the attacker, so infecting the machine. In the worst case, this code may also exploit specific browser vulnerabilities and execute in privileged mode.

control & use — The malicious code just downloaded installs and hides itself onto the victim's computer, which has just joined a botnet. As part of it, the client host can be remotely controlled by the attackers who can, for instance, rent it out, or use its bandwidth and computational power along with other computers to run a distributed Denial of Service (DoS) attack. Also, the host can be used to automatically perform the same attacks described above against other vulnerable web applications. And so forth.

This simple yet quite realistic example shows the various kinds of malicious activity that are generated during a typical drive-by exploitation. It also shows its requirements and the assumptions that must hold to guarantee success. More precisely, we can recognize:

network activity — clearly, the whole interaction relies on a network connection over the Internet: the HyperText Transfer Protocol (HTTP) connections used, for instance, to download the malicious code, as well as to launch the injection attack used to compromise the web server.

host activity — similarly to every other type of attack against an application, when the client-side code executes, the browser (or one of its extension plug-ins) is forced to behave improperly. If the malicious code executes till completion, the attack succeeds and the host is infected. This happens only if the platform, operating system, and browser all match the requirements assumed by the exploit designer. For instance, the attack may succeed on Windows and not on Mac OS X, although the vulnerable version of, say, Firefox is the same on both hosts.

HTTP traffic — in order to exploit the vulnerability of the web application, the attacking client must generate malicious HTTP requests. For instance, in the case of a Structured Query Language (SQL) injection — which is the second most common vulnerability in a web application — instead of a regular

    GET /index.php?username=myuser

the web server might be forced to process a

    GET /index.php?username=' OR 'x'='x'--&content=<script src="evil.com/code.js">

that causes the index.php page to behave improperly.

It is now clear that protection mechanisms that analyze the network traffic, the activity of the client's operating system, the web server's HTTP logs, or any combination of the three, have chances of recognizing that something malicious is happening in the network. For instance, if the Internet Service Provider (ISP) network adopts Snort, a lightweight Intrusion Detection System (IDS) that analyzes the network traffic for known attack patterns, it could block all the packets marked as suspicious. This would prevent, for instance, the SQL injection from reaching the web application. A similar protection level can be achieved by using other tools such as ModSecurity [Ristic, ]. One of the problems that may arise with these classic, widely adopted solutions is if a zero-day attack is used. A zero-day attack or threat exploits a vulnerability that is unknown to the public, undisclosed to the software vendor, or for which a fix is not available; thus, protection mechanisms that merely blacklist known malicious activity immediately become ineffective. In a similar vein, if the client is protected by an anti-virus, the infection phase can be blocked. However, this countermeasure is once again successful only if the anti-virus is capable of recognizing the malicious code, which assumes that the code is known to be malicious.
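To make the signature-matching idea concrete, the following sketch shows how a blacklist-style detector could flag the malicious request shown above. It is a toy illustration in Python, not Snort's or ModSecurity's actual rule language; the pattern and the function names are ours.

    import re

    # Naive misuse signature for the SQL injection shown above: a quote
    # followed by OR and a tautology such as 'x'='x'.
    SQLI_SIGNATURE = re.compile(r"'\s*OR\s*'[^']*'\s*=\s*'[^']*'", re.IGNORECASE)

    def is_suspicious(request_line):
        # Return True if the request matches the blacklisted pattern.
        return SQLI_SIGNATURE.search(request_line) is not None

    print(is_suspicious("GET /index.php?username=myuser"))                  # False
    print(is_suspicious("GET /index.php?username=' OR 'x'='x'--&content=")) # True

Note how such a signature embodies the blacklisting limitation discussed above: a zero-day variant that encodes the tautology differently would slip through unnoticed.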

Ideally, an effective and comprehensive countermeasure can be achieved if all the protection tools involved (e.g., client-side, server-side, network-side) can collaborate together. For instance, if a website is publicly reported to be malicious, a client-side protection tool should block all the content downloaded from that particular website. This is only a simple example.

Thus, countermeasures against today's threats already exist, but they are subject to at least two drawbacks:

• they offer protection only against known threats. To be effective, we must assume that all the hostile traffic can be enumerated, which is clearly an impossible task.


Why is "Enumerating Badness" a dumb idea? It's a dumb idea because sometime around the amount of Badness in the Internet began to vastly outweigh the amount of Goodness. For every harmless, legitimate application, there are dozens or hundreds of pieces of malware, worm tests, exploits, or viral code. Examine a typical antivirus package and you'll see it knows about ,+ viruses that might infect your machine. Compare that to the legitimate or so apps that I've installed on my machine, and you can see it's rather dumb to try to track , pieces of Badness when even a simpleton could track pieces of Goodness [Ranum, ].

• they lack cooperation, which is crucial to detect global and slow attacks.

This said, we conclude that classic approaches such as dynamic and static code analysis and IDSs already offer good protection, but industry and research should move toward methods that require little or no knowledge. In this work, we indeed focus on the so-called anomaly-based approaches, i.e., those that attempt to recognize the threats by detecting any variation from a system's normal operation, rather than looking for signs of known-to-be-malicious activity.

Original Contributions

Our main research area is Intrusion Detection (ID). In particular, we focus on anomaly-based approaches to detect malicious activities. Since today's threats are complex, a single point of inspection is not effective. A more comprehensive monitoring system is desirable to protect the network, the applications running on a certain host, and the web applications (which are particularly exposed due to the immense popularity of the Web). Our contributions focus on the mitigation of both host-based and web-based attacks, along with two techniques to correlate alerts from hybrid sensors.


Host-based Anomaly Detection

Typical malicious processes can be detected by modeling the characteristics (e.g., type of arguments, sequences) of the system calls executed by the kernel, and by flagging unexpected deviations as attacks; a toy sketch of this idea follows the list below. Regarding this type of approaches, our contributions focus on hybrid models to accurately characterize the behavior of a binary application. In particular:

• we enhanced, re-engineered, and evaluated a novel tool for modeling the normal activity of the Linux 2.6 kernel. Compared to other existing solutions, our system shows better detection capabilities and good contextualization of the alerts reported. These results are detailed in Section ..

• We engineered and evaluated an IDS to demonstrate that the combined use of (1) deterministic models to characterize a process' control flow and (2) stochastic models to capture normal features of the data flow leads to better detection accuracy. Compared to the existing deterministic and stochastic approaches taken separately, our system shows better accuracy, with almost zero false positives. These results are detailed in Section ..

• We adapted our techniques for forensics investigation. By running experiments on real-world data and attacks, we show that our system is able to detect hidden tamper evidence even though sophisticated anti-forensics tools (e.g., userland process execution) have been used. These results are detailed in Section ..
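As a toy illustration of the sequence-based idea announced above, the following Python sketch memorizes the sliding windows of system calls seen during training and flags unseen windows at detection time. This is a generic n-gram model in the spirit of the classic literature, not the dissertation's actual detection models.

    # Train: memorize every window of N consecutive system calls.
    N = 3

    def train(traces):
        normal = set()
        for trace in traces:
            for i in range(len(trace) - N + 1):
                normal.add(tuple(trace[i:i + N]))
        return normal

    # Detect: flag every window that was never observed during training.
    def anomalous_windows(trace, normal):
        return [tuple(trace[i:i + N])
                for i in range(len(trace) - N + 1)
                if tuple(trace[i:i + N]) not in normal]

    model = train([["open", "read", "write", "close"],
                   ["open", "read", "read", "close"]])
    print(anomalous_windows(["open", "read", "execve", "close"], model))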

Web-based Anomaly Detection

Attempts at compromising a web application can be detected by modeling the characteristics (e.g., parameter values, character distributions, session content) of the HTTP messages exchanged between servers and clients during normal operation. This approach can detect virtually any attempt of tampering with HTTP messages, which is assumed to be evidence of attack; a toy sketch of one such feature model closes this subsection. In this research field, our contributions focus on training data scarcity issues, along with the problems that arise when an application changes its legitimate behavior. In particular:


• we contributed to the development of a system that learns the legitimate behavior of a web application. Such a behavior is defined by means of features extracted from (1) HTTP requests, (2) HTTP responses, and (3) SQL queries to the underlying database, if any. Each feature is extracted and learned by using different models, some of which are improvements over well-known approaches and some others are original. The main contribution of this work is the combination of database query models with HTTP-based models. The resulting system has been validated through preliminary experiments that showed very high accuracy. These results are detailed in Section ...

• we developed a technique to automatically detect legitimate changes in web applications, with the goal of suppressing the large amount of false detections due to code upgrades, frequent in today's web applications. We ran experiments on real-world data to show that our simple but very effective approach accurately predicts changes in web applications and can distinguish good vs. malicious changes (i.e., attacks). These results are detailed in Section ..

• We designed and evaluated a machine learning technique to aggregate IDS models, with the goal of ensuring good detection accuracy even in case of scarce training data. Our approach relies on clustering techniques and nearest-neighbor search to look up well-trained models used to replace under-trained ones, which are prone to overfitting and thus to false detections. Experiments on real-world data have shown that almost every false alert due to overfitting is avoided with as low as – training samples per model. These results are described in Section ..

Although these techniques have been developed on top of aweb-based anomaly detector, they are sufficiently generic to be eas-ily adapted to other systems using learning approaches.
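As a toy illustration of the kind of feature model mentioned above, the sketch below learns the length of a string parameter during training and scores deviations with the Chebyshev inequality, in the style of the web anomaly detection literature; the training values and the scoring choice are illustrative assumptions, not this dissertation's actual models.

    # Learn mean and variance of a parameter's length during training.
    def train_length_model(samples):
        n = len(samples)
        mean = sum(len(s) for s in samples) / n
        var = sum((len(s) - mean) ** 2 for s in samples) / n
        return mean, var

    # Chebyshev bound: P(|X - mean| >= d) <= var / d^2.
    # A length whose bound is tiny is very unlikely, hence anomalous.
    def length_anomaly_score(value, mean, var):
        d = abs(len(value) - mean)
        if d == 0:
            return 0.0
        return 1.0 - min(1.0, var / (d * d))

    mean, var = train_length_model(["alice", "bob", "charlie"])
    print(length_anomaly_score("carol", mean, var))                 # low score
    print(length_anomaly_score("' OR 'x'='x'-- " * 10, mean, var))  # close to 1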

Alert Correlation

IDS alerts are usually post-processed to generate compact reports and eliminate redundant, meaningless, or false detections. In this research field, our contributions focus on unsupervised techniques applied to aggregate and correlate alert events, with the goal of reducing the effort of the security officer. In particular:

• We developed and tested an approach that accounts for the common measurement errors (e.g., delays and uncertainties) that occur in the alert generation process. Our approach exploits fuzzy metrics both to model errors and to construct an alert aggregation criterion based on distance in time; a toy sketch of this idea follows the list. This technique has been shown to be more robust compared to classic time-distance based aggregation metrics. These results are detailed in Section ..

• We designed and tested a prototype that models the alert generation process as a stochastic process. This setting allowed us to construct a simple, non-parametric hypothesis test that can detect whether two alert streams are correlated or not. Besides its simplicity, the advantage of our approach is that it does not require any parameter. These results are described in Section ..
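To give a feel for the time-based criterion mentioned in the first item above, the following sketch contrasts a crisp time window with a smooth membership function. The triangular shape and the thresholds are illustrative assumptions, not the dissertation's actual fuzzy model.

    W = 5.0  # correlation window, in seconds (illustrative)

    # Crisp criterion: correlated iff the timestamps differ by less than W.
    def crisp_correlated(t1, t2):
        return abs(t1 - t2) < W

    # Fuzzy criterion: the degree of correlation decays linearly with the
    # time distance, tolerating delays and measurement errors.
    def fuzzy_correlation_degree(t1, t2):
        return max(0.0, 1.0 - abs(t1 - t2) / (2 * W))

    print(crisp_correlated(10.0, 15.2))          # False: 0.2s over the window
    print(fuzzy_correlation_degree(10.0, 15.2))  # 0.48: still fairly correlated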

The aforementioned results have been published in the proceedings of international conferences and in international journals.

Document Structure

This document is structured as follows. Chapter introduces ID, which is the topic of our research. In particular, Chapter rigorously describes all the basic components that are necessary to define the ID task and an IDS. The reader with knowledge on this subject may skip the first part of the chapter and focus on Sections . and ., which include a comprehensive review of the most relevant and influential modern approaches on network-, host-, and web-based ID techniques, along with a separate overview of the alert correlation approaches.

As described in Section ., the description of our contributions is structured into three chapters. Chapter focuses on host-based techniques, Chapter regards web-based anomaly detection, while Chapter describes two techniques that allow recognizing relations between alerts reported by network- and host-based systems. Reading Section .. is recommended before reading Chapter .


The reader interested in protection techniques for the operating system can skim through Section .. and then read Chapter . The reader with interests in web-based protection techniques can read Section .. and then Chapter . Similarly, the reader interested in alert correlation systems can skim through Sections .. and .. and then read Chapter .

Detecting Malicious Activity

Malicious activity is a generic term that refers to automated or manual attempts of compromising the security of a computer infrastructure. Examples of malicious activity include the output generated (or its effect on a system) by the execution of scripts by simple script kiddies, viruses, DoS attacks, exploits of Cross-Site Scripting (XSS) or SQL injection vulnerabilities, and so forth. This chapter describes the research tools and methodologies available to detect and mitigate malicious activity on a network, a single host, a web server, or the combination of the three.

First, the background concepts and the ID problem are described in this chapter, along with a taxonomic description of the most relevant aspects of an IDS. Secondly, a detailed survey of selected state-of-the-art anomaly detection approaches is presented with the help of further classification dimensions. In addition, the problem of alert correlation, which is an orthogonal topic, is described, and the most relevant, recent research approaches are overviewed. Last but not least, the problem of evaluating an IDS is presented to provide the essential terminology and criteria to understand the effectiveness of both the reviewed literature and our original contributions.


Figure: Generic and comprehensive deployment scenario for IDSs. An ISP network connected to the Internet and to a broadband network, serving clients, customers' clients, and virtually hosted customers' servers and DBs; the legend marks the deployable protection mechanisms: anti-malware, host-based IDS, network-based IDS, web-based IDS.

Intrusion Detection

ID is the practice of analyzing a system to find malicious activity. This section defines this concept more precisely by means of simple but rigorous definitions. The context of such definitions is the generic scenario of an Internet site, for instance, the network of an ISP. An example is depicted in Figure .. An Internet site — and, in general, the Internet itself — is the state-of-the-art computer infrastructure. In fact, it is a network that adopts almost any kind of known computer technology (e.g., protocols, standards, containment measures), and it runs a rich variety of servers such as HTTP, File Transfer Protocol (FTP), Secure SHell (SSH), and Virtual Private Network (VPN) to support a broad spectrum of sophisticated applications and services (e.g., web applications, e-commerce applications, the Facebook Platform, Google Applications). In addition, a generic Internet site receives and processes traffic from virtually any user connected to the Net and thus represents the perfect research testbed for an IDS.

Definition .. (System) A system is the domain of interest forsecurity.

Note that a system can be a physical or a logical one. For instance,


a physical system could be a host (e.g., the DataBase (DB) server or one of the client machines shown in the figure) or a whole network (e.g., the ISP network shown in the figure); a logical system can be an application or a service, such as one of the web services run in a virtual machine deployed in the ISP network. While running, each system produces activity, which we define as follows.

Definition .. (System activity) A system activity is any data generated while the system is working. Such activity is formalized as a sequence of events I = [I1, I2, . . . , Ii, . . . , IN].

For instance, each of the clients of Figure . produces system logs: in this case I would contain an Ii for each log entry. A human readable representation follows.

    [...] chatWithContact:(null)] got a nil targetContact.
    Aug 18 00:29:40 [0x0-0x1b01b].org.mozilla.firefox[0]: NPP_Initialize called
    Aug 18 00:29:40 [0x0-0x1b01b].org.mozilla.firefox[0]: 2009-08-18 00:29:40.039 firefox-bin[254:10b] NPP_New(instance=0x169e0178,mode=2,argc=5)
    Aug 18 00:29:40 [0x0-0x1b01b].org.mozilla.firefox[0]: 2009-08-18 00:29:40.052 firefox-bin[254:10b] NPP_NewStream end=396239

Similarly, the web server generates HTTP requests and responses: in this case I would contain an Ii for each HTTP message. Its human readable representation follows.

    [...] /media//images/favicon.ico HTTP/1.0" 200 1150 "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.10) Gecko/2009042315 Firefox/3.0.10 Ubiquity/0.1.4"
    128.111.48.4 [20/May/2009:15:26:44 -0700] "POST /report/ HTTP/1.0" 200 19171 "http://www.phonephishing.info/report/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.10) Gecko/2009042315 Firefox/3.0.10 Ubiquity/0.1.4"
    128.111.48.4 [20/May/2009:15:26:44 -0700] "GET /media//css/main.css HTTP/1.0" 200 5525 "http://www.phonephishing.info/report/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.10) Gecko/2009042315 Firefox/3.0.10 Ubiquity/0.1.4"
    128.111.48.4 [20/May/2009:15:26:44 -0700] "GET /media//css/roundbox.css HTTP/1.0" 200 731 "http://www.phonephishing.info/media//css/main.css" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.10) Gecko/2009042315 Firefox/3.0.10 Ubiquity/0.1.4"

Other examples of system activity are described in Section ... In this document, we often use the term normal behavior, referring to a set of characteristics (e.g., the distribution of the characters of string parameters, the mean and standard deviation of the values of integer parameters) extracted from the system activity gathered during normal operation (i.e., without being compromised). Moreover, in the remainder of this document, we need other definitions.

Definition .. (Activity Profile) The activity profile (or activity model) Î is a set of models

Î = ⟨m(1), . . . , m(u), . . . , m(U)⟩

generated by extracting features from the system activity I.

This definition will be used in Section ... Examples .. and .. describe an instance of m(u). An example of a real-world profile is described in Example ... We can now define the following.

Definition .. (System Behavior) The system behavior is the set of features (or models), along with their numeric values, extracted by (or contained in) the activity profile.

In particular, we will use this term as a synonym of normal system behavior, referring to the system behavior during normal operation.
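A minimal sketch may help to read the two definitions above: an activity profile is a collection of models, each capturing one feature of the activity, and the system behavior is what those trained models express. All names below are hypothetical, not the dissertation's actual models.

    # Two toy feature models: observed string lengths and observed characters.
    class LengthModel:
        def fit(self, events):
            self.lengths = {len(e) for e in events}
        def score(self, event):
            return 0.0 if len(event) in self.lengths else 1.0

    class CharsetModel:
        def fit(self, events):
            self.chars = set("".join(events))
        def score(self, event):
            return 0.0 if set(event) <= self.chars else 1.0

    # The profile <m(1), ..., m(U)> of the definition: one model per feature.
    class Profile:
        def __init__(self, models):
            self.models = models
        def fit(self, events):
            for m in self.models:
                m.fit(events)
        def score(self, event):
            # The most anomalous feature dominates the overall score.
            return max(m.score(event) for m in self.models)

    profile = Profile([LengthModel(), CharsetModel()])
    profile.fit(["alice", "bob"])           # normal activity
    print(profile.score("celia"))           # 0.0: fits the learned behavior
    print(profile.score("' OR 'x'='x'--"))  # 1.0: deviates from the behavior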

Given the high accessibility of the Internet, publicly available systems such as web servers, web applications, and DB servers are constantly at risk. In particular, they can be compromised with the goal of stealing valuable information, deploying infection kits, or running phishing and spam campaigns. These are all examples of intrusions. More formally.

Definition .. (Intrusion) An intrusion is the automated or manual act of violating one or more security paradigms (i.e., confidentiality, integrity, availability) of a system. Intrusions are formalized as a sequence of events:

O = [O1, O2, . . . , Oi, . . . , OM] ⊆ I

Typically, when an intrusion takes place, a system behaves unexpectedly and, as a consequence, its activity differs from the one observed during normal operation. This is because an attack or the execution of malware code often exploits vulnerabilities to bring the system into states it was not designed for. For this reason, the activity that the system generates is called malicious activity; often, this term is also used to indicate the attack or the malware code execution itself. An example of Oi is the following log entry: it shows evidence of an XSS attack that will make a vulnerable page display the arbitrary content supplied as part of the GET request (while the page was not intentionally designed for this purpose).

    /report/add/comment/<DIV STYLE="background-image:\0075\0072\006C\0028'\006a\0061\0076\0061\0073\0063\0072\0069\0070\0074\003a\0061\006c\0065\0072\0074\0028.1027\0058.1053\0053\0027\0029'\0029">/ HTTP/1.0" 200 731 "http://www.phonephishing.info/report/add/" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.10) Gecko/2009042315 Firefox/3.0.10 Ubiquity/0.1.4"

Note .. We adopt a simplified representation of intrusions with respect to a dataset D (i.e., both normal and malicious): with O ⊆ D we indicate that the activity I contains malicious events O; however, strictly speaking, intrusion events and activity events can be of completely different types, thus the "⊆" relation may not be defined in a strict mathematical sense.

If a system is well-designed, any intrusion attempt always leaves some traces in the system's activity. These traces are called tamper evidence and are essential to perform intrusion detection.

Definition .. (Intrusion Detection) Intrusion detection is the separation of intrusions from normal activity through the analysis of the activity of a system, while the system is running. Intrusions are marked as alerts, which are formalized as a sequence of events A = [A1, A2, . . . , Ai, . . . , AL] ⊆ O.

Similarly, A ⊆ O must not be interpreted in a strict sense: it is just a notation to indicate that, for each intrusion, an alert may or may not exist. Note that ID can also be performed by manually inspecting a system's activity. This is clearly a tedious and inefficient task; thus, research and industrial interests are focused on automatic approaches.

Definition .. (Intrusion Detection System) An intrusion detection system is an automatic tool that performs the ID task.


Figure: Abstract I/O model of an IDS (inputs I and O; output A).

Given the above definitions, an abstract model of an IDS is shown in Figure ..

Although each IDS relies on its own data model and format to represent A, the Internet Engineering Task Force (IETF) proposed the Intrusion Detection Message Exchange Format (IDMEF) [Debar et al., ] as a common format for reporting alert streams generated by different IDSs.
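For reference, IDMEF is an XML-based message format (later published as RFC 4765). The following fragment is a heavily simplified sketch of the flavor of an IDMEF alert; it is not schema-complete, and all identifiers and values are invented.

    <IDMEF-Message version="1.0">
      <Alert messageid="abc123">
        <Analyzer analyzerid="hids-01"/>
        <CreateTime>2009-05-20T15:26:44Z</CreateTime>
        <Source><Node><Address><address>192.0.2.1</address></Address></Node></Source>
        <Target><Node><Address><address>192.0.2.2</address></Address></Node></Target>
        <Classification text="SQL injection attempt"/>
      </Alert>
    </IDMEF-Message>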

Evaluation

Evaluating an IDS means running it on collected data D = I ∪ O that resembles real-world scenarios. This means that such data includes both intrusions O and normal activity I, i.e., |I|, |O| > 0 and I ∩ O = ∅, i.e., I must include no malicious activity other than O. The system is run in a controlled environment to collect A along with performance indicators for comparison with other systems. This section presents the basic criteria used to evaluate modern IDSs.

More precisely, to perform an evaluation experiment correctly, a fundamental hypothesis must hold: the set O must be known and perfectly distinguishable from I. In other words, this means that D must be labeled with a truth file, i.e., a list of all the events known to be malicious. This allows treating the ID problem as a classic classification problem, for which a set of well-established evaluation metrics is available. In particular, we are interested in calculating the following sets.

Definition .. (True Positives (TPs)) The set of true positives is

TP := {Ai ∈ A | ∃Oj : f(Ai) = Oj}

where f : A → O is a generic function that, given an alert Ai, finds the corresponding intrusion Oj by parsing the truth file. The TP set is basically the set of alerts that are fired because a real intrusion has taken place. The perfect IDS is such that TP ≡ O.


Definition .. (False Positives (FPs)) The set of false positives is

FP := {Ai ∈ A | ∄Oj : f(Ai) = Oj}

On the other hand, the alerts in FP are incorrect because no real intrusion can be found in the observed activity. The perfect IDS is such that FP = ∅.

Definition .. (True Negatives (TNs)) The set of true negatives is

TN := {Ij ∈ I | ∄Ai : f(Ai) = Ij}

Note that the set of TNs does not contain alerts. Basically, it is the set of correctly unreported alerts. The perfect IDS is such that TN ≡ I.

Definition .. (False Negatives (FNs)) The set of false negatives is

FN := {Oj ∈ O | ∄Ai : f(Ai) = Oj}

Similarly to TN, FN does not contain alerts. Basically, it is the set of incorrectly unreported alerts. The perfect IDS is such that FN = ∅. Note that TP + TN + FP + FN = 1 must hold (when the four quantities are expressed as fractions of the whole dataset D).

In this and other documents, the term false alert refers to FN ∪ FP. Given the aforementioned sets, aggregated measures can be calculated.

Definition .. (Detection Rate (DR)) The detection rate, or true positive rate, is defined as:

DR := TP / (TP + FN)

The perfect IDS is such that DR = 1. Thus, the DR measures the detection capabilities of the system, that is, the amount of malicious events correctly classified and reported as alerts. On the other side, the False Positive Rate (FPR) is defined as follows.

Definition .. (FPR) The false positive rate is defined as:

FPR := FP / (FP + TN)


Figure: The ROC space. Detection Rate (DR) versus False Positive Rate (FPR), both on [0, 1]; the diagonal DR = FPR corresponds to a random classifier.

The perfect IDS is such that FPR = 0. Thus, the FPR measures the inaccuracies of the system, that is, the amount of legit events incorrectly classified and reported as alerts. There are also other metrics, such as the accuracy and the precision, which are often used to evaluate information retrieval systems; however, these metrics are not popular for IDS evaluation.
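A tiny worked example may make the two rates concrete (all counts below are invented):

    TP, FP, TN, FN = 90, 5, 900, 10   # sizes of the four sets

    DR  = TP / (TP + FN)   # detection rate (true positive rate)
    FPR = FP / (FP + TN)   # false positive rate

    print(DR)    # 0.9: 90% of the intrusions are reported
    print(FPR)   # ~0.0055: about 5.5 false alerts per 1,000 legit events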

The ROC analysis, originally adopted to measure the transmission quality of radar systems, is often used to produce a compact and easy-to-understand evaluation of classification systems. Even though there is no standardized procedure to evaluate an IDS, the research community agrees on the use of ROC curves to compare the detection capabilities and quality of an IDS. A ROC curve is the plot of DR = DR(FPR) and is obtained by tuning the IDS to trade off False Positives (FPs) against true positives. Without going into the details, each point of a ROC curve corresponds to a fixed amount of DR and FPR calculated under certain conditions (e.g., sensitivity parameters). By modifying its configuration, the quality of the classification changes and other points of the ROC are determined. The ROC space is plotted in Figure . along with the performance of a random classifier, characterized by DR = FPR. The perfect, ideal IDS can increase its DR from 0 to 1 with FPR = 0; in general, however, it must hold that FPR → 1 ⇒ DR → 1.
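A minimal sketch of how the points of such a curve can be produced, assuming an IDS that assigns each event an anomaly score and exposes a tunable alert threshold (scores and labels below are made up):

    # Each event: (anomaly score, label); label True = intrusion.
    events = [(0.9, True), (0.8, True), (0.7, False), (0.4, True),
              (0.3, False), (0.2, False), (0.1, False)]
    pos = sum(1 for _, lab in events if lab)   # total intrusions
    neg = len(events) - pos                    # total legit events

    # Sweep the sensitivity threshold: each setting yields one (FPR, DR) point.
    for threshold in (0.0, 0.25, 0.5, 0.75, 1.0):
        alerts = [lab for score, lab in events if score > threshold]
        tp = sum(1 for lab in alerts if lab)
        fp = len(alerts) - tp
        print(threshold, fp / neg, tp / pos)   # threshold, FPR, DR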


Alert Correlation

The problem of intrusion detection is challenging in today's complex networks. In fact, it is common to have more than one IDS deployed, monitoring different segments and different aspects of the whole infrastructure (e.g., hosts, applications, network, etc.). The amount of alerts reported by a network of IDSs running in a complex computer infrastructure is larger, by several orders of magnitude, than what was common in the smaller networks monitored years ago. In such a context, network administrators are overloaded by several alerts and long security reports, often containing a non-negligible amount of FPs. Thus, the creation of a clean, compact, and unified view of the security status of the network is needed. This process is commonly known as alert correlation [Valeur et al., ] and it is currently one of the most difficult challenges of this research field. More precisely.

Definition .. (Alert Correlation) Alert correlation is the identification of relations among alerts

A1, A2, . . . , Ai, . . . , AL ∈ A1 ∪ A2 ∪ · · · ∪ AK

to generate a unique, more compact and comprehensive sequence of alerts A′ = [A′1, A′2, . . . , A′P].

A desirable property is that A′ should be as complete as A1 ∪ A2 ∪ · · · ∪ AK without introducing errors such as alerts that do not correspond to a real intrusion. As for ID, alert correlation can be a manual, and tedious, task. Clearly, automatic alert correlation systems are more attractive and can be considered a complement to a modern IDS.

Definition .. (Alert Correlation System) An alert correlation system is an automatic tool that performs the alert correlation task.

The overall IDS abstract model is complemented with an alert correlation engine as shown in Figure ..

Figure: Abstract I/O model of an IDS with an alert correlation system (the IDS outputs alert streams A1, ..., An, which the ACS fuses into A′).
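As a toy illustration of the aggregation task just defined, the sketch below fuses alerts from several sensors that fall within the same time window into one aggregated alert; the window size and the alert format are illustrative assumptions, far simpler than the approaches surveyed later.

    # alerts: non-empty list of (timestamp, sensor, message), time-sorted.
    def fuse(alerts, window=2.0):
        fused, group = [], [alerts[0]]
        for alert in alerts[1:]:
            if alert[0] - group[-1][0] <= window:
                group.append(alert)   # close in time: same incident
            else:
                fused.append(group)
                group = [alert]
        fused.append(group)
        return fused

    stream = [(0.0, "nids", "SQL injection"),
              (0.4, "hids", "anomalous syscall args"),
              (9.1, "nids", "port scan")]
    for group in fuse(stream):
        print(len(group), "alert(s):", [msg for _, _, msg in group])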

Taxonomic Dimensions

The first comprehensive taxonomy of IDSs has been proposed in [Debar et al., ] and revised in [Debar et al., ]. Another good survey appeared in [Axelsson, a].

Compared to the classic surveys found in the literature, this section complements the basic taxonomic dimensions by focusing on modern techniques. In particular, today's intrusion detection approaches can be categorized by means of the specific modeling techniques appeared in recent research.

It is important to note that the taxonomic dimensions hereby suggested are not exhaustive; thus, certain IDSs may not fit into them (e.g., [Costa et al., ; Portokalidis et al., ; Newsome and Song, ]). On the other hand, an exhaustive and detailed taxonomy would be difficult to read. To overcome this difficulty, in this section we describe a high-level, technique-agnostic taxonomy based on the dimensions summarized in Table .; in each sub-section of Section ., which focuses on anomaly-based models, we expand the taxonomic dimensions by listing and accurately detailing further classification criteria.

Type of model

IDSs must be divided into two opposite groups: misuse-based vs. anomaly-based. The former create models of the malicious activity, while the latter create models of normal activity. Misuse-based models look for patterns of malicious activity; anomaly-based models look for unexpected activity. In some sense, IDSs can either "blacklist" or "whitelist" the observed activity.

Typically, the first type of models consists in a database of all the known attacks. Besides requiring frequent updates — which is just a technical difficulty and can be easily automated — misuse-based systems assume the feasibility of enumerating all the malicious activity.

Despite the limitation of being inherently incomplete, misuse-based systems are widely adopted in the real world [Roesch, , ].


This is mainly due to their simplicity (i.e., attack models are triggered by means of pattern matching algorithms) and accuracy (i.e., they generate virtually no false alerts because an attack signature can either match or not).

Anomaly-based approaches are more complex because creating a specification of normal activity is obviously a difficult task. For this reason, there are no well-established and widely-adopted techniques; instead, misuse models are as sophisticated as a pattern matching problem. In fact, the research on anomaly-based systems is very active.

These systems are effective only under the assumption that malicious activity, such as an attack or malware being executed, always produces sufficient deviations from normal activity such that models are triggered, i.e., anomalies. This clearly has the positive side-effect of requiring zero knowledge of the malicious activity, which makes these systems particularly interesting. The negative side-effect is their tendency to produce a significant amount of false alerts (the notion of "significant amount of false alerts" will be discussed in detail in Section ..).

Obviously, an IDS can benefit from both techniques by, for instance, enumerating all the known attacks and using anomaly-based heuristics to prompt for suspicious activity. This practice is often adopted by modern anti-viruses.

It is important to remark that, differently from the other taxonomic dimensions, the distinction between misuse- and anomaly-based approaches is fundamental: they are based on opposite hypotheses and yield completely different system designs and results. Their duality is highlighted in Table ..

Example .. (Misuse vs. Anomaly) A misuse-based system M and an anomaly-based system A process the same log containing a full dump of the system calls invoked by the kernel of an audited machine. Log entries are in the form:

<function_name>(<arg1_value>, <arg2_value>, ...)

The system M has the following simple attack model:

(The signature below is generated from the real-world exploit http://milw0rm.com/exploits/9303.)


Feature              Misuse-based   Anomaly-based
Modeled activity     Malicious      Normal
Detection method     Matching       Deviation
Threats detected     Known          Any
False negatives      High           Low
False positives      Low            High
Maintenance cost     High           Low
Attack description   Accurate       Absent
System design        Easy           Difficult

Table .: Duality between misuse- and anomaly-based intrusion detection techniques. Note that an anomaly-based IDS can detect "Any" threat, under the assumption that an attack always generates a deviation in the modeled activity.

    if (function_name == "read") {
        /* ... */
        if (match(decode(arg3_value),
                  "a{4}b{4}c{4}d{4}e{4}f{4}...x{4}3RH~TY7{33}QZjAXP0A0AkAAQ2AB2BB0BBAB\
                   XP8ABuJIXkweaHrJwpf02pQzePMhyzWwSuQnioXPOHuBxKn\
                   aQlkOjpJHIvKOYokObPPwRN1uqt5PA..."))
            fire_alert("VLC bug 35500 is being exploited!");
        /* ... */
    }

The simple attack signature looks for a pattern generated from the exploit. If the content of the buffer (i.e., arg3_value) that stores the malicious file matches the given pattern, then an alert is fired. On the other hand, the system A has the following model, based on the sample character distribution of each file's content. Such frequencies are calculated during normal operation of the application.

    /* ... */
    cd['<'] = {0.1, 0.11}
    cd['a'] = {0.01, 0.2}
    cd['b'] = {0.13, 0.23}
    /* ... */

    b = decode(arg3_value);

    if ( !(cd['c'][0] < count('c', b) < cd['c'][1]) ||
         !(cd['<'][0] < count('<', b) < cd['<'][1]) ||
         ... || ...)
        fire_alert("Anomalous content detected!");
    /* ... */


Obviously, more sophisticated models can be designed. The purpose of this example is that of highlighting the main differences between the two approaches.
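For concreteness, here is a runnable Python rendition of system A's character-frequency model; the training payloads and the tolerance are invented, and real systems use sounder statistics.

    from collections import Counter

    # Relative character frequencies of a payload.
    def frequencies(payload):
        counts = Counter(payload)
        return {ch: counts[ch] / len(payload) for ch in counts}

    # Training: record the [min, max] frequency range of each character.
    def train(payloads):
        ranges = {}
        for p in payloads:
            for ch, f in frequencies(p).items():
                lo, hi = ranges.get(ch, (f, f))
                ranges[ch] = (min(lo, f), max(hi, f))
        return ranges

    # Detection: flag payloads whose frequencies leave the learned ranges.
    def is_anomalous(payload, ranges, tolerance=0.1):
        for ch, f in frequencies(payload).items():
            lo, hi = ranges.get(ch, (0.0, 0.0))
            if f < lo - tolerance or f > hi + tolerance:
                return True
        return False

    cd = train(["<html><body>hello</body></html>", "<html><p>hi</p></html>"])
    print(is_anomalous("<html><body>bye</body></html>", cd))  # expected: False
    print(is_anomalous("AAAAAAAAAAAAAAAAAAAAAAAAAAAA", cd))   # True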

A generalization of the aforementioned examples allows us to better define an anomaly-based IDS.

Definition .. (Anomaly-based IDS) An anomaly-based IDS is a type of IDS that generates alerts A by relying on normal activity profiles (Definition ..).

System activity

IDSs can be classified based on the type of activity they monitor. The classic literature distinguishes between network-based and host-based systems: Network-based Intrusion Detection Systems (NIDSs) and Host-based Intrusion Detection Systems (HIDSs), respectively. The former inspect network traffic (i.e., raw bytes sniffed from the network adapters), while the latter inspect the activity of the operating system. The scope of a network-based system is as large as the broadcast domain of the monitored network. On the other hand, the scope of host-based systems is limited to the single host.

Network-based IDSs have the advantage of having a large scope, while host-based ones have the advantage of being fed with detailed information about the host they run on (e.g., process information, Central Processing Unit (CPU) load, number of active processes, number of users). This information is often unavailable to network-based systems and can be useful to refine a decision regarding suspicious activity. For example, by inspecting both the network traffic and the kernel activity, an IDS can filter out the alerts regarding the Apache web server version .. on all the hosts running version ... On the other hand, network-based systems are centralized and are much easier to manage and deploy. However, NIDSs are limited to the inspection of unencrypted payload, while HIDSs may have it decrypted by the application layer. For example, a NIDS cannot detect malicious HTTPS traffic.

The network stack is standardized (see Note ..), thus the definition of network-based IDS is precise. On the other hand, because of the immense variety of operating system implementations, a clear definition of "host data" is lacking. Existing host-based IDSs analyze audit log files in several formats; other systems keep track of the commands issued by the users through the console. Some efforts have been made to propose standard formats for host data: Basic Security Module (BSM) and its modern re-implementation called OpenBSM [Watson and Salamon, ; Watson, ] are probably the most used by the research community, as they allow developers to gather the full dump of the system calls before execution in the kernel.

Note .. Although the network stack implementation may vary from system to system (e.g., Windows and Cisco platforms have different implementations of the Transmission Control Protocol (TCP)), it is important to underline that the notion of an IP, TCP, or HTTP packet is well defined in a system-agnostic way, while the notion of operating system activity is rather vague and by no means standardized.

Example .. describes a sample host-based misuse detectionsystem that inspects the arguments of the system calls. A similar,but more sophisticated, example based on network traffic is Snort[Roesch, , ].

Note .. (Inspection layer) Network-based systems can be further categorized by their protocol inspection capabilities with respect to the network stack. webanomaly [Kruegel et al., ; Robertson, ] is network-based, in the sense that it runs in promiscuous mode and inspects network data. On the other hand, it is also HTTP-based, in the sense that it decodes the payload and reconstructs the HTTP messages (i.e., request and response) to detect attacks against web applications.

... Model construction method

Regardless of their type, misuse- or anomaly-based, models can be specified either manually or automatically. However, by their nature, misuse models are often manually written, because they are based on the exhaustive enumeration of known malicious activity. Typically, these models are called attack signatures; the largest repository of manually generated misuse signatures is released by the Sourcefire Vulnerability Research Team [Sourcefire, ]. Misuse signatures can also be generated automatically: for instance, in [Singh et al., ] a method to build misuse models of worms is described. Similarly, low-interaction honeypots often use malware emulation to automatically generate signatures. Two recently proposed techniques are [Portokalidis et al., ; Portokalidis and Bos, ].

On the other hand, anomaly-based models are more suitable for automatic generation. Most of the anomaly-based approaches in the literature focus on unsupervised learning mechanisms to construct models that precisely capture the normal activity observed. Manually specified models are typically more accurate and less prone to FPs, although automatic techniques are clearly more desirable.

Example .. (Learning character distributions) In Example .., the described system A adopts a learning-based character distribution model for strings. Without going into the details, the idea described in [Mutz et al., ] is to observe string arguments and estimate the characters' distribution over the American Standard Code for Information Interchange (ASCII) set. More practically, the model is a histogram H(c), ∀c ∈ [0, 255], where H(c) is the normalized frequency of character c. During detection, a χ2 test is used to decide whether or not a certain string is anomalous.

Besides the obvious advantage of being resilient against evasion, this model requires no human intervention.
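The following minimal Python sketch conveys the idea: a normalized byte histogram is estimated from training strings, and a Pearson χ2 statistic decides whether a new string deviates from it. The training strings, threshold, and the flat (non-ranked) binning are our assumptions, not the exact procedure of [Mutz et al.].

# Sketch of a learning-based character distribution model with a chi-square
# test; the threshold and training data are hypothetical.
from collections import Counter

def train(strings):
    # Estimate the normalized frequency H(c) of each byte value 0..255.
    counts = Counter()
    total = 0
    for s in strings:
        counts.update(s.encode())
        total += len(s)
    return {c: counts[c] / total for c in range(256)}

def is_anomalous(s, model, threshold=50.0):
    # Pearson chi-square statistic between observed counts and the counts
    # expected under the learned distribution.
    observed = Counter(s.encode())
    n = len(s)
    stat = 0.0
    for c, h in model.items():
        expected = h * n
        if expected > 0:
            stat += (observed[c] - expected) ** 2 / expected
    return stat > threshold

model = train(["/etc/passwd", "/etc/hosts", "/var/log/messages"])
print(is_anomalous("/etc/shadow", model), is_anomalous("AAAA" * 64, model))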


. Relevant Anomaly Detection Techniques

Our research focuses on anomaly detection. In this section, selected state-of-the-art approaches are reviewed, with particular attention to network-, host- and web-based techniques, along with the most influential alert correlation approaches in the recent literature. This section provides the reader with the basic concepts needed to understand our contributions.

.. Network-based techniques

Our research does not include network-based IDSs. Our contributions in alert correlation, however, leverage both network- and host-based techniques; thus, a brief overview of the latest (i.e., proposed between and ) network-based detection approaches is provided in this section. We remark that all the techniques included in the following are based on TCP/IP, meaning that models of normal activity are constructed by inspecting the decoded network frames up to the TCP layer.

In Table . the selected approaches are marked with bullets to highlight their specific characteristics. Such characteristics are based on our analysis and experience; thus, other classifications may be possible. They are defined as follows:

Time refers to the use of timestamp information, extracted from network packets, to model normal packets. For example, normal packets may be modeled by their minimum and maximum inter-arrival time.

Header means that the TCP header is decoded and the fields are modeled. For example, normal packets may be modeled by the observed port range.

Payload refers to the use of the payload, either at the Internet Protocol (IP) or TCP layer. For example, normal packets may be modeled by the most frequent byte in the observed payloads.

Stochastic means that stochastic techniques are exploited to createmodels. For example, the model of normal packets may beconstructed by estimating the sample mean and variance ofcertain features (e.g., port number, content length).


Deterministic means that certain features are modeled following a deterministic approach. For example, normal packets may be only those containing a specified set of values for the Time To Live (TTL) field.

Clustering refers to the use of clustering (and subsequent classification) techniques. For instance, payload byte vectors may be compressed using a Self Organizing Map (SOM), where classes of different packets stimulate neighboring nodes.

Note that, since recent research has experimented with several techniques and algorithms, mixed approaches exist and often lead to better results.

In [Mahoney and Chan, ] a mostly deterministic, simple detection technique is presented. During training, each field of the header of the TCP packets is extracted and tokenized into byte bins (for memory efficiency reasons). The tokenized values are clustered by means of their values, and every time a new value is observed the clustering is updated. The detection approach is deterministic, since a packet is classified as anomalous if the values of its header do not match any of the clusters. Besides being completely unsupervised, this technique detects between and of the probe and DoS attacks, respectively, in Intrusion Detection eVALuation (IDEVAL) [Lippmann et al., ]. Slight performance issues and a rate of FPs per day (i.e., roughly, a false alert every hours) are the only disadvantages of the approach.
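The following toy Python sketch illustrates this clustering-based header model: per field, a bounded set of value ranges is learned, and a value falling in no learned range is anomalous. The cluster budget, merge heuristic, and example field are our assumptions, not parameters from the paper.

# Toy sketch of a per-field clustering model for packet headers.
class FieldModel:
    def __init__(self, max_clusters=32):
        self.ranges = []               # list of [lo, hi] clusters
        self.max_clusters = max_clusters

    def train(self, value):
        for r in self.ranges:
            if r[0] <= value <= r[1]:
                return                 # value already covered
        self.ranges.append([value, value])
        if len(self.ranges) > self.max_clusters:
            # Merge the two closest clusters to stay within the budget.
            self.ranges.sort()
            gaps = [(self.ranges[i + 1][0] - self.ranges[i][1], i)
                    for i in range(len(self.ranges) - 1)]
            _, i = min(gaps)
            self.ranges[i][1] = self.ranges[i + 1][1]
            del self.ranges[i + 1]

    def is_anomalous(self, value):
        return not any(lo <= value <= hi for lo, hi in self.ranges)

ttl = FieldModel()
for v in (64, 64, 128, 63):
    ttl.train(v)
print(ttl.is_anomalous(1))   # True: no TTL near 1 was ever observed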

The approach described in [Kruegel et al., ] reconstructs the payload stream for each service (i.e., port); this prevents evasion of the detection mechanism through packet fragmentation. In addition, a basic application inspection is performed to distinguish among the different types of request of the specific service; e.g., for HTTP the request type could be GET, POST, or HEAD. The sample mean and variance of the content length are also calculated, and the distribution of the bytes (interpreted as ASCII characters) found in the payload is estimated using simple histograms. The anomaly score used for detection aggregates information regarding the type of service, expected content length, and payload distribution. With a low performance overhead and low FPR, this system is capable of detecting anomalous interactions with the application layer.


Table .: Taxonomy of the selected state-of-the-art approaches for network-based anomaly detection. [The bullet matrix could not be recovered from the source; its columns are Time, Header, Payload, Stochastic, Deterministic, and Clustering, and its rows are the approaches of Mahoney and Chan; Kruegel et al.; Sekar et al.; Ramadas; Mahoney and Chan; Zanero and Savaresi; Wang and Stolfo; Zanero; Bolzoni et al.; and Wang et al.]


One critique is that the system has not been tested on several applications other than Domain Name System (DNS).
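To make the combination of models concrete, here is a minimal Python sketch of a per-service payload model along these lines; the score aggregation (a length z-score plus an L1 histogram distance) is our simplification, not the exact score used by the authors.

# Sketch of a service-specific payload model: length statistics plus a
# byte histogram, aggregated into a single anomaly score.
import math
from collections import Counter, defaultdict

class ServiceModel:
    def __init__(self):
        self.lengths = []
        self.bytes = Counter()

    def train(self, payload: bytes):
        self.lengths.append(len(payload))
        self.bytes.update(payload)

    def score(self, payload: bytes) -> float:
        # Length component: distance from the sample mean in std deviations.
        n = len(self.lengths)
        mean = sum(self.lengths) / n
        var = sum((l - mean) ** 2 for l in self.lengths) / n or 1.0
        z = abs(len(payload) - mean) / math.sqrt(var)
        # Byte-distribution component: L1 distance between histograms.
        total = sum(self.bytes.values())
        obs = Counter(payload)
        dist = sum(abs(obs[b] / len(payload) - self.bytes[b] / total)
                   for b in range(256))
        return z + dist   # aggregate anomaly score; higher is worse

models = defaultdict(ServiceModel)    # keyed by (port, request type)
models[(80, "GET")].train(b"GET /index.html HTTP/1.0")
print(models[(80, "GET")].score(b"GET /" + b"A" * 500))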

Probably inspired by the misuse-based, finite-state technique described in [Vigna and Kemmerer, ], in [Sekar et al., ] the authors describe a system to learn the TCP specification. The basic finite-state model is extended with a network of stochastic properties (e.g., the frequency of certain transitions, the most common value of a state attribute, the distribution of the values found in the fields of the IP packets) among states and transitions. Such properties are estimated during training and exploited at detection time to implement smoother thresholds that ensure as few as . false alerts per day. On the other hand, the deterministic nature of the finite state machine detects attacks with a DR.

Learning Rules for Anomaly Detection (LERAD), the system described in [Mahoney and Chan, b], is an optimized rule-mining algorithm that works well on data with tokenized domains, such as the fields of TCP or HTTP packets. Although the idea implemented in LERAD can be applied at any protocol layer, it has been tested on TCP and HTTP, and no more than of the attacks in the testing dataset were detected. Even if the FPR is acceptable (i.e., alerts per day), its limited detection capabilities worsen if real-world data is used instead of synthetic datasets such as IDEVAL. In [Tandon and Chan, ] the LERAD algorithm is used to mine rules expressing "normal" values of arguments, normal sequences of system calls, or both. No relationship is learned among the values of different arguments; sequences and argument values are handled separately; the evaluation is quite poor, however, and uses non-standard metrics.

Unsupervised learning techniques to mine patterns from the payload of packets have been shown to be an effective approach. Both the network-based approaches described so far and other proposals [Labib and Vemuri, ; Mahoney, ] had to cope with data represented using a high number of dimensions (e.g., a vector with dimensions, that is, the maximum number of bytes in the TCP payload). While the majority of the proposals circumvent the issue by ignoring the payload, it is brilliantly solved in [Ramadas, ] and extended in Unsupervised Learning IDS with 2-Stages Engine (ULISSE) [Zanero and Savaresi, ; Zanero, b] by exploiting the clustering capabilities of a SOM [Kohonen, ] with a faster algorithm [Zanero, a], specifically designed for high-dimensional data. The payload of TCP packets is indeed compressed into the bi-dimensional grid representing the SOM, organized in such a way that classes of packets can be quickly extracted. The approach relies on, and confirms, the assumption that the traffic belongs to a relatively small number of services and protocols that can be mapped onto a small number of clusters. Network packets are modeled as a multivariate time series, where the variables include the packet class in the SOM plus some other features extracted from the header. At detection time, a fast discounting learning algorithm for outlier detection [Yamanishi et al., ] is used to detect anomalous packets. Although the system has not been released yet, the prototype is able to reach a . DR with as few as . FPs. In comparison, for one of the prototypes dealing with payloads available in the literature [Wang et al., ], the best overall result leads to the detection of . of the attacks, with a FPR between . and . The main weakness of this approach is that it works at the granularity of packets and thus might be prone to simple evasion attempts (e.g., splitting an attack into several malicious packets, interleaved with long sequences of legitimate packets). Inspired by [Wang et al., ] and [Zanero and Savaresi, ], [Bolzoni et al., ] has been proposed.
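The following is a compact, self-contained SOM sketch for clustering payload byte histograms, in the spirit of the approaches above; the grid size, learning rate, and decay schedule are arbitrary choices of ours, and real systems use far larger training sets.

# Minimal self-organizing map over payload byte histograms.
import numpy as np

rng = np.random.default_rng(0)
grid_w, grid_h, dim = 8, 8, 256           # 8x8 map over 256-bin histograms
weights = rng.random((grid_w * grid_h, dim))
coords = np.array([(x, y) for x in range(grid_w) for y in range(grid_h)])

def histogram(payload: bytes) -> np.ndarray:
    h = np.bincount(np.frombuffer(payload, dtype=np.uint8), minlength=256)
    return h / max(len(payload), 1)

def train(payloads, epochs=10, lr=0.5, radius=3.0):
    global weights
    for _ in range(epochs):
        for p in payloads:
            v = histogram(p)
            bmu = np.argmin(((weights - v) ** 2).sum(axis=1))  # best unit
            # Pull the best-matching unit and its neighbors toward v.
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            influence = np.exp(-d2 / (2 * radius ** 2))
            weights += lr * influence[:, None] * (v - weights)
        lr, radius = lr * 0.9, radius * 0.9    # decay over epochs

def packet_class(payload: bytes) -> int:
    # The winning node acts as the "class" of the packet.
    return int(np.argmin(((weights - histogram(payload)) ** 2).sum(axis=1)))

train([b"GET / HTTP/1.0", b"GET /a HTTP/1.0", b"\x90" * 64])
print(packet_class(b"GET /b HTTP/1.0"), packet_class(b"\x90" * 60))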

The approach presented in [Wang et al., ] differs from the one described in [Zanero and Savaresi, ; Zanero, b] even though the underlying key idea is rather similar: byte frequency distribution. Both approaches, and also [Kruegel et al., ], exploit the distribution of byte values found in the network packets to produce some sort of "normality" signature. While [Wang et al., ] utilizes a simple clustering algorithm to aggregate similar packets and produce a smoother and more abstract signature, [Zanero and Savaresi, ; Zanero, b] introduces the use of a SOM to accomplish the task of finding classes of normal packets.

An extension to [Wang and Stolfo, ], which uses 1-grams, is described in [Wang et al., ]: it uses higher-order, randomized n-grams to mitigate mimicry attacks. In addition, the newer approach does not estimate the frequency distribution of the n-grams, which would cause many FPs as n increases; instead, it adopts a filtering technique to compress the n-grams into memory-efficient arrays of bits. This decreased the FPR by about two orders of magnitude.
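A minimal sketch of the bit-array idea follows: n-grams seen in training set bits in a Bloom-filter-like structure, and detection scores the fraction of never-seen n-grams. The filter size, hash construction, and n are our assumptions.

# Sketch of Bloom-filter-style n-gram filtering for payloads.
import hashlib

class NGramFilter:
    def __init__(self, n=5, size_bits=2 ** 20):
        self.n, self.size = n, size_bits
        self.bits = bytearray(size_bits // 8)

    def _positions(self, gram: bytes):
        # Two hash positions per n-gram (Bloom-filter style).
        for salt in (b"a", b"b"):
            h = int.from_bytes(hashlib.sha1(salt + gram).digest()[:8], "big")
            yield h % self.size

    def train(self, payload: bytes):
        for i in range(len(payload) - self.n + 1):
            for pos in self._positions(payload[i:i + self.n]):
                self.bits[pos // 8] |= 1 << (pos % 8)

    def score(self, payload: bytes) -> float:
        # Fraction of never-seen n-grams; higher means more anomalous.
        grams = max(len(payload) - self.n + 1, 1)
        unseen = 0
        for i in range(len(payload) - self.n + 1):
            if not all(self.bits[p // 8] >> (p % 8) & 1
                       for p in self._positions(payload[i:i + self.n])):
                unseen += 1
        return unseen / grams

f = NGramFilter()
f.train(b"GET /index.html HTTP/1.0\r\n")
print(f.score(b"GET /index.htm HTTP/1.0\r\n"), f.score(b"\x90" * 40))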

.. Host-based techniques

A survey of the latest (i.e., proposed between and ) host-based detection approaches is provided in this section. Most of the techniques leverage the system calls invoked through the kernel to create models of the normal behavior of processes.

In Table . the selected approaches are marked with bullets to highlight their specific characteristics. Such characteristics are based on our analysis and experience; thus, other classifications may be possible. They are defined as follows:

Syscall refers, in general, to the use of system calls to characterize normal host activity. For example, a process may be modeled by the stream of system calls invoked during normal operation.

Stochastic means that stochastic techniques are exploited to createmodels. For example, the model of normal processes may beconstructed by estimating the sample mean and variance ofcertain features (e.g., length of the open’s path argument).

Deterministic means that certain features are modeled following a deterministic approach. For example, normal processes may be only those that use a fixed number of file descriptors.

Comprehensive approaches are those that have been extensivelydeveloped and, in general, incorporate a rich set of features,beyond the proof-of-concept.

Context refers to the use of context information in general. Forexample, the normal behavior of processes can be modeledalso by means of the number of the environmental variablesutilized or by the sequence of system calls invoked.

Data means that the data flow is taken into account. For example,normal processes may be modeled by the set of values of thesystem call arguments.

Forensics means that the approach has been also evaluated for off-line, forensics analysis.


Our contributions are included in Table . and are detailed in Chapter . Note that, since recent research has experimented with several techniques and algorithms, mixed approaches exist and often lead to better results.

Host-based anomaly detection has been part of intrusion detection since its very inception: it already appears in the seminal work [Anderson, ]. However, the definition of a set of statistical characterization techniques for events, variables and counters, such as the CPU load and the usage of certain commands, is due to [Denning, ]. The first mention of intrusion detection through the analysis of the sequence of syscalls from system processes is in [Forrest et al., ], where "normal sequences" of system calls are considered. A similar idea was presented earlier in [Ko et al., ], which proposes a misuse-based approach that manually describes the canonical sequence of calls of each and every program, something evidently impossible in practice.

In [Lee and Stolfo, ] a set of models based on data mining techniques is proposed. In principle, the models are agnostic with respect to the type of raw event collected, which can be user activity (e.g., login time, CPU load), network packets (e.g., data collected with tcpdump), or operating system activity (e.g., system call traces collected with OpenBSM). Events are processed using automatic classification algorithms to assign labels drawn from a finite set. In addition, frequent episodes are extracted and, finally, association rules among events are mined. These three algorithms are combined at detection time. Events marked with wrong labels, unexpectedly frequent episodes, or rule violations will all trigger alerts.

Alternatively, other authors proposed to use static analysis, as opposed to dynamic learning, to profile a program's normal behavior. The technique described in [Wagner and Dean, ] combines the benefits of dynamic and static analysis. In particular, three models are proposed to automatically derive a specification of the application behavior: call-graph, context-free grammars (or non-deterministic pushdown automata), and digraphs. All the models' building blocks are system calls. The call-graph is statically constructed and then simulated, while the program is running, to resolve non-deterministic paths. In some sense, the context-free grammar model — called abstract stack — is the evolution of the call-graph model, as it allows to keep track of the state (i.e., call stack). The digraph — actually k-graph — model keeps track of k-long sequences of system calls starting from an arbitrary point of the execution.


Table .: Taxonomy of the selected state-of-the-art approaches for host-based anomaly detection. Our contributions are highlighted. [The bullet matrix could not be recovered from the source; its columns are Syscall, Stochastic, Deterministic, Comprehensive, Context, Data, and Forensics, and its rows are the approaches of Lee and Stolfo; Sekar et al.; Wagner and Dean; Tandon and Chan; Kruegel et al.; Zanero; Giffin et al.; Mutz et al.; Bhatkar et al.; Mutz et al.; Fetzer and Suesskraut; Maggi et al.; Maggi et al.; and Frossi et al.]


Despite its simplicity, which ensures a negligible performance overhead with respect to the others, this model achieves the best detection precision.

In [Tandon and Chan, ] the LERAD algorithm (Learning Rules for Anomaly Detection) is described. Basically, it is a learning system to mine rules expressing normal values of arguments, normal sequences of system calls, or both. In particular, the basic algorithm learns rules in the form A = a, B = b, · · · ⇒ X ∈ {x1, x2, . . .}, where uppercase letters indicate parameter names (e.g., path, flags) while lowercase symbols indicate their corresponding values. For some reason, the rule-learning algorithm first extracts random pairs from the training set to generate a first set of rules. After this, two optimization steps are run to remove rules with low coverage and those prone to generate FPs (according to a validation dataset). A system call is deemed anomalous if no matching rule is found. A similar learning and detection algorithm is run on sequences of system calls. The main limitation of the described approach is that no relationship is learned among the values of different arguments of the same system call. Besides the unrealistic assumption regarding the availability of a labeled validation dataset, another side issue is that the evaluation is quite poor and uses non-standard metrics.
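The following toy sketch conveys the flavor of such rules (the coverage-based and FP-based pruning steps are omitted); attribute names and training tuples are invented for illustration.

# Toy sketch of LERAD-style rules over system call attribute tuples:
# A = a  ==>  X in {x1, x2, ...}, learned from training data.
from collections import defaultdict

def learn_rules(tuples, antecedent, consequent):
    # For each value of the antecedent attribute, collect the set of
    # consequent values seen in training.
    rules = defaultdict(set)
    for t in tuples:
        rules[t[antecedent]].add(t[consequent])
    return rules

def is_anomalous(t, rules, antecedent, consequent):
    allowed = rules.get(t[antecedent])
    return allowed is not None and t[consequent] not in allowed

training = [
    {"syscall": "open", "flags": "O_RDONLY", "path": "/etc/passwd"},
    {"syscall": "open", "flags": "O_RDONLY", "path": "/etc/hosts"},
    {"syscall": "open", "flags": "O_WRONLY", "path": "/tmp/out"},
]
rules = learn_rules(training, "syscall", "flags")
print(is_anomalous({"syscall": "open", "flags": "O_RDWR"}, rules,
                   "syscall", "flags"))   # True: flag value never seen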

LibAnomaly [Kruegel et al., a] is a library to implement stochastic, self-learning, host-based anomaly detection systems. A generic anomaly detection model is trained using a number of system calls from a training set. At detection time, a likelihood rating is returned by the model for each new, unseen system call (i.e., the probability of it being generated by the model). A confidence rating can be computed at training time for any model, by determining how well it fits its training set; this value can be used at runtime to provide additional information on the reliability of the model. When data is available, by using cross-validation, an overfitting rating can also be optionally computed.

LibAnomaly includes four basic models. The string length model computes, from the strings seen in the training phase, the sample mean µ and variance σ² of their lengths. In the detection phase, given l, the length of the observed string, the likelihood p of the input string length with respect to the values observed in training is equal to 1 if l < µ + σ, and to σ²/(l − µ)² otherwise. As mentioned in Example .., the character distribution model analyzes the discrete probability distribution of characters in a string. At training time, the so-called ideal character distribution is estimated: each string is considered as a set of characters, which are inserted into a histogram, in decreasing order of occurrence, with a classical rank order/frequency representation. During the training phase, a compact representation of the mean and variance of the frequency for each rank is computed. For detection, a χ² Pearson test returns the likelihood that the observed string histogram comes from the learned model. The structural inference model encodes the syntax of strings. These are simplified before the analysis, using a set of reduction rules, and then used to generate a probabilistic grammar by means of a Markov model induced by exploiting a Bayesian merging procedure, as described in [Stolcke and Omohundro, c, b, b]. The token search model is applied to arguments which contain flags or modes. During detection, if the field has been flagged as a token, the input is compared against the stored values list. If it matches a former input, the model returns 1 (i.e., not anomalous); otherwise it returns 0 (i.e., anomalous).
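A direct Python transcription of the string length model, with illustrative training data:

# String length model: p = 1 if l < mu + sigma, else sigma^2 / (l - mu)^2.
def train_length_model(strings):
    lengths = [len(s) for s in strings]
    mu = sum(lengths) / len(lengths)
    var = sum((l - mu) ** 2 for l in lengths) / len(lengths)
    return mu, var

def length_likelihood(l, mu, var):
    if l < mu + var ** 0.5:
        return 1.0
    return var / (l - mu) ** 2

mu, var = train_length_model(["/etc/passwd", "/bin/ls", "/tmp/x"])
print(length_likelihood(9, mu, var))    # short string: likelihood 1
print(length_likelihood(400, mu, var))  # overlong string: tiny likelihood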

In [Zanero, ] a general Bayesian framework for encoding the behavior of users is proposed. The approach is based on hints drawn from the quantitative methods of ethology and behavioral sciences. The behavior of a user interacting with a text-based console is encoded as a Hidden Markov Model (HMM). The observation set includes the commands (e.g., ls, vim, cd, du) encountered during training. Thus, the system learns the user behavior in terms of the model structure (e.g., number of states) and parameters (e.g., transition matrix, emission probabilities). At detection time, unexpected or out-of-context commands are detected as violations (i.e., lower values) of the learned probabilities. One of the major drawbacks of this system is its applicability to real-world scenarios: in fact, today's host-based threats perform more sophisticated and stealthy operations than invoking commands.
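As an illustration, the following sketch scores a command sequence under a discrete HMM with the standard scaled forward algorithm. The two-state model and all probabilities are invented; in the described system, both structure and parameters are learned from training data.

# Minimal discrete-HMM scoring of command sequences (forward algorithm).
import numpy as np

commands = {"ls": 0, "vim": 1, "cd": 2, "du": 3}
pi = np.array([0.6, 0.4])                   # initial state distribution
A = np.array([[0.7, 0.3], [0.4, 0.6]])      # state transition matrix
B = np.array([[0.4, 0.1, 0.4, 0.1],         # emission probabilities
              [0.1, 0.5, 0.1, 0.3]])

def log_likelihood(seq):
    # P(observations | model), computed in log space via scaling.
    alpha = pi * B[:, commands[seq[0]]]
    ll = np.log(alpha.sum())
    alpha /= alpha.sum()
    for cmd in seq[1:]:
        alpha = (alpha @ A) * B[:, commands[cmd]]
        ll += np.log(alpha.sum())
        alpha /= alpha.sum()
    return ll

# Sequences violating the learned probabilities get a lower log-likelihood.
print(log_likelihood(["cd", "ls", "vim"]), log_likelihood(["du", "du", "du"]))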

In [Giffin et al., ] an improved version of [Wagner andDean, ] is presented. It is based on the analysis of the binariesand incorporates the execution environment as a model constraint.More precisely, the environment is defined as the portion of inputknown at process load time and fixed until it exits. In addition,


the technique deals with dynamically-linked libraries and is capableof constructing the data-flow analysis even across different shared-objects.

An extended version of LibAnomaly is described in [Mutz et al., ]. The basic detection models are essentially the same. In addition, the authors exploit Bayesian networks instead of naive thresholds to classify system calls according to each model's output. This results in an improvement in the DRs. Some of our work described in Section is based upon this and the original version of LibAnomaly.

Data-flow analysis has also been recently exploited in [Bhatkar et al., ], where an anomaly detection framework is developed. Basically, it builds a Finite State Automaton (FSA) model of each monitored program, on top of which it creates a network of relations (called properties) among the system call arguments encountered during training. Such a network of properties is the main difference with respect to other FSA-based IDSs. Instead of a pure control flow check, which focuses on the behavior of the software in terms of sequences of system calls, it also performs a so-called data flow check on the internal variables of the program along their existing cycles. This approach has really interesting properties, among which the fact that, not being stochastic, useful properties can be demonstrated in terms of detection assurance. On the other hand, though, the set of relationships that can be learned is limited (whereas the relations encoded by means of the stochastic models we describe in Section .. are not decided a priori and are thus virtually infinite). The relations are all deterministic, which leads to a brittle detection model potentially prone to FPs. Finally, it does not discover any type of relationship between different arguments of the same call.

This knowledge is exploited in terms of unary and binary relationships. For instance, if an open system call always uses the same filename at the same point, a unary property can be derived. Similarly, relationships among two arguments are supported, by inference over the observed sequences of system calls, creating constraints for the detection phase. Unary relationships include equal (the value of a given argument is always constant), elementOf (an argument can take a limited set of values), subsetOf (a generalization of elementOf, indicating that an argument can take multiple values, all of which are drawn from a set), range (specifies boundaries for numeric arguments), isWithinDir (a file argument is always contained within a specified directory), and hasExtension (file extensions).


Figure .: A data flow example with both unary and binary relations (due to [Bhatkar et al., ]): an FSA whose states are the call sites of a small directory-copy program, annotated with learned properties such as M3 elementOf {WR}, F8 isWithinDir F6, and FD12 equal FD11. [The program listing and the automaton drawing are not reproducible from the source.]

Binary relationships include: equal (equality between system call operands), isWithinDir (file located in a specified directory; contains is the opposite), hasSameDirAs, hasSameBaseAs, and hasSameExtensionAs (two arguments have a common directory, base directory, or extension, respectively).

The behavior of each application is logged by storing the Process IDentifier (PID) and the Program Counter (PC), along with the system calls invoked, their arguments, and returned values. The use of the PC to identify the states in the FSA stands out as an important difference from other approaches. The PC of each system call is determined through stack unwinding (i.e., going back through the activation records of the process stack until a valid PC is found). The technique obviously handles process cloning and forking.

The learning algorithm is rather simple: each time a new value is found, it is checked against all the known values of the same type. Relations are inferred for each execution of the monitored program and then pruned on a set-intersection basis. For instance, if relations R1 and R2 are learned from an execution trace T1 but only R1 is satisfied in trace T2, the resulting model will not contain R2. Such a process is obviously prone to FPs if the training phase is not exhaustive, because invalid relations would be kept instead of being discarded.
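A toy sketch of this intersection-based pruning over a single, simplified unary relation (an elementOf value set per call site); the trace format and call-site names are hypothetical.

# Sketch of per-trace relation inference with set-intersection pruning.
def infer_unary(trace):
    # One toy unary relation per call site: the set of observed path
    # values (an "elementOf" relation).
    relations = {}
    for site, path in trace:            # trace: [(call site, path), ...]
        relations.setdefault(site, set()).add(path)
    return relations

def intersect(model, new):
    # Keep a call site's relation only if it appears in every trace;
    # widen the elementOf set with newly observed values.
    return {site: model[site] | new[site]
            for site in model.keys() & new.keys()}

t1 = [("open@0x4008", "/etc/passwd"), ("open@0x40a0", "/tmp/a")]
t2 = [("open@0x4008", "/etc/passwd")]
model = intersect(infer_unary(t1), infer_unary(t2))
print(model)   # only the relation holding in both traces survives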


Figure . shows an example (due to [Bhatkar et al., ]) of the final result of this process. During detection, missing transitions or violations of properties are flagged as alerts. The detection engine keeps track of the execution over the learned FSA, comparing transitions and relations with what happens, and raising an alert if an edge is missing or a constraint is violated.

This FSA approach is promising and has interesting features, especially in terms of detection capabilities. On the other hand, it only takes into account relationships between different types of arguments. Also, the set of properties is limited to pre-defined ones and is totally deterministic. This leads to a possibly incomplete detection model potentially prone to false alerts. In Section . we detail how our approach improves the original implementation.

Another approach based on the analysis of system calls is [Fetzer and Suesskraut, ]. In principle, the system is similar to the behavior-based techniques we mentioned before. However, the authors have tried to overcome two limitations of learning-based approaches, which typically have high FPRs and require a quite ample training set. This last issue is mitigated by adopting a completely different approach: instead of requiring training, the system administrator is required to specify a set of small, whitelist-like models of the desired behavior of a certain application. At runtime, these models are evolved and adapted to the particular context the protected application runs in; in particular, the system exploits taint analysis to update a system call model on demand. This system can offer very high levels of protection, but the effort required to specify the initial model may not be trivial; however, the effort may be worthwhile for mission-critical applications, for which customized hardening would be needed anyway.

.. Web-based techniques

A survey of the latest (i.e., proposed between and ) web-based detection approaches is provided in this section. All the techniques included in the following are based on HTTP, meaning that models of normal activity are constructed either by inspecting the decoded network frames up to the HTTP layer, or by acting as reverse HTTP proxies.

In Table . the selected approaches are marked with bullets to highlight their specific characteristics. Such characteristics are based on our analysis and experience; thus, other classifications may be possible. They are defined as follows:

Adaptive refers to the capability of self-adapting to variations inthe normal behavior.

Stochastic means that stochastic techniques are exploited to createmodels. For example, the model of normal HTTP requestsmay be constructed by estimating the sample mean and vari-ance of certain features (e.g., length of the string parameterscontained in a POST request).

Deterministic means that certain features are modeled following a deterministic approach. For example, normal HTTP sessions may be only those that are generated by a certain finite state machine.

Comprehensive approaches are those that have been extensivelydeveloped and, in general, incorporate a rich set of features,beyond the proof-of-concept.

Response indicates that HTTP responses are modeled along withHTTP requests. For instance, normal HTTP responses maybe modeled with the average number of <script /> nodesfound in the response body.

Session indicates that the concept of web application session istaken into account. For example, normal HTTP interactionsmay be modeled with the sequences of paths correspondingto a stream of HTTP requests.

Data indicates that the parameters (i.e., GET and POST variables) contained in HTTP requests are modeled. For example, a normal HTTP request may be modeled by the cardinality of string parameters in each request.

Our contributions are included in Table . and detailed in Chapter . Note that, since recent research has experimented with several techniques and algorithms, mixed approaches exist and often lead to better results.

Anomaly-based detectors specifically designed to protect web applications are relatively recent.


Table .: Taxonomy of the selected state-of-the-art approaches for web-based anomaly detection. Our contributions are highlighted. [The bullet matrix could not be recovered from the source; its columns are Adaptive, Stochastic, Deterministic, Comprehensive, Response, Session, and Data, and its rows are the approaches of Cho and Cha; Kruegel et al.; Ingham et al.; Criscione et al.; Song et al.; Maggi et al.; and Robertson et al.]


Such detectors have been first proposed in [Cho and Cha, ], where a system to detect anomalies in web application sessions is described. Like most of the approaches in the literature, this technique assumes that malicious activity expresses itself in the parameters found in HTTP requests. In the case of this tool, such data is parsed from the access logs. Using a Bayesian technique to assign a probability score to the k-sequences (k = 3 in the experiments) of requested resources (e.g., /path/to/page), the system can spot unexpected sessions. Even though this approach has been poorly evaluated, it proposed the basic ideas on which current research is still based.

The first technique to accurately model the normal behavior of web application parameters is described in [Kruegel et al., ]. This approach is implemented in webanomaly, a tool that can be deployed in real-world scenarios. In some sense, webanomaly [Robertson, ] is the adaptation of LibAnomaly models to capture the normal features of the interaction between clients and server-side applications through the HTTP protocol. Instead of modeling a system call and its arguments, the same models are mapped onto resources and their parameters (e.g., ?p=1&show=false). Obviously, the parameters are the focus of the analysis, which employs string length models, token finder models, and so forth; in addition, session features are captured as well, in terms of sequences of resources. In recent versions of the tool, webanomaly incorporated models of HTTP responses similar to those described in [Criscione et al., ] (see Section ..). Besides the features shared with [Kruegel et al., ], the approach models the Document Object Model (DOM) to enhance the detection capabilities against SQL injection and XSS attacks. In Section . an approach that exploits HTTP responses to detect changes and update other anomaly models is described.

The approach described in [Ingham et al., ] learns deterministic models (FSAs) of HTTP requests. The idea of extracting resources and parameters is similar to that described in [Kruegel et al., ]. However, instead of adopting sophisticated models such as HMMs to encode the strings' grammar, this system applies drastic reductions to the parameter values. For instance, dates are mapped to {0, 1}, depending on whether the format of the date is known; filenames are replaced with either a length or the extension, if the file type is known; and so forth. For each request, the output of this step is a list of tokens that represent the states of the FSA. Transitions are labeled with the same tokens by processing them in chronological order (i.e., as they appear in the request). In some sense, this approach can be considered a porting to the web domain of the techniques used to model process behavior by means of FSAs [Wagner and Dean, ; Bhatkar et al., ].

A tool to protect against code-injection attacks has been recently proposed in [Song et al., ]. The approach exploits a mixture of Markov chains to model legitimate payloads at the HTTP layer. The computational complexity of n-grams with large n is solved using Markov chain factorization, making the system algorithmically efficient.


. Relevant Alert Correlation Techniques

A survey of the latest (i.e., proposed between and ) alertcorrelation approaches is provided in this section.

In Table . the selected approaches are marked with bullets to highlight their specific characteristics. Such characteristics are based on our analysis and experience; thus, other classifications may be possible. They are defined as follows:

Formal means that formal methods are used to specify rigorous models used in the correlation process. For instance, the relations among alerts are specified by means of well-defined first-order logic formulae.

Verification means that the success of each alert is taken into account. For instance, all the alerts related to attacks against port 80 are discarded if the target system does not run HTTP services.

Stochastic means that stochastic techniques are exploited to create models. For example, statistical hypothesis tests may be used to decide correlation among streams of alerts.

Comprehensive approaches are those that have been extensivelydeveloped and, in general, incorporate a rich set of features,beyond the proof-of-concept.

Time refers to the use of timestamp information extracted fromalerts. For example, alerts streams may be modeled as timeseries.

Impact refers to techniques that take into account the impact (e.g.,the cost) of handling alerts. For instance, alerts regardingnon-business critical machines are marked low priority.

Clustering refers to the use of clustering (and subsequent classifi-cation) techniques. For instance, similar alerts can be groupedtogether by exploiting custom distance metrics.

Our contributions are included in Table . and are detailed in Chapter . Note that, since recent research has experimented with several techniques and algorithms, mixed approaches exist and often lead to better results.


Table .: Taxonomy of the selected state-of-the-art approaches for alert correlation. Our contributions are highlighted. [The bullet matrix could not be recovered from the source; its columns are Formal, Verification, Stochastic, Comprehensive, Time, Impact, and Clustering, and its rows are the approaches of Debar and Wespi; Valdes and Skinner; Porras et al.; Cuppens and Miage; Morin et al.; Julisch and Dacier; Qin and Lee; Valeur et al.; Kruegel and Robertson; Viinikka et al.; Maggi and Zanero; Frias-Martinez et al.; and Maggi et al.]


A deterministic intrusion detection technique adapted for alert correlation is shown in [Eckmann et al., ]. The use of finite state automata enables complex scenario descriptions, but it requires known scenario signatures. It is also unsuitable for pure anomaly detectors, which cannot differentiate among different types of events. Similar approaches, with similar strengths and shortcomings but different formalisms, have been tried with the specification of pre- and post-conditions of the attacks [Templeton and Levitt, ], sometimes along with time-distance criteria [Ning et al., ]. It is possible to mine scenario rules directly from data, either in a supervised [Dain and Cunningham, ] or unsupervised [Julisch and Dacier, ] fashion.

Statistical techniques have also been proposed; for instance, EMERALD implements an alert correlation engine based on probabilistic distances [Valdes and Skinner, ], which relies on a set of similarity metrics between alerts to fuse "near" alerts together. Unfortunately, its performance depends on an appropriate choice of the weighting parameters.

The best examples of algorithms that do not require such features are based on time-series analysis and modeling. For instance, [Viinikka et al., ] is based on the construction of time series by counting the number of alerts occurring in sampling intervals; the exploitation of trend and periodicity removal algorithms allows predictable components to be filtered out, leaving only real alerts as the output. More than a correlation approach, though, this is a false-positive and noise-suppression approach.

In [Qin and Lee, ] an interesting algorithm for alert correlation, which seems suitable also for anomaly-based alerts, is proposed. Alerts with the same feature set are grouped into collections of time-sorted items belonging to the same "type" (following the concept of type of [Viinikka et al., ]). Subsequently, frequency time series are built using a fixed-size sliding window: the result is a time series for each collection of alerts. The prototype then exploits the Granger Causality Test (GCT) [Thurman and Fisher, ], a statistical hypothesis test capable of discovering causality relationships between two time series when they are originated by linear, stationary processes. The GCT gives a stochastic measure, called the Granger Causality Index (GCI), of how much of the history of one time series (the supposed cause) is needed to "explain" the evolution of the other one (the supposed consequence, or target).


The GCT is based on the estimation of two models: the first is an Auto Regressive (AR) model, in which future samples of the target are modeled as influenced only by past samples of the target itself; the second is an Auto Regressive Moving Average eXogenous (ARMAX) model, which also takes into account the supposed cause time series as an exogenous component. A statistical F-test built upon the model estimation errors selects the best-fitting model: if the ARMAX fits better, the cause effectively influences the target.
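For illustration, the following sketch runs a GCT between two synthetic alert-count series, assuming the statsmodels package is available; grangercausalitytests reports, per lag, an F-test comparing the plain AR model against the model augmented with the supposed cause.

# Sketch of a GCT-based check between two alert-count time series.
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(1)
cause = rng.poisson(2.0, 200).astype(float)          # alerts of type A
target = np.roll(cause, 2) + rng.poisson(0.5, 200)   # type B follows type A

# Column order: [target, supposed cause]; the test asks whether the second
# column helps predict the first.
data = np.column_stack([target, cause])
result = grangercausalitytests(data, maxlag=3, verbose=False)
for lag, (tests, _) in result.items():
    f_stat, p_value = tests["ssr_ftest"][0], tests["ssr_ftest"][1]
    print(f"lag={lag} F={f_stat:.1f} p={p_value:.4f}")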

In [Qin and Lee, ] the unsupervised identification of causal relationships between events is performed by repeating the above procedure for each couple of time series. The advantage of the approach is that it does not require prior knowledge (even if it may use attack probability values, if available, for an optional prioritization phase). However, in Section we show that the GCT fails to recognize "meaningful" relationships between IDEVAL attacks.

Techniques based on the reduction of FPs in anomaly detection systems have also been studied, e.g., in [Frias-Martinez et al., ]. Similar behavioral profiles for individual hosts are grouped together using a k-means clustering algorithm, although the distance metric used was not explicitly defined. Coarse network statistics, such as the average number of hosts contacted per hour, the average number of packets exchanged per hour, and the average length of packets exchanged per hour, are all examples of metrics used to generate behavior profiles. A voting scheme is used to generate alerts, in which alert-triggering events are evaluated against profiles from other members of that cluster. Events that are deemed anomalous by all members generate alerts.

Last, as will be briefly explained in Chapter , the alert correlation task may involve alert verification, i.e., before reporting an alert, a procedure is run to verify whether or not the attack actually left some traces or, in general, had some effect. For example, this may involve checking whether a certain TCP port, say 80, is open; if not, all alerts regarding attacks against the protected HTTP server may be safely discarded, thus reducing the FPR. Although an alert correlation system would benefit from such techniques, we do not review them, since our focus is on the actual correlation phase, i.e., recognizing related events, rather than pruning uninteresting events. The interested reader may refer to a recent work [Bolzoni et al., ] that, besides describing an implementation of a novel verification system, introduces the alert verification problem and provides a comprehensive review of the related work.


. Evaluation Issues and Challenges

The evaluation of IDSs is per se an open, and very difficult to solve, research problem. Besides two attempts at proposing a rigorous methodology for IDS evaluation [Puketza et al., , ], there are no standard guidelines to perform this task. This problem is magnified by the lack of a reliable source of test data, which is a well-known issue. Building a dataset to conduct repeatable experiments is clearly a very difficult task, because it is nearly impossible to reproduce the activity of a real-world computer infrastructure. In particular:

• For privacy reasons, researchers cannot audit an infrastructure and collect arbitrary data; in the best case, the use of obfuscation and anonymization of the payload (e.g., character substitution on text-based application protocols) to protect privacy produces unrealistic content that makes the inspection techniques adopted by many IDSs fail. For instance, changes in the syntax and lexicon of strings have negative consequences for models such as the character distribution estimator described in Example ...

• System activity collected from real-world infrastructures inevitably contains intrusions, and not all of them are known in advance (i.e., 0-day attacks); this makes the generation of a truth file a very difficult task. To work around this problem, it is a common practice to filter out known attacks using misuse-based systems such as Snort and use the alert log as the truth file. Unfortunately, this implicitly assumes that the chosen misuse-based system is the baseline of the evaluation (i.e., the best tool). On the other hand, anomaly-based systems are supposed to detect unknown attacks; thus, using the alert log as the truth file makes the evaluation of such systems completely meaningless.

• In the past, two attempts have been made to simulate the user activity of a comprehensive computer infrastructure, i.e., a military computer network, to collect clean, background audit data to train IDSs based on unsupervised learning techniques. Even though this ensures that the traffic contains no intrusions, the approach has been shown to have at least two types of shortcomings. The first is described in Section ... The second problem is that the attacks included (labeled in advance) once again represent only known attacks, since the exploits are taken from public repositories. Thus, IDSs cannot be tested against realistic intrusions such as custom attacks against proprietary and never-exploited-before systems.

Attempts to partially solve these issues can be divided into two groups. Some approaches propose automated mechanisms for generating testing datasets, while others concentrate on the methodologies used to perform the evaluation task. Notably, in [Cretu et al., b] a traffic sanitization procedure is proposed in order to clean the background traffic; however, it is not clear to what extent this method is substantially different from running an arbitrary anomaly-based IDS to filter out attacks. Examples of publicly available, but obsolete or unreliable, datasets are [Lippmann et al., , ; Potter, ; Swartz, ; S. and D., ]: as exemplified by the critique described in the next section, however, all these datasets are either unusable because of their lack of a truth file or extremely biased with respect to the real world. Among the methodological approaches, [Vigna et al., ] proposes a tool to automatically test the effectiveness of evasion techniques based on mutations of the attacks. Instead, [Lee and Xiang, ] proposes alternative metrics to evaluate the detection capabilities, while [Sharif et al., ] focuses on the evaluation issues in the case of host-based IDSs and defines some criteria to correctly calculate their accuracy.

Recently, probably inspired by [Potter, ], the research community is moving toward mechanisms to instrument large computer security exercises [Augustine et al., ] and contests such as the DEFCON Capture The Flag (CTF), with the goal of collecting datasets to test IDSs. The most up-to-date example is [Sangster et al., ], which describes the efforts made to collect the public dataset available at [Sangster, ]. This dataset has the advantage of containing a rich variety of attacks, including 0-days and custom, unpublished exploits. However, because of the aforementioned reasons, this approach fails once again in the labeling phase, since Snort

Details available at www.defcon.org


is used to generate the truth file. A side question is to what extent the background traffic represents real activity on the Internet. In fact, since the competition was run in the controlled environment of a private network, the users tend to behave in a different way, not to mention that most of the participants were experienced computer users.

.. Regularities in audit data of IDEVAL

IDEVAL is basically the only dataset of this kind which is freely available along with truth files; in particular, we used the dataset [Lippmann et al., ]. These data are artificially generated and contain both network and host auditing data. A common question is how realistic these data are. Many authors have already analyzed the network data of the dataset, finding many shortcomings [McHugh, ; Mahoney and Chan, a]. Our own analysis of the host-based auditing data, published in [Maggi et al., a], revealed that this part of the dataset is all but immune from problems. The first problem is that in the training datasets there are too few execution instances for each piece of software to properly model its behavior, as can be seen in Table .. Of the (few) programs present, for two (fdformat and eject) only a handful of executions is available, making training unrealistically simple.

The number of system calls used is also extremely limited, making execution flows very plain. Additionally, most of these executions are similar, not covering the full range of possible execution paths of the programs (thus causing overfitting of any anomaly model). For instance, in Figure . we have plotted the frequency of the length (in system calls) of the various executions of telnetd in the training data. The natural clustering of the data into a few groups clearly shows how the executions of the program are sequentially generated with some script, and suffer from a lack of generality.

System call arguments show the same lack of variability: in the whole training dataset, all the arguments of the system calls related to telnetd belong to the following set:

fork, .so.1, utmp, wtmp, initpipe, exec, netconfig, service door, :zero, logindmux, pts

The application layer contains many flaws, too. For instance, the FTP operations (sessions on the whole) use a very limited subset of files (on average per session), and are always performed by the same users on the same files, due to a limitation of the synthetic generator of these operations.


Figure .: telnetd: distribution of the number of other system calls between two execve system calls (i.e., distance between two consecutive execve). [Histogram not reproducible from the source; the x axis reports the distance in syscalls (roughly 25 to 70) and the y axis the number of occurrences (0 to 700).]

In addition, during training, no uploads or idle sessions were performed. Command executions are also highly predictable: for instance, one script always executes a cycle composed of cat, mail, mail again, and at times lynx, sometimes repeated twice. The same happens (but in a random order) for rm, sh, ps and ls. In addition, a number of processes have evidently crafted names (e.g., logout is sometimes renamed lockout or log0ut); the same thing happens with path names, which are sometimes different (e.g., /usr/bin/lynx or /opt/local/bin/lynx), but an analysis shows that they are the same programs (perhaps symbolic links generated to create noise over the data). The combination of the two creates interesting results such as /etc/loKout or /opt/local/bin/l0gout. In a number of cases, the processes lynx, mail and q have duplicate executions with identical PIDs and timestamps, and with different paths and/or different arguments; this is evidently an inexplicable flaw of the dataset. We also found many program executions to be curiously meaningless. In fact, the BSM traces of some processes contain just execve calls, and this happens for of the programs in the testing portion of the dataset (especially for those with a crafted name, like loKout). It is obvious that testing a host-based IDS with one-syscall-long sequences does not make a lot of sense, let alone training against such sequences.

An additional problem is that since , when this dataset was created, everything has changed: the usage of network protocols, the protocols themselves, and the operating systems and applications used. For instance, all the machines involved are Solaris version .. hosts, which are evidently ancient nowadays. The attacks are similarly outdated: the only attack technique used is the buffer overflow, and all the instances are detectable in the execve system call arguments. Nowadays, attackers and attack types are much more complex than this, operating at various layers of the network and application stack, with a wide range of techniques and scenarios that were just not imaginable in .

To give an idea of this, we were able to create a detector which finds all the buffer overflow attacks without any FP: a simple script which flags as anomalous any argument longer than characters can do this (because all the overflows occur in the parsing of the command line, which is part of the parameters of the execve system call that originates the process). This is obviously unrealistic.
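In sketch form (the threshold below is hypothetical, since the exact value is elided above):

# The trivial length-based detector described above, as a sketch.
THRESHOLD = 128   # assumed maximum legitimate argument length

def flags_overflow(execve_args):
    return any(len(arg) > THRESHOLD for arg in execve_args)

print(flags_overflow(["telnetd", "A" * 1024]))   # True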

Because of the aforementioned lack of alternatives, most existing research uses IDEVAL. This is a crucial factor: any bias or error in the dataset has influenced, and will influence in the future, the very basic research on this topic.

.. The base-rate fallacy

ID is a classification task; thus, it can be formalized as a Bayesian classification problem. In particular, FPR and DR can be defined as probabilities. More precisely, let us define the following event predicates:

• O =“Intrusion”, ¬O =“Non-intrusion”;

• A =“Alert reported”, ¬A =“No alert reported”.

Then, we can re-write DR and FPR as follows.

• DR = P(A | O), i.e., the probability of classifying an intrusive event as an actual intrusion.


• FPR = P(A | ¬O), i.e., the probability of classifying a legitimate event as an intrusion.

Given the above definitions and Bayes' theorem, two measures that are more interesting than DR and FPR can be calculated.

Definition .. (Bayesian DR) The Bayesian DR [Axelsson, b] is defined as:

P(O | A) = P(O) · P(A | O) / [P(O) · P(A | O) + P(¬O) · P(A | ¬O)]

P(O) is the base-rate of intrusions, i.e., the probability for an intrusion to take place, regardless of the presence of an IDS. The Bayesian DR not only quantifies the probability for an alert to indicate an intrusion; it also takes into account how frequently an intrusion really happens. As pointed out by [Axelsson, b], in this equation the FPR = P(A | ¬O) is strictly dominant, because it is weighted by P(¬O). In fact, in the real world, P(O) is very low (e.g., 10⁻⁵, given two intrusions per day, audit records per day and records per intrusion) and thus P(¬O) → 1. This phenomenon, called the base-rate fallacy, magnifies the presence of FPs. In fact, even in the unrealistic case of DR = 1, a very low FPR = 10⁻⁵ quickly drops the Bayesian DR to 0.0066, which is three orders of magnitude below.

Besides its impact on the evaluation of an IDS, the base-rate fallacy has a practical impact: when inspecting the alert log, the security officer will tend to ignore most of the alerts, because past alarms have been shown to be incorrect.
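A quick numeric illustration of the definition (the base rate and FPR values below are only examples):

# Worked base-rate computation following the definition above.
def bayesian_dr(p_o, dr, fpr):
    # P(O|A) = P(O)P(A|O) / (P(O)P(A|O) + P(notO)P(A|notO))
    return (p_o * dr) / (p_o * dr + (1 - p_o) * fpr)

# Even a perfect detector (DR = 1) with a tiny FPR yields alerts that are
# rarely true intrusions when intrusions themselves are rare.
print(bayesian_dr(p_o=1e-5, dr=1.0, fpr=1e-5))   # ~0.5
print(bayesian_dr(p_o=1e-5, dr=1.0, fpr=1e-3))   # ~0.01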


. Concluding Remarks

In this chapter we first introduced the basic concepts and definitions of ID, including the most relevant issues that arise during experimental evaluation. ID techniques play a fundamental role in recognizing malicious activity in today's scenario. In particular, learning-based anomaly detection techniques have been shown to be particularly interesting since, basically, they implement a black-box IDS which is easy to deploy and requires no or scarce maintenance efforts. These systems, however, are not without their drawbacks. From the industrial point of view, the amount of FPs generated by anomaly-based systems is not negligible; further exacerbating this problem is the base-rate fallacy summarized in Section ..: given the rarity of real attacks, even a minuscule FPR is magnified and instantly becomes a cost for the organization. From the research point of view, the lack of a well-established methodology to evaluate IDSs is an issue; in addition, the generation of a reliable testing dataset is an open research problem. Today's evaluation is limited to two major datasets: IDEVAL, which, besides being deprecated, contains several regularities that make the evaluation extremely biased, and the modern Cyber Defense eXercise (CDX) labeled dataset collected during a security competition. The latter is clearly better than IDEVAL, if one ignores the fact that the labeling phase consists in running Snort, which inherently assumes Snort as the evaluation baseline of every IDS.

Secondly, focusing on anomaly-based techniques, we reviewed the most recent state-of-the-art IDSs to protect a host application and a web server. We also overviewed a few approaches to capture and analyze the network traffic as a stream of packets. This topic is included in the reviewed literature because the contributions of Chapter  work on alerts generated by host- and network-based systems. Network-based IDSs are popular as they are easy to deploy and can protect a wide range of machines (i.e., the whole network); however, it has been shown how these systems can be evaded by exploiting the lack of local knowledge on the single host. The need for both global and local knowledge about a network is probably the main motivation in favor of alert correlation systems. The most advanced network-based anomaly detectors inspect packets up to the TCP layer and exploit payload clustering and classification techniques to recognize anomalous traffic.

On the other hand, host-based systems have been shown to be effective at detecting malicious processes on a single computer. Almost all the reviewed approaches analyze the system calls intercepted in the operating system kernel. Some tools known to work well for network-based systems have been utilized in this field as well: for instance, the clustering and Bayesian classification techniques originally applied to network packets have been adapted by several approaches to, respectively, cluster similar system calls and flag processes with low Bayesian probability as malicious. Stochastic and deterministic techniques have been shown to be very effective. In particular, as we detail in Section ., deterministic models such as automata are well suited to capture a process' control flow. On the other hand, stochastic techniques such as character distribution or Gaussian intervals have been shown to correctly model the data flow (e.g., the content of system call arguments) with few FPs.

Web-based ID approaches have been overviewed as well. Although this topic is relatively new, web-based approaches are enjoying immense popularity. In fact, the tremendous ubiquity of the Web has become a high-profit opportunity for the underground criminals to spread malware by violating vulnerable, popular websites. The research community has immediately recognized the relevance of this problem. As a consequence, the industry started to adopt web-based protection tools that have become remarkably accurate and, basically, ready to be deployed in real-world environments.

Finally, by comparing Tables ., ., and . vs. ., it can be immediately noticed how new this problem is. In fact, during the last decade a common line for network-based techniques can be traced (i.e., exploiting payload classification). The same holds for both host-based (i.e., the use of hybrid deterministic/stochastic techniques on system call sequences and arguments) and web-based (i.e., the use of ensembles of stochastic models on HTTP parameters) techniques, but not for alert correlation approaches. They indeed explore the use of different techniques, but no well-established ideas can be recognized, yet.

Host-based Anomaly Detection

A host-based anomaly detector builds profiles of the system ac-tivity by observing data collected on a single host (e.g., a personalcomputer, a database server or a web server). By adopting learningtechniques, a host-based IDS can automatically capture the normalbehavior of the host and flag significant deviations.

In this chapter we describe in detail two contributions we proposed to mitigate malicious activity in a POSIX-like operating system. In both these works, our tools analyze the activity at the granularity of the system call (e.g., open, read, chmod) and monitor each running process in parallel using multi-threaded scheduling. Our approaches integrate the most advanced techniques that, according to the recent literature, have been shown to be effective. We also propose novel techniques, such as the clustering of system calls (to classify their invocations) and Markov models (to capture each process' behavior), along with refinements of both to further lower the FPR. All the systems described in this chapter have been evaluated using both IDEVAL and a more realistic dataset that also includes new attacks (details on the generation of such a dataset are described). In addition, we describe our tests on a dataset we created through the implementation of an innovative anti-forensics technique, and we show that our approach yields promising results in terms of detection. The goal of this extensive set of experiments is to use our prototype to circumvent definitive anti-forensics tools. Basically, we demonstrate how our tool can detect stealthy in-memory injections of executable code, and in-memory execution of binaries (the so-called "userland exec" technique, which we re-implement in a reliable way).


. Preliminaries

In order to avoid the shortcomings mentioned in Section ., besides using IDEVAL for comparison purposes with SyscallAnomaly (which was tested on that dataset), we generated an additional experimental dataset for other frequently used console applications. We chose different buffer overflow exploits that allow the execution of arbitrary code.

More precisely, in the case of mcweject 0.9, the vulnerability [National Vulnerability Database, a] is a very simple stack overflow, caused by improper bounds checking. By passing a long argument on the command line, an aggressor can execute arbitrary code on the system with root privileges. There is a public exploit for the vulnerability [Harry, ], which we modified slightly to suit our purposes and execute our own payload. The attack against bsdtar is based on a publicly disclosed vulnerability in the PAX handling functions of libarchive 2.2.3 and earlier [National Vulnerability Database, b], where a function in the file archive_read_support_format_tar.c does not properly compute the length of a buffer when processing a malformed PAX archive extension header (i.e., it does not check the length of the header as stored in a header field), resulting in a heap overflow which allows code injection through the creation of a malformed PAX archive which is subsequently extracted by an unsuspecting user on the target machine. In this case, we developed our own exploit, as none was available online, probably due to the fact that this is a heap overflow and requires a slightly more sophisticated exploitation vector. In particular, the heap overflow allows overwriting a pointer to a structure which contains a pointer to a function that is called soon after the overflow. So, our exploit overwrites this pointer, redirecting it to the injected buffer. In the buffer we craft a clone of the structure, which contains a pointer to the shellcode in place of the correct function pointer.

Our testing platform runs a vanilla installation of FreeBSD . on an x86 machine; the kernel has been recompiled enabling the appropriate auditing modules. Since our systems, and other host-based anomaly detectors [Kruegel et al., a; Mutz et al., ], accept input in the BSM format, the OpenBSM [Watson and Salamon, ] auditing tool collection has been used for collecting audit trails. We have audited vulnerable releases of eject and bsdtar, namely mcweject 0.9 (which is an alternative to the BSD eject) and the version of bsdtar which is distributed with FreeBSD .. During the generation process, the audit trails keep changing, along with the simulated user behavior. It is important to underline that normal users would never use really random names for their files and directories: they usually prefer to use words from their own language, plus a limited set of characters (e.g., ".", "-") for concatenating them. Therefore, we rely on a large dictionary of words for generating file names.

The eject executable has a small set of command line options and a very plain execution flow. For the simulation of a legitimate user, we simply chose different permutations of flags and different devices. For this executable, we manually generated the executions, which are remarkably similar (as expected).

Creating a dataset of normal activity for the bsdtar program is more challenging. It has a large set of command line options, and in general it is more complex than eject. While the latter is generally called with an argument of /dev/*, the former can be invoked with any argument string; for instance, bsdtar cf myarchive.tar /first/path /second/random/path is a perfectly legitimate command line. Using a procedure similar to the one used for creating the IDEVAL dataset, and in fact used also in [Sekar et al., ], we prepared a shell script which embeds pseudo-random behaviors of an average user who creates or extracts archives. To avoid the regularities found in IDEVAL, to simulate user activity the script randomly creates random-sized, random-content files inside a snapshot of a real-world desktop file system. In the case of the simulation of super-user executions, these files are scattered around the system; in the case of a regular user, they are placed into that user's own home directory. Once the file system has been populated, the tool randomly walks through the directory tree and randomly creates TAR archives. Similarly, found archives are randomly expanded. The randomization also takes into account the different use of flags made by users: for instance, some users prefer to uncompress an archive using tar xf archive.tar, many others still use the dash, tar -xf archive.tar, and so on.
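For illustration, the following Python sketch mimics the kind of generator just described (the actual tool is a shell script); the word-list path, file counts, and flag choices are our own assumptions:

import os
import random
import subprocess

# Hypothetical re-implementation of the workload generator described
# above (the original is a shell script); paths, sizes, and flag
# choices are illustrative assumptions.
WORDS = open("/usr/share/dict/words").read().split()

def random_name(max_words=3):
    # Users concatenate real words with a few separators, rather than
    # using random strings.
    sep = random.choice([".", "-"])
    return sep.join(random.sample(WORDS, random.randint(1, max_words)))

def populate(root, n_files=50):
    # Scatter random-sized, random-content files under `root`.
    os.makedirs(root, exist_ok=True)
    for _ in range(n_files):
        with open(os.path.join(root, random_name()), "wb") as f:
            f.write(os.urandom(random.randint(1, 65536)))

def random_tar_session(root):
    # Randomly create and extract archives, varying the flag style
    # (e.g., "cf" vs. "-cf") as different users would.
    archive = os.path.join("/tmp", random_name() + ".tar")
    subprocess.call(["bsdtar", random.choice(["cf", "-cf"]), archive, root])
    subprocess.call(["bsdtar", random.choice(["xf", "-xf"]), archive])

if __name__ == "__main__":
    populate("/tmp/workload")
    random_tar_session("/tmp/workload")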

In addition to the aforementioned datasets, we used attacks against sing, mt-daapd, proftpd, sudo, and BitchX. To generate clean training data we followed randomization and scripting mechanisms similar to those described previously.


Specifically, sing is affected by CVE--, a vulnerability which allows writing arbitrary text to arbitrary files by exploiting a combination of parameters. This attack is meaningful because it does not alter the control flow, but just the data flow, with an open which writes on unusual files. The training datasets contain traces of regular usage of the program invoked with large sets of command line options.

mt-daapd is affected by a format string vulnerability (CVE--) in ws_addarg(). It allows remote execution of arbitrary code by including format specifiers in the username or password portion of the base64-encoded data in the Authorization: Basic HTTP header sent to /xml-rpc. The mod_ctrls module of proftpd lets local attackers fully control the integer regarglen (CVE--) and exploit a stack overflow to gain root privileges.

sudo does not properly sanitize data supplied through the SHELLOPTS and PS4 environment variables, which are passed on to the invoked program (CVE--). This leads to the execution of arbitrary commands as a privileged user, and it can be exploited by users who have been granted limited superuser privileges. The training set includes a number of executions of programs commonly run through sudo (e.g., passwd, adduser, editing of /etc/ files) by various users with different, limited superuser privileges, along with benign traces similar to the attacks, invoked using several permutations of option flags.

BitchX is affected by CVE--, which allows a remote attacker to execute arbitrary commands by overfilling a hash table and injecting an EXEC hook function which receives and executes shell commands. Moreover, failed exploit attempts can cause a DoS. The training set includes several Internet Relay Chat (IRC) client sessions and a legitimate IRC session to a server having the same address as the malicious one.

Note .. First, it is important to underline that the scripts that we prepared to set up the experiments are only meant to generate the system calls that result when a particular executable is stimulated with different command line options and different inputs. By no means do we claim that such scripts can emulate a user's overall activity. Although the generation of large datasets for IDS evaluation goes beyond the scope of our work, these scripts attempt to reflect the way a regular user invokes tar or eject; this was possible because they are both simple programs which require a limited number of command line options. Clearly, generating such a dataset for more sophisticated applications (e.g., a browser, a highly-interactive graphic tool) would be much more difficult.

Secondly, we recall that these scripts have been set up only for collecting clean data. Such data collection is not needed by our system when running, since the user data would already be available.

In [Bhatkar et al., ], real web and SSH server logs were used for testing. While this approach yields interesting results, we did not follow it, for three reasons. Firstly, in our country various legal concerns limit what can be logged on real-world servers. Secondly, HTTP and SSH servers are complex programs for which understanding what is correctly identified and what is not would be difficult (as opposed to simply counting correct and false alerts). Finally, such a dataset would not be reliable because of the possible presence of real attacks inside the collected logs (in addition to the attacks inserted manually for testing).


. Malicious System Calls Detection

In this section we describe our contribution regarding anomaly detection of host-based attacks through the unsupervised analysis of system call arguments and sequences [Maggi et al., a]. Analyzing both the theoretical foundations described in [Kruegel et al., a; Mutz et al., ] and the results of our tests, we proposed an alternative system, which improves some of the ideas of SyscallAnomaly (the IDS developed on top of LibAnomaly) with clustering, Markov-based modeling, and behavior identification. The approach is implemented in a prototype called Syscall Sequence Arguments Anomaly Detection Engine (SADE), written in American National Standards Institute (ANSI) C. A set of anomaly detection models for the individual parameters of the calls is defined. Then, a clustering process, which helps to better fit models to system call arguments and creates inter-relations among different arguments of a system call, is defined by means of ad-hoc distance metrics. Finally, we exploit Markov models to encode the monitored process' normal behavior. The resulting system needs no prior knowledge input; it has a good FPR, and it is also able to correctly contextualize alarms, giving the security officer more information to understand whether a TP or FP happened, and to detect variations over the entire execution flow, as opposed to punctual variations over individual instances.

SADE uses the sequences as well as the parameters of the system calls executed by a process to identify anomalous behaviors. As detailed in Section .., the use of system calls as anomaly indicators is well established in the literature (e.g., in [Forrest et al., ; Cabrera et al., ; Hofmeyr et al., ; Somayaji and Forrest, ; Cohen, ; Lee and Stolfo, ; Ourston et al., ; Jha et al., ; Michael and Ghosh, ; Sekar et al., ; Wagner and Dean, ; Giffin et al., ; Yeung and Ding, ]), usually without handling their parameters (with the notable exceptions of [Kruegel et al., a; Tandon and Chan, ; Bhatkar et al., ]). SADE improves on the existing tools by means of four key novel contributions:

• we build and carefully test anomaly detection models for system call parameters, similarly to [Kruegel et al., a];

• we introduce the concept of clustering arguments in order to automatically infer different ways of using the same system call; this leads to more precise models of normality on the arguments;

• the same concept of clustering also creates correlations among the different parameters of the same system call, which is not present in any form in [Kruegel et al., a; Tandon and Chan, ; Bhatkar et al., ];

• the traditional detector based on deviations from previously learned Markov models is complemented with the concept of clustering; the sequence of system calls is transformed into a sequence of labels (i.e., classes of calls): this is conceptually different from what has been done in other works (such as [Tandon and Chan, ]), where sequences of events and single events by themselves are both taken into account, but in an orthogonal way.

The resulting model is also able to correctly contextualize alarms, providing the user with more information to understand what caused any FP, and to detect variations over the execution flow, as opposed to variations over a single system call. We also discuss in depth how we performed the implementation and the evaluation of the prototype, trying to identify and overcome the pitfalls associated with the usage of the IDEVAL dataset.

.. Analysis of SyscallAnomaly

In order to replicate the original tests of SyscallAnomaly, we used the host-based auditing data in BSM format contained in the IDEVAL dataset. For now, it is sufficient to note that we used the BSM audit logs from the system named pascal.eyrie.af.mil, which runs a Solaris .. operating system. Before describing the dataset used, we briefly summarize the models used by SyscallAnomaly: the string length model, the character distribution model, the structural inference model, and the token search model.

string length model: computes, from the strings seen in the training phase, the sample mean µ and variance σ² of their lengths. In the detection phase, given the length l of the observed string, the likelihood p of the input string length with respect to the values observed in training is

p = \begin{cases} 1 & \text{if } l < \mu + \sigma \\ \frac{\sigma^2}{(l - \mu)^2} & \text{otherwise.} \end{cases}
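This model is simple enough to state in a few lines of code; a minimal Python sketch (assuming a non-degenerate training variance; class and method names are ours) follows:

import statistics

class StringLengthModel:
    # String length model: learn the mean/variance of the training
    # lengths, then score new strings with the likelihood above.
    def train(self, strings):
        lengths = [len(s) for s in strings]
        self.mu = statistics.mean(lengths)
        self.var = statistics.pvariance(lengths)  # sigma^2

    def likelihood(self, s):
        l = len(s)
        if l < self.mu + self.var ** 0.5:  # l < mu + sigma
            return 1.0
        return self.var / (l - self.mu) ** 2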

character distribution model: analyzes the discrete probability distribution of characters in a string. Each string is considered as a set of characters, which are inserted into a histogram, in decreasing order of occurrence, with a classical rank order/frequency representation. For detection, a Pearson χ² test returns the likelihood that the observed string histogram comes from the learned model.

structural inference model: learns the structure of strings, which are first simplified using the following translation rules: [A-Z] → A, [a-z] → a, [0-9] → 0. Finally, multiple occurrences of the same character are compressed. For instance, /usr/lib/libc.so is translated into /aaa/aaa/aaaa.aa, and further compressed into /a/a/a.a. A probabilistic grammar encoded in a Hidden Markov Model (HMM) is generated by exploiting a Bayesian merging procedure, as described in [Stolcke and Omohundro, b, c, b].
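The translation step can be expressed compactly; a sketch of just this step (the subsequent HMM grammar induction is omitted, and the function name is ours):

import re

def simplify(s):
    # Translation rules of the structural inference model: map
    # character classes to A/a/0, then compress runs of identical
    # characters.
    s = re.sub(r"[A-Z]", "A", s)
    s = re.sub(r"[a-z]", "a", s)
    s = re.sub(r"[0-9]", "0", s)
    return re.sub(r"(.)\1+", r"\1", s)

assert simplify("/usr/lib/libc.so") == "/a/a/a.a"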

token search model: applied to arguments which contain flags or modes. This model uses a statistical test to determine whether or not an argument contains values drawn from a finite set. The core idea (drawn from [Lee et al., ]) is that if the set is finite, then the number of different arguments in the training set will grow much more slowly than the total number of samples. This is tested using a Kolmogorov-Smirnov non-parametric test. If the field contains a set of tokens, the set of values observed during training is stored. During detection, if the field has been flagged as a token, the input is compared against the stored value list: if it matches a former input, the model returns 1, else it returns 0, without regard to the relative frequency of the tokens in the training data.

A confidence rating is computed during training for each model, by determining how well it fits its training set; this value is meant to be used at runtime to provide additional information on the reliability of the model.

Returning to the analysis, we used a dataset that contains buffer overflow attacks against different applications: eject, fdformat, ps, and ffbconfig (not tested). We used data from weeks 1 and 3 for training, and data from weeks 4 and 5 for testing the detection phase. However, it must be noted that some attacks are not directly detectable through system call analysis. The most interesting attacks for testing SyscallAnomaly are the ones in which an attacker exploits a vulnerability in a local or remote service to allow an intruder to obtain or escalate privileges.

Program      SyscallAnomaly   SADE
fdformat                       ()
eject                          ()
ps                             ()
ftpd                           ()
telnetd                        ()
sendmail                       ()

Table .: Comparison of SyscallAnomaly (results taken from [Kruegel et al., a]) and SADE in terms of the number of FPs on the IDEVAL dataset. For SADE, the number of system calls flagged as anomalous is reported between brackets.

In addition to the programs named above, we also ran SyscallAnomaly on three other programs, namely ftpd, sendmail, and telnetd, which are known not to be subject to any attack in the dataset, in order to better evaluate the FPR of the system. In Table . we compare our results with those of the released version of SyscallAnomaly [Kruegel et al., ], used to reproduce the results reported in [Kruegel et al., a].

As can be seen, our results differ from those reported in [Kruegel et al., a], but the discrepancy can be explained by a number of factors:

• the version of SyscallAnomaly and LibAnomaly available online could be different from, or older than, the one used for the published tests;

• several parameters can be tuned in SyscallAnomaly, and a dif-ferent tuning could produce different results;


• part of the data in the IDEVAL dataset under consideration is corrupted or malformed;

• in [Kruegel et al., a] it is unclear whether the number of FPs is based on the number of executions erroneously flagged as anomalous, or on the number of anomalous syscalls detected.

These discrepancies make a direct comparison difficult, but our numbers confirm that SyscallAnomaly performs well overall as a detector. However, the FPs and detected anomalies are interesting to study, in order to better understand how and where SyscallAnomaly fails, and to improve it. Therefore, we analyzed a number of executions in depth. Just to give a brief example, let us consider eject: it is a very plain, short program, used to eject removable media on UNIX-like systems. It has a very simple and predictable execution flow, and thus it is straightforward to characterize: dynamic libraries are loaded, the device vol@0:volctl is accessed, and finally, the device unnamed_floppy is accessed.

The dataset contains only one kind of attack against eject: a buffer overflow with command execution (see Table .). The exploit is evident in the execve system call, since the buffer overflow is exploited from the command line. Many of the models in SyscallAnomaly are able to detect this problem: the character distribution model, for instance, performs quite well. The anomaly value turns out to be ., much higher than the threshold (.). The string length and structural inference models flag this anomaly as well, but interestingly they are mostly ignored, since their confidence value is too low. The confidence value for the token model is 0, which in the SyscallAnomaly convention means that the field is not recognized as a token. This is actually a shortcoming of the association of models with parameters in SyscallAnomaly, because the "filename" argument is not really a token.

An FP happens when a removable unit, unseen during training, is opened (see Table .). The structural inference model is the culprit of the false alert, since the name structure is different from the previous one due to the presence of an underscore. As we will see later on, the extreme brittleness of its transformation and simplification rules is the main weakness of the Structural Inference model.

True positive (execve)
file    /usr/bin/eject
argv    eject\0x20\0x20\0x20\0x20[...]
Models                              Prob. (Conf.)
file    Token Search                . ()
        String Length               10^−6 ()
argv    Character Distribution      . (.)
        Structural Inference        10^−6 (.)
Total Score (Threshold)             . (.)

False positive (open)
file    /vol/dev/rdiskette0/c0t6d0/volume_1
flags   crw-rw-rw-
Models                              Prob. (Conf.)
path    String Length               . (.)
        Character Distribution      . (.)
        Structural Inference        10^−6 ()
        Token Search                . ()
Total Score (Threshold)             . (.)

Table .: A true positive and an FP on eject.

Another alert happens at the opening of a localization file (Table .), which triggers the string length model and creates an anomalous distribution of characters; moreover, the presence of numbers, underscores, and capitals creates a structure that is flagged as anomalous by the structural inference model. The anomaly in the token search model is due to the fact that the open mode (-r-xr-xr-x) is not present in any of the training files. This is not an attack, but a consequence of the buffer overflow attack, and as such it is counted as a true positive. However, it is more likely to be a lucky, random side effect.

True positive (open)
path    /usr/lib/locale/iso_8859_1/[...]
flags   -r-xr-xr-x
Models                              Prob. (Conf.)
path    String Length               . (.)
        Character Distribution      . (.)
        Structural Inference        10^−6 (.)
        Token Search                10^−6 ()
Total Score (Threshold)             . (.)

Table .: True positive on fdformat while opening a localization file.

Without getting into similar details for all the other programs we analyzed (details which can be found in Section ..), let us summarize our findings. ps is a jack-of-all-trades program for monitoring process execution, and as such it is much more articulated in its options and execution flow than any of the previously analyzed executables. However, the sequence of system calls does not vary dramatically depending on the user-specified options: besides library loading, the program opens /tmp/ps_data and the files containing process information in /proc. Also in this case, the attacks are buffer overflows on a command-line parameter. In this case, as was the case for fdformat, a correlated event is also detected: the opening of the file /tmp/foo instead of the file /tmp/ps_data. Both the Token Search model and the Structural Inference model flag an anomaly, because the opening mode was not seen before, and because the presence of an underscore in /tmp/ps_data makes it structurally different from /tmp/foo. However, if we modify the exploit to use /tmp/foo_data, the Structural Inference model goes quiet. An FP happens when ps is executed with the options lux, because the structural inference model finds this usage of parameters very different from -lux (with a dash), and therefore strongly believes this to be an attack. Another FP happens when a zone file is opened, because during training no files in zoneinfo were opened. In this case it is very evident that the detection of the opening of the /tmp/foo file is more of another random side effect than a detection, and in fact the model which correctly identifies it also creates FPs for many other instances. In the case of in.ftpd, a common FTP server, a variety of commands could be expected. However, because of the shortcomings of the IDEVAL dataset detailed in Section .., the system call flow is fairly regular. After access to libraries and configuration files, the logon events are recorded into system log files, and a vfork call is then executed to create a child process to actually serve the client requests. In this case, the FPs mostly happen because of the opening of files never accessed during training, or opened with "unusual modes".

sendmail is a really complex program, with execution flows that include opening libraries and configuration files, accessing the mail queue (/var/spool/mqueue), transmitting data through the network, and/or saving mails on disk. Temporary files are used, and the setuid call is also used, with an argument set to the recipient of the message (for delivery to local users).

An FP happens, for instance, when sendmail uses a certain UID to deliver a message: in training that particular UID was not used, so the model flags it as anomalous. Since this can happen in the normal behavior of the system, it is evidently a generic problem with the modeling of UIDs as it is done in LibAnomaly. Operations in /var/mail are flagged as anomalous because the file names are similar to /var/mail/emonca000Sh, and thus the alternation of lower and upper case characters and numbers easily triggers the Structural Inference model.

We have outlined different cases of failure of SyscallAnomaly. But what are the underlying reasons for these failures? The structural inference model is probably the weakest overall. It is too sensitive to non-alphanumeric characters, since they are neither altered nor compressed: therefore, it reacts strongly to slight modifications that involve these characters. This becomes visible when libraries with variable names are opened, as is evident in the FPs generated on the ps program. On the other hand, the compressions and simplifications introduced are excessive, and cancel out any interesting feature: for instance, the strings /tmp/tempfilename and /etc/shadow are indistinguishable by the model. Another very surprising thing, as we already noticed, is the choice of ignoring the probability values in the Markov model, turning it into a binary value (0 if the string cannot be generated, 1 otherwise). This assumes an excessive weight in the total probability value, easily causing a false alarm.

To verify our intuition, we re-ran the tests excluding the Structural Inference model: the DR is unchanged, while the FPR strongly diminishes, as shown in Table . (once again, both in terms of the global number of alerts and of flagged system calls). Therefore, the Structural Inference model is not contributing to detection; instead, it is just causing a growth in the anomaly scores which could lead to an increased number of false positives. The case of telnetd is particularly striking: excluding the Structural Inference model makes all the false positives disappear.

Executable   FPs with the HMM   FPs without the HMM
fdformat     ()                 ()
eject        ()                 ()
ps           ()                 ()
ftpd         ()                 ()
telnetd      ()                 ()
sendmail     ()                 ()

Table .: Behavior of SyscallAnomaly with and without the Structural Inference model.

The character distribution model is much more reliable, and contributes positively to detection. However, it is not accurate about the particular distribution of each character, and this can lead to possible mimicry attacks. For instance, executing ps -[x) has a very high probability, because it is indistinguishable from the usual form of the command, ps -axu.

The token search model has various flaws. First of all, it is not probabilistic, as it does not consider the relative probability of the different values. Therefore, a frequently recurring token is considered just as likely as one with a single occurrence in the whole training set. This makes the training phase not robust against the presence of outliers or attacks in the training dataset. Additionally, since the model is applied only to fields that have been determined beforehand to contain a token, the statistical test is not useful: in fact, in all our experiments, it never had a single negative result. It is also noteworthy that the actual implementation of this test in [Kruegel et al., ] differs from what is documented in [Kruegel et al., a; Mutz et al., ; Lee et al., ].

Finally, the string length model works very well, even if this is in part due to the artifacts in the dataset, as we describe in Section ...


.. Improving SyscallAnomaly

We can identify and propose three key improvements over SyscallAnomaly. Firstly, we redesign improved models for anomaly detection on arguments, focusing on their reliability. On top of these improved models, we introduce a clustering phase to create correlations among the various models on different arguments of the same system call: basically, we divide the set of invocations of a single system call into subsets whose arguments have a higher similarity. This idea arises from the consideration that some system calls do not exhibit a single normal behavior, but a plurality of behaviors (ways of use) in different portions of a program. For instance, as we will see in the next sections, an open system call can have a very different set of arguments when used to load a shared library or a user-supplied file. Therefore, the clustering step aims to capture relationships among the values of various arguments (e.g., to create correlations among some filenames and specific opening modes). In this way we can achieve a better characterization.

Finally, we introduce a sequence-based correlation model through a Markov chain. This enables the system to detect deviations in the control flow of a program, as well as abnormalities in each individual call, making evident the whole anomalous context that arises as a consequence, not just the single point of an attack.

The combination of these improvements solves the problems we outlined in the previous sections, and the resulting prototype thus outperforms SyscallAnomaly, also achieving a better generality.

... Distance metrics for system calls clustering

We applied a hierarchical agglomerative clustering algorithm [Han and Kamber, ] to find, for each system call, subclusters of invocations with similar arguments; we are interested in creating models on these clusters, and not on the general system call, in order to better capture normality and deviations. This is important because, as can be seen from Table ., in the IDEVAL dataset the single system call open constitutes a dominant fraction of the calls performed by a process. Indeed, open (and probably read) is one of the most used system calls on UNIX-like systems, since it opens a file or device in the file system and creates a handle (descriptor) for further use. open has three parameters: the file path, a set of flags indicating the type of operation (e.g., read-only, read-write, append, create if non-existing, etc.), and an optional opening mode, which specifies the permissions to set in case the file is created. Only by careful aggregation over these parameters may we divide each "polyfunctional" system call into "subgroups" that are specific to a single function.

Executable   % of open   Executions
fdformat     .
eject        .
ps           .
telnetd      .
ftpd         .
sendmail     .

Table .: Percentage of open syscalls and number of executions (per program) in the IDEVAL dataset.

We used a single-linkage, bottom-up agglomerative technique. Conceptually, such an algorithm initially assigns each of the N input elements to a different cluster, and computes an N × N distance matrix D. Distances among clusters are computed as the minimum distance between an element of the first cluster and an element of the second cluster. The algorithm progressively joins the elements i and j such that D(i, j) = min(D). D is updated by substituting the rows and columns of i and j with the row and column of the distances between the newly joined cluster and the remaining ones. The minimum distance between two different clusters, d_stop,min, is used as a stop criterion, in order to prevent the clustering process from lumping all of the system calls together; moreover, a lower bound d_stop,num on the number of final clusters is used as a stop criterion as well. If any of the stop criteria is satisfied, the process is stopped. The time complexity of a naive implementation is roughly O(N²), which would be too heavy in both time and memory. Besides introducing various tricks to speed up our code and reduce memory occupation (as suggested in [Golub and Loan, ]), we introduced a heuristic to reduce the average number of steps required by the algorithm. Basically, at each step, instead of joining just the elements at minimum distance d_min, we also join all the elements that are at a distance d < β·d_min from both of the elements at minimum distance, where β is a parameter of the algorithm (a sketch of this joining step is given after Table .). In this way, groups of elements that are very close together are joined in a single step, making the algorithm (on average) much faster, even if the worst-case complexity is unaffected. Table . reports the execution times measured in the case of d_stop,min = 1 and d_stop,num = 10; we want to recall that these timings, seemingly very high, refer to the training phase and not to the run-time phase.

Executable   Without heuristic   With heuristic
fdformat     .” (.”)             .” (.”)
eject        .” (.”)             .” (.”)
ps           ’” (”)              ” (”)
sendmail     ∞                   ’” (’”)

Table .: Execution times with and without the heuristic (in parentheses, the values obtained with the performance tweaks).
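The following Python sketch illustrates the joining step with the β heuristic; the distance function is a placeholder supplied by the caller, and the stop criteria are simplified with respect to the actual prototype (which is written in ANSI C):

# Sketch of the single-linkage agglomerative step with the beta
# heuristic described above; `dist` is a caller-supplied placeholder
# and the stop criteria are simplified.

def cluster(elements, dist, d_stop_min=1.0, n_stop=10, beta=1.5):
    clusters = [[e] for e in elements]

    def cdist(c1, c2):
        # single linkage: minimum pairwise distance between clusters
        return min(dist(a, b) for a in c1 for b in c2)

    while len(clusters) > n_stop:
        pairs = [(cdist(c1, c2), i, j)
                 for i, c1 in enumerate(clusters)
                 for j, c2 in enumerate(clusters) if i < j]
        d_min, i, j = min(pairs)
        if d_min > d_stop_min:
            break  # remaining clusters are too far apart: stop
        # beta heuristic: also join, in a single step, every cluster
        # within beta * d_min of BOTH elements of the closest pair
        group = {i, j}
        for k, c in enumerate(clusters):
            if k not in group and cdist(clusters[i], c) < beta * d_min \
                              and cdist(clusters[j], c) < beta * d_min:
                group.add(k)
        merged = [e for k in group for e in clusters[k]]
        clusters = [c for k, c in enumerate(clusters) if k not in group]
        clusters.append(merged)
    return clusters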

The core step in creating a good clustering is, of course, the definition of the distance among different sets of arguments. We proceed by comparing corresponding arguments in the calls, and for each pair of arguments a we compute the following.

Definition .. (Arguments Distance) The argument distance between two instances i and j of the same system call with respect to argument a is defined as:

d_a(i, j) := \begin{cases} K_a + \alpha_a \delta_a(i, j) & \text{if the elements are different} \\ 0 & \text{otherwise} \end{cases}    (.)

where K_a is a fixed quantity which creates a step between different elements, while the second term is the real distance between the arguments, δ_a(i, j), normalized by a parameter α_a. Note that the above formula is a template: the index a denotes that such variables are parametric with respect to the type of argument; how K_(·), α_(·), and δ_(·) are computed will be detailed below for each type of argument. The total distance between two instances i and j of the same system call is defined as follows.


Definition .. (Total Distance) The total distance between two instances i and j of the same system call is defined as:

D(i, j) := \sum_{a \in A} d_a(i, j)    (.)

where A is the set of system call arguments.
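A literal transcription of these two definitions in Python might look as follows; the K and α defaults are illustrative, the depth-based δ for path arguments anticipates the model described further below, and the function names are ours:

# Sketch of the per-argument distance template d_a and an example
# delta; K and alpha values follow the worked example in the text.

def argument_distance(x, y, K=1.0, alpha=1.0, delta=lambda x, y: 0.0):
    # d_a(i, j) = K_a + alpha_a * delta_a(i, j) if the values differ,
    # 0 otherwise.
    if x == y:
        return 0.0
    return K + alpha * delta(x, y)

def path_depth_delta(p1, p2):
    # delta for path arguments: difference in the number of
    # components of the two paths (directory tree depth).
    depth = lambda p: len([c for c in p.strip("/").split("/") if c])
    return abs(depth(p1) - depth(p2))

def total_distance(distances):
    # D(i, j): sum of the per-argument distances d_a(i, j)
    return sum(distances)

# Worked example from the text: K_path = 5, alpha_path = 1
d = argument_distance("/usr/lib/libc.so", "/etc/passwd",
                      K=5.0, alpha=1.0, delta=path_depth_delta)
assert d == 6.0  # 5 + 1 * |3 - 2|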

Hierarchical clustering, however, creates a problem for the detection phase, since there is no concept analogous to the centroid of partitioning algorithms that can be used for classifying new inputs, and we obviously cannot afford to re-cluster everything after each new input. Thus, we need to generate, from each cluster, a representative model that can be used to cluster (or classify) further inputs. This is a well-known problem which needs a creative solution. For each type of argument, we decided to develop a stochastic model that can be used to this end. These models should be able to associate a probability to inputs, i.e., to generate a probability density function that can be used to state the probability with which the input belongs to the model. As we will see, in most cases this will be in the form of a discrete probability, but more complex models such as HMMs will also be used. Moreover, a concept of distance must be defined between each model and the input. The model should be able to incorporate new candidates during training, and to slowly adapt in order to represent the whole cluster. It is important to note that it is not strictly necessary for the candidate model and its distance functions to be the same ones used for clustering purposes. It is also important to note that the clustering could be influenced by the presence of outliers (such as attacks) in the training set. This could lead to the formation of small clusters of anomalous call instances. As we will see in Section .., this does not undermine the ability of the overall system to detect anomalies.

As previously stated, at least four different types of arguments are passed to system calls: path names and file names, discrete numeric values, arguments passed to programs for execution, and user and group identifiers (UIDs and GIDs). For each type of argument, we designed a representative model and an appropriate distance function.

File name arguments Path names and file names are very frequently used in system calls. They are complex structures, rich in useful information, and therefore difficult to model properly. A first interesting piece of information is the commonality of the path, since files residing in the same branch of the file system are usually more similar than the ones in different branches. Usually, inside a path, the first and the last directory carry the most significance. If the filename has a structure similar to other filenames, this is indicative: for instance, common prefixes in the filename, such as the prefix lib, or common suffixes, such as the extensions.

For the clustering phase, we chose to re-use a very simple model already present in SyscallAnomaly: the directory tree depth. This is easy to compute, and experimentally leads to fairly good results even if very simple. Thus, in Equation . we set δ_a to be the distance in depth. For instance, let K_path = 5 and α_path = 1; comparing /usr/lib/libc.so and /etc/passwd we obtain d_a = 5 + 1 · 1 = 6, while comparing /usr/lib/libc.so and /usr/lib/libelf.so.1 we obtain d_a = 0.

After the clustering has been done, we represent the path names of the files of a cluster with a probabilistic tree, which contains all the directories involved, with a probability weight for each. For instance, if a cluster contains /usr/lib/libc.so, /usr/lib/libelf.so, and /usr/local/lib/libintl.so, the generated tree will be as in Figure ..

File names are usually too variable, in the context of a single cluster, to always allow a meaningful model to be created. However, we chose to set up a system-wide threshold below which the filenames are so regular that they can be considered a model, and thus any other filename can be considered an anomaly. The probability returned by the model is therefore P_T = P_t · P_f, where P_t is the probability that the path has been generated by the probabilistic tree, and P_f is set to 1 if the filename model is not significant (or if it is significant and the filename belongs to the learned set), and to 0 if the model is significant and the filename is outside the set.

Note .. (Configuration parameters) According to the experiments described in Section .., the best detection accuracy is achieved with the default values of K_path = 1 and α_path = 1. In particular, α_path is just a normalization parameter and, in most cases, does not need to be changed. K_path influences the quality of the clustering, and thus may need to be changed if some particular, pathological false positives need to be eliminated in a specific setting.

Figure .: Probabilistic tree example (directories usr, lib, and local, with edge weights 1.00, 0.67, and 0.33).
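A sketch of such a probabilistic tree in Python (class and method names are ours) reproduces the weights of Figure . when trained on the three paths of the example:

from collections import defaultdict

class ProbabilisticTree:
    # Sketch of the probabilistic tree over directory components:
    # each node stores how often each child was taken in training.
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, paths):
        for p in paths:
            parts = p.strip("/").split("/")[:-1]  # directories only
            node = ("",)
            for d in parts:
                self.counts[node][d] += 1
                node = node + (d,)

    def probability(self, path):
        # P_t: product of the branch probabilities along the path
        parts = path.strip("/").split("/")[:-1]
        node, p = ("",), 1.0
        for d in parts:
            total = sum(self.counts[node].values())
            if total == 0 or d not in self.counts[node]:
                return 0.0
            p *= self.counts[node][d] / total
            node = node + (d,)
        return p

t = ProbabilisticTree()
t.train(["/usr/lib/libc.so", "/usr/lib/libelf.so",
         "/usr/local/lib/libintl.so"])
print(t.probability("/usr/lib/libm.so"))  # 1.00 * 0.67 ~= 0.67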

Discrete numeric values Flags, opening modes, etc., are usually chosen from a limited set. Therefore, we can store all of them along with a discrete probability. Since in this case two values can only be "equal" or "different", we set up a binary distance model for clustering.

Definition .. (Discrete numeric args. distance) Let x = i_disc, y = j_disc be the values of the arguments of type disc (i.e., discrete numeric argument) of two instances i and j of the same system call. The distance is defined as:

d_disc(i, j) := \begin{cases} K_disc & \text{if } x \neq y \\ 0 & \text{if } x = y \end{cases}

where K_disc is a configuration parameter.

Model fusion and the incorporation of new elements are straightforward, as is the generation of the probability for a new input to belong to the model.

Note .. (Configuration parameters) According to the experiments described in Section .., the best detection accuracy is achieved with the default value of K_disc = 1. This parameter influences the quality of the clustering, and thus may need to be changed if some particular, pathological false positives need to be eliminated in a specific setting.

Execution arguments We also noticed that execution arguments (i.e., the arguments passed to the execve system call) are difficult to model, but we found their length to be an extremely effective indicator of similarity of use. Therefore, we set up a binary distance model.

Definition .. (Execution arguments distance) Let x = i_arg and y = j_arg be the string values of the arguments of type arg (i.e., execution argument) of two instances i and j of the same system call. The distance is defined as:

d_arg(i, j) := \begin{cases} K_arg & \text{if } |x| \neq |y| \\ 0 & \text{if } |x| = |y| \end{cases}

where K_arg is a configuration parameter and |x| is the length of string x.

Note .. (Configuration parameters) According to the experiments described in Section .., the best detection accuracy is achieved with the default value of K_arg = 1. This parameter influences the quality of the clustering, and thus may need to be changed if some particular, pathological false positives need to be eliminated in a specific setting.

In this way, arguments with the same length are clustered together. For each cluster, we compute the minimum and maximum values of the length of the arguments. Fusion of models and incorporation of new elements are straightforward. The probability for a new input to belong to the model is 1 if its length belongs to the interval, and 0 otherwise.

Token arguments Many arguments express UIDs or GIDs, so we developed an ad-hoc model for user and group identifiers. Our reasoning is that all these discrete values have three different meanings: UID 0 is reserved to the super-user, low values usually are for system special users, while real users have UIDs and GIDs above a system-dependent threshold. So, we divided the input space into these three groups, and computed the distance for clustering using the following definition.

Definition .. (UID/GID argument distance) Let x = i_uid and y = j_uid be the values of the arguments of type uid (i.e., UID/GID argument) of two instances i and j of the same system call. The distance is defined as:

d_uid(i, j) := \begin{cases} K_uid & \text{if } g(x) \neq g(y) \\ 0 & \text{if } g(x) = g(y) \end{cases}

where K_uid is a configuration parameter and g : ℕ → {groups} maps an identifier into the set of user groups.

Since UIDs are limited in number, they are preserved for testing, without associating a discrete probability to them. Fusion of models and incorporation of new elements are straightforward. The probability for a new input to belong to the model is 1 if the UID belongs to the learned set, and 0 otherwise.
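A sketch of the grouping function g and of the resulting distance (the threshold separating system accounts from real users is an assumption here, as it varies across systems):

# Sketch of the group function g used by the UID/GID distance; the
# threshold below is an assumption (it is system-dependent).
UID_THRESHOLD = 1000  # assumed boundary between system and real users

def g(uid):
    # Map an identifier into one of the three classes described above.
    if uid == 0:
        return "super-user"
    if uid < UID_THRESHOLD:
        return "system"
    return "regular"

def d_uid(x, y, K_uid=1.0):
    # Binary distance: zero within the same class, K_uid otherwise.
    return 0.0 if g(x) == g(y) else K_uid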

Note .. (Configuration parameters) According to the experiments described in Section .., the best detection accuracy is achieved with the default value of K_uid = 1. This parameter influences the quality of the clustering, and thus may need to be changed if some particular, pathological false positives need to be eliminated in a specific setting.

In Table . we summarize the association of the models described above with the arguments of each of the system calls we take into account. This model-based clustering is somewhat error-prone, since we would expect the obtained centroids to be more general, and thus to somehow interfere when clustering either new or old instances. To double-check this possible issue, we followed a simple process:

1. creation of clusters on the training dataset;

2. generation of models from each cluster;

3. use of the models to classify the original dataset into clusters, checking that inputs are correctly assigned to the same cluster they contributed to create.

This is done both for checking the representativeness of the models, and for double-checking that the different distances computed make sense and separate different clusters.

System call             Argument     Model
open                    pathname     Path
                        flags        Discrete Numeric
                        mode         Discrete Numeric
execve                  filename     Path
                        argv         Execution Argument
setuid                  uid          User/Group
setgid                  gid          User/Group
setreuid, setregid      ruid         User/Group
                        euid         User/Group
setresuid, setresgid    ruid         User/Group
                        euid         User/Group
                        suid         User/Group
symlink, link, rename   oldpath      Path
                        newpath      Path
mount                   source       Path
                        target       Path
                        flags        Discrete Numeric
umount                  target       Path
                        flags        Discrete Numeric
exit                    status       Discrete Numeric
chown, lchown           path         Path
                        group        User/Group
                        owner        User/Group
chmod, mkdir, creat     path         Path
                        mode         Discrete Numeric
mknod                   pathname     Path
                        mode         Discrete Numeric
                        dev          Discrete Numeric
unlink, rmdir           pathname     Path

Table .: Association of models to system call arguments in our prototype.

Table . shows, for each program in the IDEVAL dataset (considering the representative open system call), the percentage of inputs correctly classified, and a confidence value, computed as the average "probability to belong" of each element w.r.t. the cluster it helped to build. The results are almost perfect, as expected, with a lower value for the ftpd program, which has a wider variability in file names.

Executable   Correctly classified   Confidence
fdformat     .                      .
eject        .                      .
ps           .                      .
telnetd      .                      .
ftpd         .                      .
sendmail     .                      .

Table .: Cluster validation process.

.. Capturing process behavior

To take into account the execution context of each system call, we use a first-order Markov chain to represent the program flow. The model states represent the system calls or, better, the various clusters of each system call, as detected during the clustering process. For instance, if we detected three clusters in the open system call, and two in the execve system call, then the model will be constituted by five states: open1, open2, open3, execve1, execve2. Each transition will reflect the probability of passing from one of these groups to another through the program. As we already observed in Section .., this approach was investigated in former literature, but never in conjunction with the handling of parameters and with a clustering approach.

A time-domain process demonstrates a Markov property if the conditional probability density of the current event, given all present and past events, depends only on the K most recent events. K is known as the order of the underlying model. Usually, models of order K = 1 are considered, because they are simpler to analyze mathematically. Higher-order models can usually be approximated with first-order models, but approaches for using high-order Markov models in an efficient manner have also been proposed, even in the intrusion detection field [Ju and Vardi, ]. A good estimate of the order can be extracted from data using the criteria defined in [Merhav et al., ]; for normal Markov models, a χ²-test for first- against second-order dependency can be used [Haccou and Meelis, ], but an information criterion such as the Bayesian Information Criterion (BIC) or the Minimum Description Length (MDL) can be used as well.

... Training Phase

Each execution of the program in the training set is considered as a sequence of observations. Using the output of the clustering process, each system call is classified into the correct cluster by computing the probability value for each model and choosing the cluster whose models give the maximum composite probability over all known models, max(∏_{i∈M} P_i). The probabilities of the Markov model are then straightforward to compute. The final result can be similar to what is shown in Figure .. At first sight, this could resemble a simple Markov chain; however, it should be noticed that each state of this Markov model can be mapped onto the set of possible cluster elements associated to it. From this perspective, it can be seen as a very specific HMM whose states are fully observable through a uniform emission probability over the associated syscalls. By simply turning the clustering algorithm into one with overlapping clusters, we could obtain a proper HMM description of the user behavior, as it would be if we decided to further merge states together. This is actually our ongoing work toward full HMM behavior modeling, with the aim, through overlapping syscall clustering, of improving the performance of the classical Baum-Welch algorithm and solving the issue of selecting the cardinality of the HMM hidden space [Rabiner, ]. From our experiments, in the case of simple traces like the ones found in the IDEVAL dataset, the model used suffices, so we are working on more complex scenarios (i.e., new real datasets) to improve the proposed algorithm in a more grounded way.
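A sketch of the transition-probability estimation over cluster labels (the label names and the toy traces below are our own examples):

from collections import defaultdict

# Sketch of Markov-chain estimation over cluster labels such as
# "open1" or "execve2"; labels and traces are illustrative.

def train_markov(traces):
    # Estimate transition probabilities from a list of label
    # sequences, one sequence per training execution.
    counts = defaultdict(lambda: defaultdict(int))
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nexts.values()) for b, n in nexts.items()}
            for a, nexts in counts.items()}

model = train_markov([
    ["execve1", "open1", "open1", "read1", "exit1"],
    ["execve1", "open2", "read1", "exit1"],
])
print(model["execve1"])  # {'open1': 0.5, 'open2': 0.5}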

This type of model is resistant to the presence of a limited number of outliers (e.g., abruptly terminated executions, or attacks) in the training set, because the resulting transition probabilities will drop near zero. For the same reason, it is also resistant to the presence of any cluster of anomalous invocations created by the clustering phase. Therefore, the presence of a minority of attacks in the training set will not adversely affect the learning phase, which in turn does not require an attack-free training set.

Figure .: Example of Markov model (states such as ex0, op12, op24, op3, op45, re0, se0, and e0, connected by weighted transition edges).

The final result can be similar to what is shown in Figure .. In future extensions, the Markov model could be simplified by merging some states together: this would transform it into a Hidden Markov Model, where a state does not correspond directly to a system call cluster, but rather to a probability distribution of possible outcomes. However, the use of HMMs complicates both the training and detection phases [Rabiner, ]. From our experiments, in the case of simple traces like the ones found in the IDEVAL dataset, this step is unnecessary.

... Detection Phase

For detection, three distinct probabilities can be computed for each executed system call: the probability of the execution sequence inside the Markov model up to now, P_s; the probability of the system call belonging to the best-matching cluster, P_c; and the last transition probability in the Markov model, P_m.

The latter two probabilities can be combined into a single "punctual" probability of the single system call, P_p = P_c · P_m, keeping a separate value for the "sequence" probability P_s. In order to set appropriate threshold values, we use the training data, compute the lowest probability over the whole dataset for that single program (both for the sequence probability and for the punctual probability), and set this value (possibly modified by a tolerance) as the anomaly threshold. The tolerance can be tuned to trade off DR for FPR. During detection, each system call is considered in the context of the process. The cluster models are once again used to classify each system call into the correct cluster as explained above: therefore, P_c = max(∏_{i∈M} P_i). P_s and P_m are computed from the Markov model, and require our system to keep track of the current state for each running process. If either P_s or P_p = P_c · P_m is lower than the anomaly threshold, the process is flagged as malicious.
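The per-call detection step can be summarized in a few lines; in this sketch, `classify` and `model` stand for the learned cluster models and transition probabilities, the thresholds come from training, and all names are ours:

# Sketch of the detection step: P_p = P_c * P_m per call, plus the
# running sequence probability P_s, checked against thresholds.

def detect(trace, model, classify, t_punctual, t_sequence):
    # `classify` returns (cluster_label, P_c) for a system call;
    # `model` holds the Markov transition probabilities.
    prev, p_s = None, 1.0
    for syscall in trace:
        label, p_c = classify(syscall)
        p_m = model.get(prev, {}).get(label, 0.0) if prev else 1.0
        p_p = p_c * p_m          # punctual probability
        p_s *= p_p               # sequence probability so far
        if p_p < t_punctual or p_s < t_sequence:
            return "malicious"
        prev = label
    return "normal"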

Note .. (Probability Scaling) Given an l-long sequence of system calls, its sequence probability is P_s(l) = \prod_{i=0}^{l} P_p(i), where P_p(i) ∈ [0, 1] is the punctual probability of the i-th system call in the sequence. Therefore, it is self-evident that \lim_{l \to +\infty} P_s(l) = 0. Experimentally, we observed that the sequence probability quickly decreases to zero, even for short sequences (on the IDEVAL dataset, we found that P_s(l) ≃ 0 for l ≥ 25). This leads to a high number of false positives, since many sequences are assigned probabilities close to zero (thus, always lower than any threshold value).

To overcome this shortcoming, we implemented two probability scaling functions, both based on the geometric mean. As a first attempt, we computed P_s(l) = \sqrt[l]{\prod_{i=1}^{l} P_p(i)}, but in this case

P\left[\lim_{l \to +\infty} P_s(l) = e^{-1}\right] = 1.

Proof .. Let G(P_p(1), \ldots, P_p(l)) = \sqrt[l]{\prod_{i=1}^{l} P_p(i)} be the geometric mean of the sample P_p(1), \ldots, P_p(l). If we assume that the sample is generated by a uniform distribution (i.e., P_p(1), \ldots, P_p(l) \sim U(0, 1)), then -\log P_p(i) \sim E(\beta = 1)\ \forall i = 1, \ldots, l: this can be proven by observing that the Cumulative Density Function (CDF) [Pestman, ] of -\log X (with X = P_p \sim U(0, 1)) equals the CDF of an exponentially distributed variable with \beta = 1.

The arithmetic mean A(\cdot) of the sample \log P_p(1), \ldots, \log P_p(l) converges (in probability) to -\beta = -1 for l \to +\infty, that is:

P\left[\lim_{l \to +\infty} \frac{1}{l}\sum_{i=1}^{l} \log P_p(i) = -\beta = -1\right] = 1

because of the strong law of large numbers. The geometric mean being

G(P_p(1), \ldots, P_p(l)) = \left(\prod_{i=1}^{l} P_p(i)\right)^{\frac{1}{l}} = e^{\frac{1}{l}\sum_{i=1}^{l} \log P_p(i)},

we have

P\left[\lim_{l \to +\infty} e^{\frac{1}{l}\sum_{i=1}^{l} \log P_p(i)} = e^{-\beta} = e^{-1}\right] = 1.

This is not our desired result, so we modified this formula to introduce a sort of "forgetting factor": P_s(l) = \sqrt[2l]{\prod_{i=1}^{l} P_p(i)^i}. In this case, we can prove that P[\lim_{l \to +\infty} P_s(l) = 0] = 1.

Proof .. The convergence of P_s(l) to zero can be proven by observing that:

P_s(l) = \left(\sqrt[l]{\prod_{i=1}^{l} P_p(i)^i}\right)^{\frac{1}{2}} = \left(e^{\frac{1}{l}\sum_{i=1}^{l} i \cdot \log P_p(i)}\right)^{\frac{1}{2}} = \left(e^{\frac{1}{l}\sum_{j=0}^{l}\left(\sum_{i=1}^{l-j} \log P_p(i)\right)}\right)^{\frac{1}{2}} = \left(e^{\sum_{j=0}^{l}\left(\frac{1}{l}\sum_{i=1}^{l} \log P_p(i) - \frac{1}{l}\sum_{i=l-j+1}^{l} \log P_p(i)\right)}\right)^{\frac{1}{2}}

Because of the previous proof, we can write that:

P\left[\lim_{l \to +\infty} \frac{1}{l}\sum_{i=1}^{l} \log P_p(i) = -1\right] = 1


Figure .: Measured sequence probability (logarithm of the assigned probability vs. sequence length, in number of calls), comparing the original calculation and the second variant of scaling.

We can further observe that, \log P_p(i) being negative,

\forall j, l > 0: \quad \sum_{i=1}^{l} \log P_p(i) < \sum_{i=l-j+1}^{l} \log P_p(i);

therefore, the exponent is a sum of infinitely many negative quantities less than zero, leading us to the result that, in probability,

\lim_{l \to +\infty} \left(e^{\sum_{j=0}^{l}\left(\frac{1}{l}\sum_{i=1}^{l} \log P_p(i) - \frac{1}{l}\sum_{i=l-j+1}^{l} \log P_p(i)\right)}\right)^{\frac{1}{2}} = \lim_{x \to -\infty} e^x = 0.

Even if this second variant once again makes P_s(l) → 0 (in probability), our experiments have shown that this effect is much slower than with the original formula: P_s(l) ≃ 0 for l ≥ 300 (vs. l ≥ 25 of the previous version), as shown in Figure .. In fact, this scaling function also leads to much better results in terms of FPR (see Section ..).
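The behavior of the three computations can be checked numerically; in this Python sketch the punctual probabilities are drawn uniformly at random, matching the assumption used in the proofs, and the function names are ours:

import math
import random

# Compare the raw sequence probability, the geometric mean, and the
# "forgetting factor" variant P_s(l) = (prod_i P_p(i)^i)^(1/(2l)).

def raw(ps):
    return math.prod(ps)

def geometric(ps):
    # converges to e^-1 for uniform P_p(i), as proven above
    return math.prod(ps) ** (1.0 / len(ps))

def forgetting(ps):
    # computed in log space to avoid numerical underflow
    l = len(ps)
    log_sum = sum((i + 1) * math.log(p) for i, p in enumerate(ps))
    return math.exp(log_sum / (2 * l))

random.seed(0)
for l in (25, 300):
    ps = [random.random() for _ in range(l)]
    print(l, raw(ps), geometric(ps), forgetting(ps))
# The raw product is already negligible at l = 25, the geometric mean
# hovers near e^-1, and the forgetting-factor variant decays far more
# slowly than the raw product.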

Instead of using probabilities, which are sensitive to the sequence length, a possible alternative, which we are currently exploring, is the exploitation of distance metrics between Markov models [Stolcke and Omohundro, b, c, b] to define robust criteria for comparing new and learned sequence models. Basically, the idea is to create and continuously update a Markov model associated to the program instance being monitored, and to check how much such a model differs from the ones the system has learned for the same program. This approach is complementary to the one proposed above, since it requires long sequences to build a proper Markov model. So, the use of both criteria (sequence likelihood for short activations, and model comparison for longer ones) could lead to a reduction of false positives on the sequence model.

.. Prototype implementation

We implemented the system described above as a two-stage, highly configurable, modular architecture written in ANSI C. The high-level structure is depicted in Figure .: the system is plugged into the Linux auditd. The Cluster Manager is in charge of the clustering phase, while the Markov Model Manager implements the Markov modeling features. Both modules are used in both the training phase and the detection phase, as detailed below.

The Cluster Manager module implements abstract clustering procedures along with abstract representation and storage of the generated clusters. The Markov Model Manager is conceptually similar to the Cluster Manager: it has a basic Markov chain implementation, along with ancillary modules for model handling and storage.

During training, the Cluster Manager is invoked to create the clusters on a given system call trail, while the Markov Model Manager infers the Markov model according to the clustering. At runtime, for each system call, the Cluster Manager is invoked to find the appropriate cluster; the Markov Model Manager is also used to keep track of the current behavior of the monitored process: if significant deviations are found, alerts are fired and logged accordingly.

The system can output to standard output, syslog facilities, and IDMEF files. Both the clustering phase and the behavioral analysis are multi-threaded, and intermediate results of both procedures can be dumped in eXtensible Markup Language (XML).


[Figure .: The high-level structure of our prototype. Pipeline: kernel auditing (auditd, execve(args**), <syscall>(args**), exit()) → syscall extraction → syscall classification (Cluster Manager) → behavior modeling (Markov Model Manager) → alerting (syslogd, <IDMEF />).]

.. Experimental Results

In this section, we both compare the detection accuracy of our proposal and analyze the performance of the running prototype we developed. Because of the known issues of IDEVAL (plus our findings summarized in Section ..), we also collected fresh training data and new attacks to further prove that our proposal is promising in terms of accuracy.

As we detailed in Section ..., our system can be tuned to avoid overfitting; in the current implementation, such parameters can be specified for each system call, thus in the following we report the bounds of variation instead of listing all the single values: dstop,num ∈ {1, 2, 3}, dstop,min ∈ {6, 10, 20, 60}.

... Detection accuracy

For the reasons outlined above and in Section .., as well as for the uncertainty outlined in Section .., we did not rely on purely numerical results on DR or FPR. Instead, we compared the results obtained by our software with the results of SyscallAnomaly in terms of a set of case studies, comparing them one by one. It turned out that our software has two main advantages over LibAnomaly:

• a better contextualization of anomalies, which lets the system detect whether a single syscall has been altered, or if a sequence of calls became anomalous as a consequence of a suspicious attack;

• a strong characterization of subgroups with closer and morereliable sub-models.


HMM State execve0 (START ⇒ execve0)
    filename        /usr/bin/fdformat
    argv            fdformat\0x20\0x20\0x20\0x20[...]
    Pc              0.1
    Pm              1
    Pp (thresh.)    0.1 (1)

HMM State open10 (open2 ⇒ open10)
    pathname        /usr/lib/locale/iso_8859_1/[...]
    flags           -r-xr-xr-x
    mode            33133
    Pc              5 · 10^-4
    Pm              0
    Pp (thresh.)    0 (undefined)

HMM State open11 (open10 ⇒ open11)
    pathname        /devices/pseudo/vol@0:volctl
    flags           crw-rw-rw-
    mode            8630
    Pc              1
    Pm              0
    Pp (thresh.)    0 (undefined)

HMM State chmod (open11 ⇒ chmod)
    pathname        /devices/pseudo/vol@0:volctl
    flags           crw-rw-rw-
    mode            8630
    Pc              0.1
    Pm              0
    Pp (thresh.)    0 (undefined)

HMM State exit0 (chmod ⇒ exit0)
    status          0
    Pc              1
    Pm              0
    Pp (thresh.)    0 (undefined)

Table .: fdformat: attack and consequences


As an example of the first advantage, let us analyze again the program fdformat, which was already analyzed in Section ... As can be seen from Table ., our system correctly flags execve as anomalous (due to an excessive length of input). It can be seen that Pm is 1 (the system call is the one we expected), but the models of the syscall are not matching, generating a very low Pc. The localization file opening is also flagged as anomalous for two reasons: scarce affinity with the model (because of the strange filename), and also an erroneous transition between the open subgroups open2 and open10. In the case of such an anomalous transition, thresholds are shown as "undefined", as this transition has never been observed in training. The attack effect (chmod and the change of permissions on /export/home/elmoc/.cshrc) and various intervening syscalls are also flagged as anomalous because the transition has never been observed (Pm = 0); while reviewing logs, this also helps us in understanding whether or not the buffer overflow attack has succeeded. A similar observation can be made on the execution of chmod on /etc/shadow ensuing an attack on eject.

In the case of ps, our system flags the execve system call, as usual, for excessive length of input. File /tmp/foo is also detected as an anomalous argument for open. In LibAnomaly, this happened just because of the presence of an underscore, and was easy to bypass. In our case, /tmp/foo is compared against a sub-cluster of open which contains only the /tmp/ps data, and therefore any other name, even if structurally similar, will be flagged as anomalous with a very high confidence. A sequence of chmod syscalls executed inside directory /home/secret as a result of the attacks is also flagged as an anomalous program flow.

Limiting the scope to the detection accuracy of our system, we performed several experiments with both eject and bsdtar, and we summarize the results in Table .. The prototype has been trained with ten different executions of eject and more than a hundred executions of bsdtar. We then audited eight instances of the activity of eject under attack, while for bsdtar we logged seven malicious executions. We report DRs and FPRs with (Y) and without (N) the use of Markov models, and we compute FPRs using cross-validation through the data set (i.e., by training the algorithm on a subset of the dataset and subsequently testing on the remaining part of the dataset). Note that, to better analyze the false positives, we accounted for both false positive sequences (Seq.) and false positive


                              DR          FPR
Granularity:                  Sequence    Sequence    Call

bsdtar   Markov model Y       100         1.6         0.1
                      N       88          1.6         0.1
eject                 Y       100         0           0
                      N       0           0           0

Table .: DRs and FPRs (%) on two test programs, with (Y) and without (N) Markov models.

system calls (Call). In both cases, using the complete algorithm yields a 100% DR with a very low FPR. In the case of eject, the exploit is detected at the very beginning: since a very long argument is passed to the execve, this triggers the argument model. The detection of the shellcode we injected exploiting the buffer overflow in bsdtar is identified by the open of the unexpected (special) file /dev/tty. Note that the use of thresholds calculated on the overall Markov model allows us to achieve a 100% DR in the case of eject; without the Markov model, the attack wouldn't be detected at all.

It is very difficult to compare our results directly with the other similar systems we identified in Section ... In [Tandon and Chan, ] the evaluation is performed on the Defense Advanced Research Projects Agency (DARPA) dataset, but DRs and FPRs are not given (the number of detections and false alarms is not normalized), so a direct comparison is difficult. Moreover, detection is computed using an arbitrary time window, and false alerts are instead given in "alerts per day". It is correspondingly difficult to compare against the results in [Bhatkar et al., ], as the evaluation is run over a dataset which is not disclosed, using two programs that are very different from the ones we use, and a handful of exploits chosen by the authors. Different scalings of the false positives and DRs also make a comparison impossible to draw.


[Figure .: Comparison of the effect on detection of different probability scaling functions (%DR vs. %FPR for the original formula, the first variant, and the second variant).]

As a side result, we tested the detection accuracy of the two scaling functions we proposed for computing the sequence probability Ps. As shown in Figure ., the first and the second variant both show a lower FPR w.r.t. the original, unscaled version.

Note .. (Configuration parameters) Although we performed all the experiments with different values of dstop,min and dstop,num, we report only the best detection accuracy, achieved with the default settings: dstop,min = 10, dstop,num = 3.

These parameters influence the quality of the clustering and thus may need to be changed if some particular, pathological false positives need to be eliminated in a specific setting. In this case, human intervention is required; in particular, it is required to run the tool on a small dataset collected on a live system and check the clustering for outliers. If clusters containing too many outliers exist, then dstop,min may be increased. dstop,num should be increased only if outliers cannot be eliminated.


... Performance measurements

An IDS should not introduce significant performance overheads in terms of the time required to classify events as malicious (or not). An IDS based on the analysis of system calls has to intercept and process every single syscall invoked on the operating system by userspace applications; for this reason, the faster a system call is processed, the better. We profiled the code of our system with gprof and valgrind for CPU and memory requirements. We ran the IDS on data drawn from the IDEVAL dataset (which is sufficient for performance measurements, as in this case we are only interested in the throughput and not in realistic DRs).

In Table . we report the performance measured on the five working days of the first week of the dataset for training, and of the fourth week for testing. The throughput X varies during training between and syscalls per second. The clustering phase is the bottleneck in most cases, while the Markov model construction is generally faster. Due to the clustering step, the training phase is memory consuming: in the worst case, we recorded a memory usage of about MB. The performance observed in the detection phase is of course even more important: in this case, it varies between and syscalls/sec. Considering that the kernel of a typical machine running services such as HTTP/FTP on average executes system calls in the order of thousands per second (e.g., around system calls per second for wu-ftpd [Mutz et al., ]), the overhead introduced by our IDS is noticeable but does not severely impact system operations.


. Mixing Deterministic and Stochastic Models

In this section, we describe an improvement to the syscall-based anomaly detection system described in Section ., which incorporates both the deterministic models of what we called FSA-DF, detailed in [Bhatkar et al., ] (and analyzed in detail in Section ..), and the stochastic models of SADE. In [Frossi et al., ] we propose a number of modifications, described in the following, that significantly improve the performance of both original approaches. In this section we begin by comparing FSA-DF with SADE and analyzing their respective performance in terms of detection accuracy. Then, we outline the major shortcomings of the two systems, and propose various changes in the models that can address them. In particular, we kept the deterministic control flow model of FSA-DF and substituted some deterministic dataflow relations with their stochastic equivalents, using the technique described in Section .... This allowed us to maintain an accurate characterization of the process control flow and, at the same time, to avoid many FPs due to the deterministic relations among arguments. More precisely, the original learning algorithm of FSA-DF can be schematized as follows:

∀ c = ⟨syscall_{i-1}, syscall_i⟩:
    make_state(c, PC)
    learn_relations(c)
        relations: equal, elementOf, subsetOf, range, hasExtension,
                   isWithinDir, contains, hasSameDirAs, hasSameBaseAs,
                   hasSameExtensionAs

We noticed that simple relations such as equal, elementOf, and contains were not suitable for strings. In particular, they raise too many FPs even in obvious cases like "/tmp/php1553" vs. "/tmp/php9022", where the deterministic equal does not hold, but a smoother predicate could easily detect the common prefix and thus group the two strings in the same class. This problem clearly extends to the other two relations that are based on string comparison as well. To mitigate these side effects, we augmented the learning algorithm as follows:

learn_string_domain(syscall_i)
∀ c = ⟨syscall_{i-1}, syscall_i⟩:
    make_state(c, PC)
    learn_relations(c)
        relations: subsetOf, range, hasExtension, isWithinDir,
                   hasSameDirAs, hasSameBaseAs, hasSameExtensionAs
    save_model(c)

where the pseudo-function learn_string_domain corresponds to the training of the SOM as described in Section ..., and save_model plays the role of the three relations mentioned above. Basically, it stores the Best Matching Unit (BMU) corresponding to the string arguments of the current system call. The BMU is retrieved by querying the previously trained SOM. In addition to this, we complemented the resulting system with two new models to cope with DoS attacks (see Section ...) and with the presence of outliers in the training dataset (see Section ...).

We show how targeted modifications of their anomaly models, as opposed to a redesign of the global system, can noticeably improve the overall detection accuracy. Finally, the impact of these modifications is discussed by comparing the performance of the two original implementations with two modified versions complemented with our models.

.. Enhanced Detection Models

The improvements we made focus on path and execution arguments. A new string length model is added, exploiting a Gaussian interval as detailed in Section .... The new edge frequency model described in Section ... has been added to detect DoS attacks. Also, in Section ... we describe how we exploited SOMs to model the similarity among path arguments. The resulting system, Hybrid IDS, incorporates the models of FSA-DF and SADE along with the aforementioned enhancements.

... Argument Length Using Gaussian Intervals

The model for system call execution arguments implemented in SADE takes into account the minimum and maximum length of the parameters found during training, and checks whether each string parameter falls into this range (model probability 1) or not (model probability 0). This technique allows the detection of common attempts of buffer overflow through the command line, for instance, as well as various other command line exploits. However, such criteria do not model "how different" two arguments are from each other; a smoother function is more desirable. Furthermore, the frequency of each argument in the training set is not taken into account at all. Last but not least, the model is not resilient to the presence of attacks in the training set; just one occurrence of a malicious string would stretch the maximum of the interval, allowing arguments of almost any length.

The improved version of the interval model uses a Gaussian distribution for modeling the argument length Xargs = |args|, estimated from the data in terms of sample mean and sample variance. The anomaly threshold is a percentile Targs centered on the mean. Arguments whose length falls outside the stochastic interval are flagged as anomalous. This model is resilient to the presence of outliers in the dataset. The Gaussian distribution has been chosen since it is the natural stochastic extension of a range interval for the length. An example is shown in Figure ..
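A minimal C sketch of this model follows; the structure, function names, and the mapping from the percentile Targs to a standard-normal quantile z are illustrative assumptions, not the prototype's actual code.

#include <math.h>

/* Sketch of a Gaussian interval model for argument lengths. */
typedef struct { double mean, var; unsigned n; } len_model_t;

void len_train(len_model_t *m, const unsigned *len, unsigned n) {
    double s = 0.0, s2 = 0.0;
    for (unsigned i = 0; i < n; i++) {
        s += len[i];
        s2 += (double)len[i] * len[i];
    }
    m->n = n;
    m->mean = s / n;
    m->var = s2 / n - m->mean * m->mean; /* sample variance */
}

/* Flag lengths outside the interval centered on the mean that covers
 * the configured percentile T_args; z is the corresponding
 * standard-normal quantile (e.g., z = 1.96 for 95%). */
int len_is_anomalous(const len_model_t *m, unsigned len, double z) {
    double sd = sqrt(m->var);
    return (len < m->mean - z * sd) || (len > m->mean + z * sd);
}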

Model Validation. During detection the model self-assesses its precision by calculating the kurtosis measure [Joanes and Gill, ], defined on the fourth central moment of X. Thin-tailed distributions with a low peak around the mean exhibit γX < 0, while positive values are typical of fat-tailed distributions with an acute peak. We used

$$\hat{\gamma}_X = \frac{\mu_{X,4}}{\sigma_X^4} - 3$$

to estimate γX. Thus, γXargs < 0 means that the sample is spread over a big interval, while positive values indicate a less "fuzzy" set of values. It is indeed straightforward that highly negative values indicate non-significant estimations, as the interval would include almost all lengths. In this case, the model falls back to a simple interval.


[Figure .: Sample estimated Gaussian intervals for string length, on training data of sudo (left) and ftp (right): N(29.8, 184.844) with thresholds [12.37, 47.22] (left), and N(19.25, 1.6875) with thresholds [16.25, 22.25] (right).]

... DoS Detection Using Edge Traversal Frequency

DoS attacks which force the process to get stuck in a legal section ofthe normal control flow could be detected by SADE as violationsof the Markov model, but not by FSA-DF. On the other hand,the statistical models implemented in SADE are more robustbut have higher False Negative Rates (FNRs) than the determinis-tic detection implemented in FSA-DF. However, as already statedin Section ., the cumulative probability of the traversed edgesworks well only with execution traces of similar and fixed length,otherwise even the rescaled score decreases to zero, generating falsepositives on long traces.

To solve these issues, a stochastic model of the edge traversal frequency is used. For each trace of the training set, our algorithm counts the number of edge traversals (i.e., Markov model edges or FSA edges). The number is then normalized w.r.t. all the edges, obtaining frequencies. Each edge is then associated with the sample Xedge = x1, x2, . . .. We show that the random sample Xedge is well estimated using a Beta distribution. Figure . shows sample plots of this model estimated using the mt-daapd training set; the quantiles associated with the thresholds are computed and shown as well. As we did for the Gaussian model, the detection thresholds are defined at configuration time as a percentile Tedge centered on


[Figure .: Two different estimations of the edge frequency distribution: two Beta fits with the corresponding thresholds, estimated on the mt-daapd training set.]

the mean (Figure .). We chose the Beta for its high flexibility; aGaussian is unsuitable to model skewed phenomena.

Model Validation. Our implementation is optimized to avoid overfitting and meaningless estimations. A model is valid only if the training set includes a significant spread (e.g., |min_i{x_i} − max_i{x_i}| ≥ δxmin = 0.04) over a sufficient amount (N^min_edge = 6) of paths. Otherwise, it constructs a simpler frequency range model. The model exhibits the side effect of discarding the extreme values found in training, which leads to erroneous decisions. More precisely, if the sample is Xedge = 1, 1, . . . , 0.9, 1, the right boundary will never be exactly 1, and therefore legal values will be discarded. To solve this issue, the quantiles close to 1 are approximated to 1 according to a configuration parameter Xcut. For instance, if Xcut = 3, the quantile FX(·) = 0.999 is approximated to 1.

... Path Similarity Using Self Organizing Maps

Path argument models are already implemented in SADE and FSA-DF. Several general-purpose string comparison techniques have been proposed so far, especially in the field of database systems and data cleansing [Elmagarmid et al., ]. We propose a solution based on Symbol-SOMs [Somervuo, ] to define an accurate distance metric between paths. The Symbol SOM implements a smooth similarity measure otherwise unachievable using common, crisp distance functions among strings (e.g., edit distance).

e technique exploits SOMs, which are unsupervised neu-ral algorithms. A SOM produces a compressed, multidimensionalrepresentation (usually a bi-dimensional map) of the input spaceby preserving the main topological properties. It is initialized ran-domly, and then adapted via a competitive and cooperative learn-ing process. At each cycle, a new input is compared to the knownmodels, and the BMU node is selected. e BMU and its neigh-borhood models are then updated to make them better resemblefuture inputs.

We use the technique described in [Kohonen and Somervuo, ] to map strings onto SOMs. Formally, let St = [st(1) · · · st(L)] denote the t-th string over the alphabet A of size |A|. Each symbol st(i), i = 1 . . . L, is then encoded into a vector s̄t(i) of size |A|, initialized with zeroes except at the w-th position, which corresponds to the index of the encoded symbol (e.g., st(i) = 'b' would be s̄t(i) = [0 1 0 0 · · · 0]^T, w = 2). Thus, each string St is represented as a sequence of L vectors like s̄t(i), i.e., an L × |A| matrix S̄t.
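In code, the encoding step can be sketched as follows; the fixed alphabet size, the maximum length, and the index function are illustrative assumptions.

#include <string.h>

/* Sketch of the one-hot symbol encoding described above: each
 * character becomes a |A|-sized vector, all zeroes except for a 1
 * at the index of the symbol. */
#define A 64       /* assumed alphabet size |A| */
#define MAXLEN 256 /* assumed maximum string length */

static int index_of(char c) {
    /* stand-in for a real symbol-to-index table over the alphabet */
    return (unsigned char)c % A;
}

int encode_string(const char *s, double out[][A]) {
    int L = (int)strlen(s);
    if (L > MAXLEN) L = MAXLEN;
    for (int i = 0; i < L; i++) {
        memset(out[i], 0, sizeof(double) * A);
        out[i][index_of(s[i])] = 1.0;
    }
    return L; /* rows of the resulting L x |A| matrix */
}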

Let S̄t and M̄k denote two vector-encoded strings, where M̄k is the model associated with SOM node k. The distance between the two strings is D′(St, Mk) = D(S̄t, M̄k). D(·, ·) is also defined in the case of LSt = |St| ≠ |Mk| = LMk, relying on dynamic time warping techniques to find the best alignment between the two sequences before computing the distance. Without going into details, the algorithm [Somervuo, ] aligns the two sequences s̄t(i) ∈ S̄t, m̄k(j) ∈ M̄k using a mapping [s̄t(i), m̄k(j)] ↦ [s̄t(i(p)), m̄k(j(p))] defined through the warping function F: [i, j] ↦ [i(p), j(p)]:

F = [[i(1), j(1)], . . . , [i(p), j(p)], . . . , [i(P), j(P)]].

The distance function D is defined over the warping alignment of size P:

$$D(\bar{S}_t, \bar{M}_k) = \sum_{p=1}^{P} d(i, j),$$

where P = LSt = LMk if the two strings have equal lengths. More precisely:

$$d(i, j) = d(i(p), j(p)) = \left\| \bar{s}_t(i(p)) - \bar{m}_k(j(p)) \right\|.$$

The distance is defined upon gi,j = g(i, j), the variable which stores the cumulative distance at each trellis point (i, j) = (i(p), j(p)). The trellis is first initialized to 0 in (0, 0) and to +∞ for both (0, ·) and (·, 0); otherwise:

$$g(i, j) = \min \begin{cases} g(i, j-1) + d(i, j) \\ g(i-1, j-1) + d(i, j) \\ g(i-1, j) + d(i, j) \end{cases}$$

Note that i ∈ [1, LSt] and j ∈ [1, LMk], thus the total distance is D(S̄t, M̄k) = g(LSt, LMk). A simple example of distance computation is shown in Figure . (A is the English alphabet plus extra characters); the overall distance is D′(St, Mk) = 8.485. We used a symmetric Gaussian neighborhood function h whose center is located at the BMU c(t). More precisely:

$$h(k, c(t), t) = \alpha(t) \, e^{-\frac{d(c(t), k)^2}{2\sigma^2(t)}}$$

where α(t) controls the learning rate and σ(t) is the actual width of the neighborhood function. The SOM algorithm uses two training cycles: during (1) adaptation the map is more flexible, while during (2) tuning the learning rate α(·) and the width of the neighborhood σ(·) are decreased. On each phase such parameters are denoted as α1, α2, σ1, σ2.
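Putting the pieces together, the cumulative-distance recursion above translates into a short dynamic-programming routine. The following C sketch reuses the one-hot encoding conventions of the earlier snippet; buffer handling and names are illustrative.

#include <math.h>
#include <stdlib.h>

#define A 64 /* alphabet size, as in the encoding sketch above */

/* Euclidean distance d(i,j) between two one-hot symbol vectors */
static double sym_dist(const double s[A], const double m[A]) {
    double acc = 0.0;
    for (int w = 0; w < A; w++) {
        double d = s[w] - m[w];
        acc += d * d;
    }
    return sqrt(acc);
}

static double min3(double a, double b, double c) {
    double m = a < b ? a : b;
    return m < c ? m : c;
}

/* D(S,M) = g(Ls, Lm), with g(0,0) = 0 and g(i,0) = g(0,j) = +inf,
 * following the trellis recursion given above. */
double som_string_distance(double s[][A], int Ls, double m[][A], int Lm) {
    double *g = malloc((size_t)(Ls + 1) * (Lm + 1) * sizeof *g);
    if (!g) return INFINITY;
#define G(i, j) g[(i) * (Lm + 1) + (j)]
    G(0, 0) = 0.0;
    for (int i = 1; i <= Ls; i++) G(i, 0) = INFINITY;
    for (int j = 1; j <= Lm; j++) G(0, j) = INFINITY;
    for (int i = 1; i <= Ls; i++)
        for (int j = 1; j <= Lm; j++) {
            double d = sym_dist(s[i - 1], m[j - 1]);
            G(i, j) = d + min3(G(i, j - 1), G(i - 1, j - 1), G(i - 1, j));
        }
    double total = G(Ls, Lm);
#undef G
    free(g);
    return total;
}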

Symbol SOMs are "plugged" into FSA-DF by associating each transition with the set of BMUs learned during training. At detection, an alert occurs whenever a path argument falls into the neighborhood of a non-existing BMU. Similarly, in the case of SADE, the neighborhood function is used to decide whether the string is anomalous or not, according to a proper threshold, which is the minimum value of the neighborhood function encountered during training, for each node.
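For instance, the detection-side check could be sketched as below, assuming the distance routine above; the node layout is illustrative, and thresholding the raw BMU distance is a simplified stand-in for the neighborhood-based criterion just described.

#include <math.h>

#define A 64       /* as in the previous sketches */
#define MAXLEN 256

/* Illustrative SOM node: the encoded model string M_k plus a
 * per-node threshold learned during training. */
typedef struct {
    double model[MAXLEN][A];
    int len;
    double max_dist; /* largest BMU distance seen in training (assumed) */
} som_node_t;

double som_string_distance(double s[][A], int Ls, double m[][A], int Lm);

int path_is_anomalous(som_node_t *nodes, int n_nodes,
                      double s[][A], int Ls)
{
    int bmu = 0;
    double best = INFINITY;
    for (int k = 0; k < n_nodes; k++) { /* BMU search */
        double d = som_string_distance(s, Ls, nodes[k].model, nodes[k].len);
        if (d < best) { best = d; bmu = k; }
    }
    /* farther from its BMU than anything seen in training */
    return best > nodes[bmu].max_dist;
}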

.. Experimental Results

In this section we describe our efforts to cope with the lack of reliable testing datasets for intrusion detection. The testing methodology is detailed here, along with the experiments we designed.


         /       b       i       n       /       s       h
    0    +∞      +∞      +∞      +∞      +∞      +∞      +∞
/   +∞   0       1.414   2.828   4.242   4.242   5.656   7.071
v   +∞   1.414   1.414   2.828   4.242   5.656   5.656   7.071
a   +∞   2.828   2.828   2.828   4.242   5.656   7.071   7.071
r   +∞   4.242   5.656   5.656   5.656   4.242   5.656   7.071
/   +∞   4.242   4.242   4.242   4.242   5.656   7.071   8.485
l   +∞   5.656   5.656   7.071   7.071   5.656   5.656   7.071
o   +∞   7.071   7.071   7.071   8.485   7.071   7.071   7.071
g   +∞   8.485   8.485   8.485   8.485   8.485   8.485   8.485

Figure .: Distance computation example.

Both detection accuracy and performance overhead are subjects of our tests.

In order to evaluate and highlight the impact of each specific model, we performed targeted tests rather than reporting general DRs and FPRs only. Also, we ensured that all possible alert types are inspected (i.e., true/false positive/negative). In particular, for each IDS, we included one legal trace in which file operations are performed on files never seen during training but with a similar name (e.g., training on /tmp/log, testing on /tmp/log2); secondly, we inserted a trace which mimics an attack.

... Comparison of Detection Accuracy

The detection accuracy of Hybrid IDS (H), FSA-DF (F) and SADE (S) is here analyzed and compared. Both training parameters and detection results are summarized in Table .. The parameters used to train the SOM are fixed except for σ1(t): α1(t) = 0.5 ÷ 0.01, σ2(t) = 3 and α2(t) = 0.1 ÷ 0.01. Percentiles for both Xargs and Xedge are detailed. The "paths/cycle" (paths per cycle) row indicates the amount of path arguments used for training the SOM. The settings for the clustering stage of SADE are constant: minimum number of clusters (3, or 2 in the case of the open); maximum merging distance (6, or 10 in the case of the open); the "null" and the "don't care" probability values are fixed at 0.1 and 10,


Training     Ses.   Calls   Progr.   t (Clust., HMM) [s]   X [call/s]
             Avg. processing time: 1.2910 · 10^-4 s/call

Detection    Ses.   Calls   Progr.   t [s]                 X [call/s]
             Avg. processing time: 5.6581 · 10^-5 s/call

Table .: Training and detection throughput X. On the same dataset, the average processing time with SADE disabled is around 0.08741084; thus, in the worst case, SADE introduces a 0.1476 overhead (estimated).

              sing   mt-daapd   proftpd   sudo   BitchX
SOM           ×      ×          ×         ×      ×
Traces
Syscalls
Paths
Paths/cycle

Table .: Parameters used to train the IDSs. Values include the number of traces used, the amount of paths encountered, and the number of paths per cycle.


respectively, while 10 is the maximum number of leaf clusters. In order to give a better understanding of how each prototype works, we analyzed the detection results on each target application by hand.

sing: Hybrid IDS is not tricked by the false positive mimic trace inserted. The Symbol SOM model recognizes the similarity of /tmp/log3 with the other paths inserted in the training. Instead, both FSA-DF and SADE raise false alarms; the former has never seen the path during training, while the latter recognizes the string in the tree path model but raises an alarm because of a threshold violation. SADE recognizes the attack containing the longer subsequent invocations of mmap2; FSA-DF also raises a violation on the file name, because it has never been trained against /etc/passwd nor /etc/shadow; and Hybrid IDS is triggered because the paths are placed in a different SOM region w.r.t. the training.

mt-daapd: The legit traces violate the binary and unary relations, causing several false alarms on FSA-DF. On the other hand, the smoother path similarity model allows Hybrid IDS and SADE to pass the test with no false positives. The changes in the control flow caused by the attacks are recognized by all the IDSs. In particular, the DoS attack (a specially-crafted request sent fifty times) triggers an anomaly in the edge frequency model.

proftpd: The legit trace is correctly handled by all the IDSs, as well as the anomalous root shell that causes unexpected calls (setuid, setgid and execve) to be invoked. However, FSA-DF flags many benign system calls as anomalous because of temporary file paths not present in the training.

sudo: Legit traces are correctly recognized by all the engines, and attacks are detected without errors. SADE fires an alert because of a missing edge in the Markov model (i.e., the unexpected execution of chown root:root script and chmod +s script). Also, the absence of the script string in the training triggers a unary relation violation in FSA-DF and a SOM violation in Hybrid IDS. The traces which mimic the attack are erroneously flagged as anomalous, because their system call sequences are strictly similar to the attack.

BitchX: The exploit is easily detected by all the IDSs as a control flow violation, since extra execve system calls are invoked to execute the injected commands. Furthermore, the Hybrid IDS anomaly engine is triggered by three edge frequency violations, due to paths passed to the FSA when the attack is performed which differ from the expected ones.

... Specific Comparison of SOM-SADE and SADE

We also specifically tested how the introduction of a Symbol SOM improves over the original probabilistic tree used for modeling the path arguments in SADE. Training parameters are reported in Table ., while results are summarized in Table .: the FPR decreases in the second test. However, the first test exhibits a lower FNR, as detailed in the following.

The mcweject utility is affected by a stack overflow (CVE--) caused by improper bounds checking. Root privileges can be gained if mcweject is setuid. Launching the exploit is as easy as eject -t illegal payload, but we performed it through the userland execution technique we describe in Section .., to make it stealthier by avoiding the execve that would obviously trigger an alert in SADE for a missing edge in the Markov chain. Instead, we are interested in comparing the string models only. SOM-SADE detects the attack with no issues, because of the use of different "types" of paths in the opens.

An erroneous computation of a buffer length is exploited to execute code via a specially crafted PAX archive passed to bsdtar (CVE--). A heap overflow allows overwriting a structure pointer which itself contains a pointer to a function called right after the overflow. The custom exploit [Maggi et al., ] basically redirects that pointer to the injected shellcode. Both the original string model and the Symbol SOM model detect the attack when an unexpected special file (i.e., /dev/tty) is opened. However, the original model raises many false positives when significantly different paths are encountered. This situation is instead handled with no false positives by the smooth Symbol SOM model.


              sing   mt-daapd   proftpd   sudo   BitchX
Traces
Syscalls
SADE
FSA-DF
Hybrid IDS

Table .: Comparison of the FPR of SADE vs. FSA-DF vs. Hybrid IDS. Values include the number of traces used. An accurate description of the impact of each individual model is in Section ...

              mcweject   bsdtar
SOM           ×          ×
Traces
Syscalls
Paths
Paths/cycle

Table .: Parameters used to train the IDSs. Values include the number of traces used, the amount of paths encountered, and the number of paths per cycle.

              mcweject   bsdtar
Traces
Syscalls
SADE
SOM-SADE

Table .: Comparison of the FPR of SADE vs. SOM-SADE. Values include the number of traces used.


               sing   sudo   BitchX   mcweject   bsdtar   Speed
System calls
SADE           .      .      .        .          .
FSA-DF         .      .      .        -          -
Hybrid IDS     .      .      -        -
SOM-SADE       -      -      -        .

Table .: Detection performance measured in "seconds per system call". The average speed is measured in system calls per second (last column).

Since this dataset was originally prepared for [Maggi et al., , a], details on its generation are described in Section ..

... Performance Evaluation and Complexity Discussion

We performed both empirical measurements and a theoretical analysis of the performance of the various proposed prototypes. Detection speed results are summarized in Table .. The datasets for detection accuracy were reused: we selected the five test applications on which the IDSs performed worst. Hybrid IDS is slow because the BMU algorithm for the Symbol SOM is invoked for each system call with a path argument (opens are quite frequent), slowing down the detection phase. Also, we recall that the current prototype relies on a system call interceptor based on ptrace, which introduces high runtime overheads, as shown in [Bhatkar et al., ]. To obtain better performance, an in-kernel interceptor could be used. The theoretical performance of each engine can be estimated by analyzing the bottleneck algorithm.

... Complexity of FSA-DF

During training, the bottleneck is the binary relation learning algorithm: T^train_F = O(S · M + N), where M is the total number of system calls, S = |Q| is the number of states of the automaton, and N is the sum of the lengths of all the string arguments in the training set. At detection, T^det_FSA-DF = O(M + N).

Assuming that each system call has O(1) arguments, the training algorithm is invoked O(M) times. The time complexity of each i-th iteration is Y_i + |X_i|, where Y_i is the time required to compute all the unary and binary relations and |X_i| indicates the time required to process the i-th system call X. Thus, the overall complexity is bounded by Σ_{i=1}^{M} (Y + |X_i|) = M · Y + Σ_{i=1}^{M} |X_i|. The second factor Σ_{i=1}^{M} |X_i| can be simplified to N because strings are represented as a tree; it can be shown [Bhatkar et al., ] that the total time required to keep the longest common prefix information is bounded by the total length of all input strings. Furthermore, Y is bounded by the number of unique arguments, which in turn is bounded by S; thus, T^train_F = O(S · M + N). This also proves the time complexity of the detection algorithm which, at each state and for each input, requires unary and binary checks to be performed; thus, its cost is bounded by M + N. □

... Complexity of Hybrid IDS

In the training phase, the bottleneck is the Symbol SOM creation time: T^train_H = O(C · D · (L² + L)), where C is the number of learning cycles, D is the number of nodes, and L is the maximum length of an input string. At detection time, T^det_H = O(M · D · L²).

T^train_H depends on the number of training cycles, the BMU algorithm, and the node updating. The input is randomized at each training session and a constant amount of paths is used, thus the input size is O(1). The BMU algorithm depends on both the SOM size and the distance computation, bounded by L_input · L_node = L², where L_input and L_node are the lengths of the input string and the node string, respectively. More precisely, the distance between strings is computed by comparing all the vectors representing, respectively, each character of the input string and each character of the node string. The char-by-char comparison is performed in O(1) because the size of each character vector is fixed. Thus, the distance computation is bounded by L² ≃ L_input · L_node. The node updating algorithm depends on the number of nodes D, the length of the node string L_node, and the training cycles C; hence each cycle requires O(D · (L² + L)), where L is the length of the longest string. The creation of the FSA is similar to the FSA-DF training, except for the computation of the relations between strings, whose time is no longer O(N) but is bounded by M · D · L² (i.e., the time required to find the Best Matching Unit for one string). Thus, according to Proof , this phase requires O(S · M + M · D · L²) < O(C · D · (L² + L)). The detection time T^det_H is bounded by the BMU algorithm, that is O(M · D · L²). □

The clustering phase of SADE is O(N²), while with SOM-SADE it grows to O(N² L²). In the worst case, the clustering algorithm used in [Maggi et al., a] is known to be O(N²), where N is the number of system calls: the distance function is O(1) and the distance matrix is searched for the two closest clusters. In the case of SOM-SADE, the distance function is instead O(L²), as it requires one run of the BMU algorithm. □


. Forensics Use of Anomaly Detection Techniques

Anti-forensics is the practice of circumventing classical forensic analysis procedures, making them unreliable or impossible. In this section we describe how machine learning algorithms and anomaly detection techniques can be exploited to cope with a wide class of definitive anti-forensics techniques. In [Maggi et al., ] we tested SADE, described in Section . (including the improvements detailed in Section .), on a dataset we created through the implementation of an innovative anti-forensics technique, and we show that our approach yields promising results in terms of detection.

Computer forensics is usually defined as the process of applying scientific, repeatable analysis processes to data and computer systems, with the objective of producing evidence that can be used in an investigation or in a court of law. More generally, it is the set of techniques that can be applied to understand if, and how, a system has been used or abused to commit mischief [Mohay et al., ]. The increasing use of forensic techniques has led to the development of anti-forensic techniques that can make this process difficult, or impossible [Garfinkel, ; Berghel, ; Harris, ].

Anti-forensics techniques can be divided into at least two groups, depending on their target. If the identification phase is targeted, we have transient anti-forensics techniques, which make the acquired evidence difficult to analyze with a specific tool or procedure, but not impossible to analyze in general. If instead the acquisition phase is targeted, we have the more effective class of definitive anti-forensics techniques, which effectively deny once and forever any access to the evidence. In this case, the evidence may be destroyed by the attacker, or may simply never exist on the media. This is the case of the in-memory injection techniques that are described in the following.

In particular, we propose the use of machine learning algorithms and anomaly detectors to circumvent such techniques. Note that, referring once again to the aforementioned definition of computer forensics in [Mohay et al., ], we focused only on detecting if a system has been compromised. In fact, definitive anti-forensics techniques can make it impossible to detect how the system has been exploited. We illustrate a prototype of an anomaly detector which analyzes the sequence and the arguments of system calls to detect intrusions. We also use this prototype to detect in-memory injections of executable code, and in-memory execution of binaries (the so-called "userland exec" technique, which we re-implement in a reliable way). This creates a usable audit trail, without needing to resort to complex memory dump and analysis operations [Burdach, ; Ring and Cole, ].

.. Problem statement

Anti-forensics is defined by symmetry on the traditional definition of computer forensics: it is the set of techniques that an attacker may employ to make it difficult, or impossible, to apply scientific analysis processes to the computer systems he penetrates, in order to gather evidence [Garfinkel, ; Berghel, ; Harris, ]. The final objective of anti-forensics is to reduce the quantity and spoil the quality [Grugq, ] of the evidence that can be retrieved by an investigation and subsequently used in a court of law.

Following the widely accepted partition of forensics [Pollitt, ] into acquisition, identification, evaluation, and presentation, the only two phases where technology can be critically sabotaged are acquisition and identification. Therefore, we can define anti-forensics as follows.

Definition .. (Anti-forensics) Anti-forensics is a combination ofall the methods that make acquisition, preservation and analysis ofcomputer-generated and computer-stored data difficult, unreliableor meaningless for law enforcement and investigation purposes.

Even if more complex taxonomies have been proposed [Harris, ], we can use the traditional partition of the forensic process to distinguish between two types of anti-forensics:

Transient anti-forensics when the identification phase is targeted,making the acquired evidence difficult to analyze with a spe-cific tool or procedure, but not impossible to analyze in gen-eral.

Definitive anti-forensics when the acquisition phase is targeted, ru-ining the evidence or making it impossible to acquire.


Examples of transient anti-forensics techniques are the fuzzing and abuse of file-systems in order to create malfunctions or to exploit vulnerabilities of the tools used by the analyst, or the exploitation of vulnerabilities in log analysis tools to hide or modify certain information [Foster and Liu, ; Grugq, ]. In other cases, entire file-systems have been hidden inside the metadata of other file-systems [Grugq, ], but techniques have been developed to cope with such attempts [Piper et al., ]. Other examples are the use of steganography [Johnson and Jajodia, ], or the modification of file metadata in order to make the file type not discoverable. In these cases the evidence is not completely unrecoverable, but it may escape any quick or superficial examination of the media: a common problem today, where investigators are overwhelmed with cases and usually under-trained, and therefore overly reliant on tools.

Definitive anti-forensics, on the other hand, effectively denies access to the evidence. The attackers may encrypt it, or securely delete it from file-systems (a process sometimes called "counter-forensics") with varying degrees of success [Geiger, ; Garfinkel and Shelat, ]. Access times may be rearranged to hide the time correlation that is usually exploited by analysts to reconstruct the timeline of events. The ultimate anti-forensics methodology is not to leave a trail: for instance, modern attack tools (commercial or open source) such as Metasploit [Metasploit, ], Mosdef or Core IMPACT [Core Security Technologies, ] focus on pivoting and in-memory injection of code: in this case, nothing or almost nothing is written on disk, and therefore information on the attack will be lost as soon as the machine is powered down, which is usually standard operating procedure on compromised machines. These techniques are also known as "disk-avoiding" procedures.

Memory dump and analysis operations have been advocated in response to this, and tools are being built to cope with the complex task of the reliable acquisition [Burdach, ; Schatz, ] and analysis [Burdach, ; Ring and Cole, ; Nick L. Petroni et al., ] of a modern system's memory. However, even in the case that the memory can be acquired and examined, if the injected process has already terminated, once more, no trace of the attack will be found: these techniques are much more useful against in-memory resident backdoors and rootkits, which by definition are persistent.


[Figure .: An illustration of the in-memory execution technique: (1) the attacker sends the exploit code with the SELF payload to the victim; (2) the SELF auto-loader stages an arbitrary ELF on a specially-crafted stack (argc, argv[], envp[], alignment) in the vulnerable code's address space; (3) the arbitrary ELF responds.]

.. Experimental Results

Even in this evaluation, we used the dataset described in Section .. However, a slight modification of the generation mechanism was needed. More precisely, in the tests conducted we used a modified version of SELF [Pluf and Ripe, ], which we improved in order to reliably run under FreeBSD . and ported to a form which could be executed through code injection (i.e., to shellcode format). SELF implements a technique known as userland exec: it modifies any statically linked Executable and Linkable Format (ELF) binary and, by building a specially-crafted stack, it allows an attacker to load and run that ELF in the memory space of a target process without calling the kernel and, more importantly, without leaving any trace on the hard disk of the attacked machine. This is done through a two-stage attack where a shellcode is injected into the vulnerable program, retrieves a modified ELF from a remote machine, and subsequently injects it into the memory space of the running target process, as shown schematically in Figure ..

Eight experiments with both eject and bsdtar were performed. Our anomaly detector was first trained with ten different executions of eject and more than a hundred executions of bsdtar. We also audited eight instances of the activity of eject under attack, while for bsdtar we logged seven malicious executions. We repeated the tests both with a simple shellcode which opens a root shell (a simple execve of /bin/sh) and with our implementation of the userland exec technique.

The overall results are summarized in Table .. Let us first consider the effectiveness of the detection of the attacks themselves.


Executable     Regular shellcode        Userland exec
               FPR        DR            FPR        DR
eject
bsdtar

Table .: Experimental results with a regular shellcode and with our userland exec implementation.

The attacks against eject are detected with no false positives at all. The exploit is detected at the very beginning: since a very long argument is passed to the execve, this triggers the argument model. The detection accuracy is similar in the case of bsdtar, even if in this case there are some false positives. The detection of the shellcode happens with the first open of the unexpected special file /dev/tty. It must be underlined that most of the true alerts are correctly fired at the system call level; this means that malicious calls are flagged by our IDS because of their unexpected arguments, for instance.

On the other hand, exploiting the userland exec, an attacker launches an otherwise normal executable, but of course such an executable issues different system calls, in a different order, and with different arguments than the ones expected in the monitored process. This is reflected in the fact that we achieved a 100% DR with no increase in false positives, as each executable we have run through SELF has produced a Markov model which significantly differs from the one learned for the exploited host process.


. Concluding Remarks

In this chapter we first described in detail two contributions to host-based anomaly detection, and then demonstrated how these techniques can be successfully applied to circumvent anti-forensics tools often used by intruders to avoid leaving traces on the file-system.

First, we described a host-based IDS based on the analysis of system call arguments and sequences. In particular, we analyzed previous literature on the subject, and found that there exists only a handful of works which take into account the anomalies in such arguments. We improved the models suggested in one of these works, we added a clustering stage in order to characterize normal invocations of calls and to better fit models to arguments, and finally we complemented it with Markov models in order to capture correlations between system calls.

We showed how the prototype is able to correctly contextualize alarms, giving the user more information to understand what caused any false positive, and to detect variations over the execution flow, as opposed to punctual variations over single instances. We also demonstrated its improved detection capabilities, and a reduction of false positives. The system is auto-tuning and fully unsupervised, even if a range of parameters can be set by the user to improve the quality of detection.

A possible future extension of the technique we described is the analysis of complementary approaches (such as Markov model merging or the computation of distance metrics) to better detect anomalies in the case of long system call sequences, which we identified as a possible source of false positives. In the following section, we described how a deterministic behavioral model, i.e., an FSA, often gives a better and more accurate characterization of the process flow. However, to achieve a better FPR than SADE we had to add several stochastic models to avoid many false detections; this, as expected, is paid at the price of higher computational overheads.

Secondly, we demonstrated that a good alternative to using distance metrics between Markov models is the exploitation of deterministic models for the control flow. More precisely, we showed how the deterministic dataflow relations between arguments implemented in FSA-DF can be improved by using better statistical models. In addition, we partially addressed the problem of spurious datasets by introducing a Gaussian string model, which has been shown to be more resilient to outliers (i.e., too long or too short strings). We also proposed a new model for counting the frequency of edge traversals on FSA-DF, to make it able to detect DoS attacks. Both systems needed an improved model for string (path) similarity. We adapted the Symbol SOM algorithm to make it suitable for computing a distance between two paths. We believe that this is the core contribution.

We tested and compared the original prototypes with a hybrid solution where the Symbol SOM and the edge traversal models are applied to the FSA, and a version of SADE enhanced with the Symbol SOM and the correction to the execution arguments model. Both the new prototypes have the same DRs as the original ones, but significantly lower FPRs. This is paid in terms of a non-negligible limit to detection speed, at least in our proof-of-concept implementation.

Future efforts will focus on re-engineering the prototypes to use an in-kernel system call interceptor, and on generally improving their performance. We are studying how to speed up the Symbol SOM node search algorithm, in order to bring the throughput to a rate suitable for online use.

Last, we analyzed the wide class of definitive anti-forensics techniques which try to eliminate evidence by avoiding disk usage. In particular, we focused on in-memory injection techniques. Such techniques are widely used by modern attack tools (both commercial and open source).

As memory dump and analysis is inconvenient to perform, often not part of standard operating procedures, and does not help except in the case of in-memory resident backdoors and rootkits, we proposed an alternative approach to circumvent such techniques. We illustrated how a prototype which analyzes (using learning algorithms) the sequence and the arguments of system calls to detect intrusions can be used to detect in-memory injections of executable code, and in-memory execution of binaries.

We proposed an experimental setup using vulnerable versions of two widely used programs on FreeBSD, eject and bsdtar. We described the creation of a training and testing dataset, how we adapted or created exploits for such vulnerabilities, and how we recorded audit data. We also developed an advanced in-memory execution payload, based on SELF, which implements the "userland exec" technique through an injectable shellcode and a self-loading object (a specially-crafted, statically linked ELF file). The payload executes any statically linked binary in the memory space of a target process without calling the kernel and, more importantly, without leaving any trace on the hard disk of the attacked machine.

We performed several experiments, with excellent DRs for the exploits, but even more importantly with a 100% DR for the in-memory execution payload itself. We can positively conclude that our technique yields promising results for creating a forensic audit trail of otherwise "invisible" injection techniques. Future developments will include more extensive testing with different anti-forensics techniques, and the development of a specifically designed forensic output option for our prototype.

Anomaly Detection of Web-based Attacks

Anomaly detection techniques have been shown to be very effective at mitigating the spread of malicious activity against web applications. In fact, nowadays, the underground criminals' preferred way of spreading malware consists in compromising vulnerable web applications, deploying a phishing, spamming, or malware kit, and infecting the enormous number of users that simply visit the website using a vulnerable browser (or plug-in). Further exacerbating the situation is the use of botnets to exhaustively compromise vast amounts of web applications. Thus, due to their popularity, web applications play a significant role in the malware spreading workflow. As a consequence, by blocking attacks against web applications the majority of potential infections can be avoided. Indirectly, all the visitors of the website benefit from the adoption of web-based IDSs.

Unfortunately, even the most advanced anomaly detectors of web-based attacks are not without their drawbacks. In particular, in this chapter we discuss our contributions to overcome two relevant training issues, overfitting and concept drift, which both manifest themselves as increased FPRs. First, we address overfitting due to training data scarcity, which results in under-trained activity models with poor generalization capabilities. Consequently, a considerable amount of normal events is classified as malicious. Secondly,


we propose a simple but extremely effective mechanism to detectchanges in the monitored system to adapt the models to what wenamed web application concept drift.

.. Preliminaries

. Preliminaries

e two contributions described in Section . and . are bothbased on the generic anomaly detection architecture described inthe following. In this section, a generic model of an anomaly-baseddetector of attacks against web applications is given. In addition,the details of the datasets used to evaluate the effectiveness of boththe approach are described.

.. Anomaly Detectors of Web-based Attacks

Unless otherwise stated, we use the shorthand term anomaly detector to refer to anomaly-based detectors that leverage unsupervised machine learning techniques. Let us first overview the basic concepts.

Without loss of generality, a set of web applications A can be organized into a set of resource paths or components R, and named parameters P. For example, A = {a1, a2} may contain a blog application a1 = blog.example.com and an e-commerce application a2 = store.example.com. They can be decomposed into their resource paths:

R1 = {r1,1 = /article/, r1,2 = /comments/, r1,3 = /comments/edit/, r1,4 = /account/, r1,5 = /account/password/}

R2 = {r2,1 = /list/, r2,2 = /cart/, r2,3 = /cart/add/, r2,4 = /account/, r2,5 = /account/password/}

In this example, resource path r1,5 might take a set of parameters,as part of the HTTP request. For instance, the following requestto r1,5 = /account/password/

GET /account/password/id/12/oldpwd/14m3/newpwd/1337/ HTTP/1.1

can be parsed into the following abstract data structure:

P1,5 = { ⟨p1,5,1 = id, v1,5,1 = 12⟩, ⟨p1,5,2 = oldpwd, v1,5,2 = 14m3⟩, ⟨p1,5,3 = newpwd, v1,5,3 = 1337⟩ }.
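A minimal C sketch of this decomposition follows, assuming the resource prefix /account/password/ has already been stripped; the in-place strtok-based split and buffer sizes are illustrative.

#include <stdio.h>
#include <string.h>

/* Sketch: split a PATH_INFO-style parameter string such as
 * "id/12/oldpwd/14m3/newpwd/1337/" into name-value pairs. */
static int parse_pairs(char *s, const char *names[],
                       const char *values[], int max)
{
    int n = 0;
    for (char *tok = strtok(s, "/"); tok && n < max; tok = strtok(NULL, "/")) {
        names[n] = tok;
        values[n] = strtok(NULL, "/");
        if (!values[n]) break; /* dangling name without a value */
        n++;
    }
    return n; /* number of complete <p, v> pairs */
}

int main(void) {
    char q[] = "id/12/oldpwd/14m3/newpwd/1337/";
    const char *p[8], *v[8];
    int n = parse_pairs(q, p, v, 8);
    for (int i = 0; i < n; i++)
        printf("<%s = %s>\n", p[i], v[i]); /* <id = 12>, <oldpwd = 14m3>, ... */
    return 0;
}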

A generic web application IDS based on unsupervised learningtechniques captures the system activity I as a sequence of requests


Q = {q1, q2, . . .}, similar to those exemplified in Section ., issued by web clients to the set of monitored web applications. Each request q ∈ Q is represented by a tuple ⟨ai, ri,j, Pq⟩, where Pq is a set of parameter name-value pairs such that Pq ⊆ Pi,j.

During the initial training phase, the anomaly detection system learns the characteristics of the monitored web applications in terms of models. As new web application, resource path, and parameter instances are observed, the sets A, R, and P are updated. For each unique parameter pi,j,k observed in association with a particular application ai and path ri,j, a set of models that characterize the various features of the parameter is constructed. An activity profile (Definition ..) associated with each unique parameter instance is generated, c(·) = ⟨m(1), m(2), . . . , m(u), . . . , m(U)⟩, which we name (parameter) profile or model composition. Therefore, for each application ai and resource path ri,j, a set Ci,j of model compositions is constructed, one for each parameter pi,j,k ∈ Pi,j. The knowledge base of an anomaly detection system trained on web application ai is denoted by C_{ai} = ∪_j Ci,j. A graphical representation of how a knowledge base is modeled for multiple web applications is depicted in Figure ..

Example .. In webanomaly, a profile for a given parameter pi,j,k is the following tuple:

ci,j,k = ⟨m(tok), m(int), m(len), m(char), m(struct)⟩.

Similarly to LibAnomaly:

m(tok) models parameter values as a set of legal tokens (i.e., the set of possible values for the parameter gender, observed during training).

m(int) and m(len) describe normal intervals for literal integers and string lengths, respectively, using the Chebyshev inequality.

m(char) models character strings as a ranked frequency histogram, named Idealized Character Distribution (ICD); ICDs are compared using the χ2 or G tests.

m(struct) models sets of character strings by inducing an HMM. The HMM encodes a probabilistic grammar that represents a superset of the strings observed in a training set. Aside from the addition of m(int), which is a straightforward generalization of m(len) to numbers, the interested reader may refer to [Kruegel et al., ] for further details.
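To illustrate how one of these models works, the sketch below shows a length model in the spirit of m(len): it learns the sample mean and variance of the observed string lengths and scores new values with the Chebyshev bound. This is a simplified reconstruction for illustration, not the exact implementation used in LibAnomaly or webanomaly.

class LengthModel:
    # Sketch of m^(len): learn mean/variance of string lengths online
    # (Welford's algorithm), then score deviations via Chebyshev's inequality.
    def __init__(self) -> None:
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def train(self, value: str) -> None:
        self.n += 1
        x = float(len(value))
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def anomaly_score(self, value: str) -> float:
        # Chebyshev: P(|X - mean| >= dev) <= variance / dev^2, so the score
        # approaches 1 as the bound on the tail probability shrinks.
        var = self.m2 / (self.n - 1) if self.n > 1 else 0.0
        dev = abs(len(value) - self.mean)
        if dev == 0:
            return 0.0
        if var == 0:
            return 1.0
        return 1.0 - min(1.0, var / dev ** 2)

The variance learned during training acts as the tolerance: the larger the legitimate variability observed, the larger the deviations accepted at detection time.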

In addition to requests, the structure of user sessions can be taken into account to model the normal states of a server-side application. In this case, the anomaly detector does not consider individual requests independently, but models their sequences. This model captures the legitimate order of invocation of the resources, according to the application logic. An example is when a user is required to invoke an authentication resource (e.g., /user/auth) before requesting a private page (e.g., /user/profile). In [Kruegel et al., ], a session S is defined as a sequence of resources in R. For instance, given R = {r1, r2, . . . , r10}, a sample session is S = ⟨r3, r1, r2, r10, r2⟩.

Some systems also model the HTTP responses that are returned by the server. For example, in [Kruegel et al., ], a model m(doc) is presented that takes into account the structure of documents (e.g., HyperText Markup Language (HTML), XML, and JavaScript Object Notation (JSON)) in terms of partial trees that include security-relevant nodes (e.g., <script /> nodes, nodes containing DOM event handlers, and nodes that contain sensitive data such as credit card numbers). These trees are iteratively merged as new documents are observed, creating a superset of the allowed document structure and of the positions within the tree where client-side code or sensitive data may appear. A similar approach is adopted in [Criscione et al., ].

Note .. (Web Application Behavior) The concept of web application behavior, used in the following, is a particular case of the system behavior as defined by Definition ... Depending on the accuracy of each model, the behavior of a web application reflects the characteristics and functionalities that the application offers and, as a consequence, the content of the inputs (i.e., the requests) that it processes and the outputs (i.e., the responses) that it produces. Thus, unless otherwise stated, we use the same term to indicate both ideas.

Figure .: Overview of web application model construction. For each monitored application (e.g., http://blog.example.com, . . . , http://dav.example.com), the knowledge base Cai contains, for each resource path (e.g., /article, /comments, /account, /account/password), one model composition ⟨m(1), . . . , m(U)⟩ per parameter (e.g., id, date, title).

After training, the system is switched to detection mode, which is performed online. The models trained in the previous phase are queried to determine whether or not the newly observed parameters are anomalous. Without going into the details of a particular implementation, each parameter is compared against all the applicable models, and an aggregated anomaly score between 0 and 1 is calculated by composing the values returned by the various models. If the anomaly score is above a certain threshold, an alert is generated. For example, in [Kruegel et al., b], the anomaly score represents the probability that a parameter value is anomalous, and it is calculated using a Bayesian network that encodes the probability that a parameter value is actually normal or anomalous given the scores returned by the individual models.
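A minimal sketch of this detection-time composition follows. The uniform weights and the threshold are hypothetical placeholders; webanomaly composes the per-model scores with a Bayesian network rather than a weighted mean.

def aggregate_score(model_scores: dict[str, float],
                    weights: dict[str, float],
                    threshold: float = 0.5) -> tuple[float, bool]:
    # Weighted mean of the per-model anomaly scores, each in [0, 1].
    score = (sum(weights[u] * s for u, s in model_scores.items())
             / sum(weights[u] for u in model_scores))
    return score, score > threshold  # raise an alert if above threshold

score, alert = aggregate_score(
    {"tok": 0.1, "int": 0.0, "len": 0.9, "char": 0.4, "struct": 0.2},
    weights=dict.fromkeys(["tok", "int", "len", "char", "struct"], 1.0))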

.. A Comprehensive Detection System to Mitigate Web-based Attacks

The approaches described in [Kruegel et al., ] are further exploited in a recent work [Criscione et al., ], to which we partially contributed. In particular, we defined a web application anomaly detector that is able to detect real-world threats against the clients (e.g., malicious JavaScript code trying to exploit browser vulnerabilities), the application (e.g., cross-site scripting, permanent content injection), and the database layer (e.g., SQL injection). A prototype of the system, called Masibty, has been evaluated on a set of real-world attacks against publicly available applications, using both simple and mutated versions of exploits, in order to assess the resilience to evasion. Masibty has the following key characteristics:

• Its models are designed with the explicit goal of not requiring an attack-free dataset for training, which is an unrealistic requirement in real-world applications. Even though techniques have been suggested in [Cretu et al., a] to filter outliers (i.e., attacks) from the training data, in the absence of a ground truth there can be no guarantee that the dataset will be effectively free of attacks. Using such techniques before training Masibty would surely improve its detection capabilities.

• As depicted in Figure ., Masibty intercepts and processes both HTTP requests (i.e., PAnomaly) and responses (i.e., XSSAnomaly), and protects against both server-side and client-side attacks, an extremely important feature in the upcoming "Web 2.0" era of highly interactive websites based mainly on user-contributed content. In particular, we devised two novel anomaly detection models, referred to as "engines", based on the representation of the responses as trees.

• Masibty incorporates an optional data protection component, QueryAnomaly, that extracts and parses the SQL queries sent to the database server. This component is part of the analysis of HTTP requests, and thus is not merely a reverse proxy to the database. In fact, it allows the system to bind the requests to the SQL queries that they generate, directly or indirectly. Hence, queries can be modeled even though they are not explicitly passed as parameters of the HTTP requests.

Figure .: The logical structure of Masibty. Client requests undergo HTTP inspection before reaching the web server, and the resulting queries undergo SQL inspection before reaching the DB server; responses and results flow back along the same path. The engines are PAnomaly (Token, Distribution, Length, Presence, Order) for requests, XSSAnomaly (Crisp, JSEngine, Template) for responses, and QueryAnomaly (Structure) for queries.

.. Evaluation Data

The experiments in Sections . and . were conducted using a dataset drawn from real-world web applications deployed on both academic and industry web servers. Examples of representative applications include payroll processors, client management, and online commerce sites. For each application, the full content of each HTTP connection observed over a period of several months was recorded. The resulting flows were then filtered using the most advanced signature-based detection system, Snort (source and attack signatures available for download at http://snort.org), to remove known attacks. In total, the dataset contains distinct web applications, , unique resource paths, , unique parameters, and ,, HTTP requests.

Note .. (Dataset privacy) To preserve privacy, the dataset has been given to the Computer Security Laboratory of UC Santa Barbara under strict contractual agreements that forbid the disclosure of specific information identifying the web applications themselves.

Two sets of , and , attacks were introduced into the datasets used for the approaches described in Sections . and ., respectively. These attacks were real-world examples of, and variations upon, XSS (e.g., CVE--), SQL injections (e.g., CVE--), and command execution exploits (e.g., CVE--) that manifest themselves in request parameter values, which remain the most common attacks against web applications. Representative examples of these attacks include:

• malicious JavaScript inclusion:
  <script src="http://example.com/malware.js"></script>

• bypassing login authentication:
  ' OR 'x'='x'--

• command injection:
  cat /etc/passwd | mail [email protected] #

More precisely, the XSS attacks are variations on those listed in [Robert Hansen, ], the SQL injections were created similarly from [Ferruh Mavituna, ], and the command execution exploits were variations of common command injections against the Linux and Windows platforms.

. Training With Scarce Data

In this section, we describe our contributions [Robertson et al., ] to cope with the difficulty of obtaining sufficient amounts of training data. As we explained in Sections .. and ., this limitation has heretofore not been well studied. We developed this technique to work with webanomaly and, in general, with any learning-based anomaly detection system against web attacks. In fact, the problem described in the following is particularly evident in the case of web applications. However, the technique can be easily applied to other anomaly-based systems, as long as they rely on behavioral models and learning techniques such as those described in Section ..

The issue that motivates this work is that the number of web application component invocations is non-uniformly distributed. In fact, we noticed that relatively few components dominate the traffic, while the remaining components are accessed relatively infrequently. Therefore, for those components, it is difficult to gather enough training data to accurately model their normal behavior. In statistics, the infrequently accessed population is known as the "long tail". Note, however, that this does not necessarily imply a power law distribution. Clearly, components that are infrequently accessed lead to undertrained models (i.e., models that do not capture the normal behavior accurately). Consequently, models are subject to overfitting due to lack of data, resulting in increases in the FPR.

More precisely, in this section we describe our joint work with UC Santa Barbara, in which we propose to mitigate this problem by exploiting natural similarities among all web applications. In particular, we show that the values of the parameters extracted from HTTP requests can generally be categorized according to their type, such as an integer, date, or string. Indeed, our experiments demonstrate that parameters of similar type induce similar models of normal behavior. This result is then exploited to supplement a local scarcity of training data for a given web application component with similar data from other web applications.

This section is structured as follows. In Section .., the problem of the non-uniform distribution of traffic across the different components of a web application is discussed; in addition, we provide evidence that it occurs in the real world. In Section .., we describe an approach to address the problem of model profile undertraining by using the notion of global profiles, which exploit similarities between web application parameters of similar type. In Section .., the application of global profiles is evaluated on a large dataset of real-world traffic from many web applications, and we demonstrate that utilizing global profiles allows anomaly detectors to accurately model web application components that would otherwise be associated with undertrained models.

.. Non-uniformly distributed training data

To describe the undertraining problem, in this section we refer to the aforementioned, generic architecture of a web-based anomaly detector. To address the problem of undertraining, we leverage the set of models described in Example ...

Because anomaly detection systems dynamically learn specifications of normal behavior from training data, it is clear that the quality of the detection results critically relies upon the quality of the training data. For example, as mentioned in Section ., a training dataset should be attack-free and should accurately represent the normal behavior of the modeled features. To our knowledge, the difficulty of obtaining sufficient amounts of training data to accurately model web applications is less well-known. In a sense, this issue is similar to those addressed by statistical analysis methods with missing data [Little and Rubin, ]. Although a training procedure would benefit from such mechanisms, they require a complete redesign of the training algorithm specific to each model. Instead, a non-obtrusive approach that can improve an existing system without modifying the undertrained models is more desirable. Typically, anomaly-based detectors cannot assume the presence of a testing environment that can be leveraged to generate realistic training data that exercises the web application in a safe, attack-free environment. Instead, an anomaly detection system is typically deployed in front of live web applications with no a priori knowledge of the applications' components and their behavior.

In particular, in the case of low-traffic web applications, problems arise if the rate of client requests is inadequate to allow models to train in a timely manner. Even in the case of high-traffic web applications, however, a large subset of resource paths might fail to receive enough requests to adequately train the associated models. This phenomenon is a direct consequence of the fact that resource path invocations issued by web clients often follow a non-uniform distribution. To illustrate this point, Figure . plots the normalized cumulative distribution function of web client resource path invocations for a variety of real-world, high-traffic web applications (details on the source of this data are provided in Section ..). Although several applications have an approximately uniform client access distribution, a clear majority exhibits highly skewed distributions. Indeed, in many cases, a large percentage of resource paths receives a comparatively minuscule number of requests. Thus, returning to the example mentioned in Section .., assuming an overall request volume of , requests per day, the resource path set would result in the client access distribution detailed in Table ..

    Resource path        Requests
    /article             .
    /comments            .
    /comments/edit       .
    /account             .
    /account/password    .

Table .: Client access distribution on a real-world web application (based on , requests per day).

Profiles for parameters of resource paths such as /article will receive ample training data, while profiles associated with the account components will be undertrained. The impact of the problem is also magnified by the fact that components of a web application that are infrequently exercised are also likely to contain a disproportionately large portion of security vulnerabilities. This could be a consequence of the reduced amount of testing that developers invariably perform on less prominent components of a web application, resulting in a higher rate of software flaws. In addition, the relatively low request rate from users of the web application results in a reduced exposure rate for these flaws. Finally, when flaws are exposed and reported, correcting them may be given a lower priority than flaws in higher-traffic components of a web application.

Figure .: Web client resource path invocation distributions from a selection of real-world web applications (normalized access CDF over the fraction of resources, for Sites 1 through 24).

.. Exploiting global knowledge

The approach described in this section is based on a key observation: parameters associated with the invocation of components belonging to different web applications often exhibit a marked similarity to each other. For instance, many web applications take an integer value as a unique identifier for a class of objects such as a blog article or comment, as in the case of the id parameter. Similarly, applications also accept date ranges similar to the date parameter as identifiers or as constraints upon a search request. Likewise, as in the case of the title parameter, web applications often expect a short phrase of text as an input, or perhaps a longer block of text in the form of a comment body. In some sense, each of these groupings of similar parameters can be considered as a distinct parameter type, though this need not necessarily correspond to the concept of types as understood in the programming languages context.

Our approach is based on the following assumption: parameters of the same type tend to induce model compositions that are similar to each other in many respects. Consequently, if the lack of training data for a subset of the components of a web application prevents an anomaly detection system from constructing accurate profiles for the parameters of those components, we claim that it is possible to substitute, with an acceptably low decrease in the DR, profiles for similar parameters of the same type that were learned when enough training data was available to construct accurate profiles. It must be underlined that the substitution operates at the granularity of parameters rather than requests (which may contain more than one parameter). This increases the likelihood of finding applicable profile similarities, and allows for the substitution of models taken from radically different components. However, although the experiments on real-world data, described in Section .., confirm that the aforementioned insight is realistic, our hypothesis might not hold in some very specific settings. Thus, to minimize the risks brought by migrating global knowledge across different deployments, we interpreted this result only as an insight and developed a robust criterion able to find similar profiles independently of the actual types of the modeled parameters.

Figure .: Overall procedure. Profiles, both undertrained and well-trained, are collected from a set of web applications (first phase, offline). These profiles are processed offline to generate the global knowledge base C and index CI (second phase). C can then be queried online to find similar global profiles (third phase).

As schematized in Figure ., our approach is composed of three phases.

First phase (offline) Enhancement of the training procedure originally implemented in [Kruegel et al., ].

Second phase (offline) It is divided into three sub-steps.

1. A global knowledge base of profiles C = ∪ai Cai is constructed, where the Cai are knowledge bases containing only well-trained, stable profiles from anomaly detection systems previously deployed on a set of web applications ∪i ai.

2. A knowledge base CI = ∪ai CIai of undertrained profiles is then constructed as an index into C, where CIai is a knowledge base of undertrained profiles from the web application ai.

3. Finally, a mapping f : {CI} × Cai → C is defined.

Third phase (online) For any new web application where insufficient training data is available for a component's parameter, the anomaly detector first extracts the undertrained profile c′ from the local knowledge base Cai. Then, the global knowledge base C is queried to find a similar, previously constructed profile f(CI, c′) = c ∈ C. The well-trained profile c is then substituted for the undertrained profile c′ in the detection process.

The following sections detail how CI and C are constructed, and how f maps elements of CI and Cai to elements of C.

... First Phase: Enhanced Training

A significant refinement of the individual models described in [Kruegel et al., ] is the criterion used to determine the length of the training phase. An obvious choice is to fix a constant training length, e.g., a thousand requests. Unfortunately, an appropriate training length depends upon the complexity of modeling a given set of features. Therefore, we have developed an automated method that leverages two stop criteria, model stability and model confidence, to determine when a model has observed enough training samples to accurately approximate the normal behavior of a parameter.

Model Stability As new training samples are observed early in the training phase, the state of a model typically exhibits frequent and significant changes as its approximation of the normal behavior of a parameter is updated. Informally, in an information-theoretic sense, the average information gain of new training samples is high. As a model's state converges to a more precise approximation of normal behavior, however, its state exhibits infrequent and incremental changes. In other words, the information gain of new training samples approaches zero, and the model stabilizes. Note that we refer to the information gain as an informal and intuitive concept to explain the rationale behind the development of a sound model stability criterion. By no means do we claim that our procedure relies on the actual information gain to calculate when a model converges to stability.

Each model self-assesses its stability during the training phase by maintaining a history of snapshots of its internal state. At regular intervals, a model checks whether the sequence of deltas between successive historical states is monotonically decreasing and whether the degree of change drops below a certain threshold. If both conditions are satisfied, then the model considers itself stable. Let κ(u)stable denote the number of training samples required for a model to achieve stability. A profile is considered stable when all of its constituent models are stable.

Definition .. (Profile stability) Let κ(u)stable be the number of training samples required for a single model m(u) to achieve stability. The aggregated stability measure is defined as:

κstable = max_{u∈U} κ(u)stable.    (.)

The notion of model stability is also leveraged in the third phase, as detailed in Section ....

Note .. Instead of describing the internal stop criterion specific to each model, if any, we developed a model-agnostic minimization algorithm, detailed in Section ... (and evaluated in Section ...), that allows one to trade off detection accuracy against the number of training samples available.

Model Confidence A notion related to model stability is that of model confidence. Recall that the goal of a model is to learn an abstraction of the normal behavior of a feature from the training data. There exists a well-known tension in the learning process between accurately fitting the model to the data and generalizing to data that has not been observed in the training set. Therefore, in addition to detecting whether a model has observed enough training samples to accurately model a feature, each model incorporates a self-confidence measure z(u) that represents whether the model has generalized so much as to lose all predictive power.

For example, the confidence of a token model should be directly related to the probability that a token set is present.

Definition .. (Token model confidence) The token model confidence is defined as:

z(tok) = τ = 1/2 + (nc − nd) / (4n(n − 1)),    (.)

where nc and nd are the numbers of concordant and discordant pairs, respectively, between the unique number of observed samples and the total number of samples observed at each training step, and n is the total number of training samples.

Note that z(tok) is an adaptation of the τ coefficient [Kendall, ].

For the integer and length models, the generality of a particular model can be related to the statistical dispersion of the training set.

Definition .. (Integer model confidence) Given the observed standard deviation σ, the minimum observed value u, and the maximum observed value v, the confidence of an integer or length model is defined as:

z({int,len}) = 1 − σ/(v − u) if v − u > 0, and 1 otherwise.    (.)

The confidence of a character distribution model is determined by the variance of the observed ICDs. Therefore, the confidence of this model is given by a construction similar to the previous one, except that instead of operating over observed values, the measure operates over observed variances.

Definition .. (Char. distr. model confidence) Given the standard deviation of the observed variances σ, the minimum observed variance u, and the maximum observed variance v, the confidence of a character distribution model is:

z(char) = 1 − σ/(v − u) if v − u > 0, and 1 otherwise.    (.)

Finally, the HMM induced by the structural model can be directly analyzed for generality. We perform a rough estimate of this by computing the mean probability for any symbol from the learned alphabet to be emitted at any state.

Definition .. (Struct. model confidence) Given a structural model, i.e., an HMM specified by the tuple ⟨S, O, MS×S, P(S, O), P(S)⟩, where S is the set of states, O is the set of emissions, MS×S is the state transition probability matrix, P(S, O) is the emission probability distribution over the set of states, and P(S) is the initial probability assignment, the structural model confidence is:

z(struct) = 1 − (1 / (|S| |O|)) Σ_{i=1}^{|S|} Σ_{j=1}^{|O|} P(si, oj).    (.)
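These dispersion-based confidence measures translate almost directly into code. The sketch below assumes that the raw observations (values, ICD variances, and the HMM emission probability matrix) are available; it omits z(tok), whose concordant/discordant pair counting is model-specific.

import numpy as np

def interval_confidence(values: list[float]) -> float:
    # z^({int,len}) = 1 - sigma/(v - u), or 1 when the observed range is empty.
    u, v = min(values), max(values)
    return 1.0 if v - u == 0 else 1.0 - float(np.std(values)) / (v - u)

def char_distribution_confidence(icd_variances: list[float]) -> float:
    # z^(char): the same construction, applied to the observed ICD variances.
    return interval_confidence(icd_variances)

def structural_confidence(emissions: np.ndarray) -> float:
    # z^(struct) = 1 - mean emission probability over the |S| x |O| matrix.
    return 1.0 - float(emissions.mean())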

At the end of this phase, for each web application ai, the profiles are stored in the corresponding knowledge base Cai and are ready to be processed by the next phase.

... Second Phase: Building a global knowledge base

This phase is divided into the three sub-steps described in the following.

Well-trained profiles The construction of C begins by merging a collection of knowledge bases {Ca1, Ca2, . . . , Can} that have previously been built by a web application anomaly detector over a set of web applications ∪i ai. The profiles in C are then clustered, in order to group profiles that are semantically similar to each other. Profile clustering is performed in order to time-optimize query execution when indexing into C, as well as to validate the notion of parameter types. In this work, an agglomerative hierarchical clustering algorithm using group average linkage was applied, although the clustering stage is, in principle, agnostic as to the specific algorithm. For an in-depth discussion of clustering algorithms and techniques, we refer the reader to [Xu and Wunsch, ].

Central to any clustering algorithm is the distance function, which defines how distances between the objects to be clustered are calculated. A suitable distance function must reflect the semantics of the objects under consideration, and should satisfy two conditions:

• the overall similarity (i.e., the inverse of the distance) between elements within the same cluster is maximized, and

• the similarity between elements of different clusters is minimized.

We define the distance between two profiles to be the normalized sum of the distances between the models comprising each profile. More formally:

Definition .. (Profile distance) The distance between two profiles ci and cj is defined as:

d(ci, cj) = (1 / |ci ∩ cj|) Σ_{(m(u)i, m(u)j) ∈ ci ∩ cj} δu(m(u)i, m(u)j),    (.)

where δu : Mu × Mu → [0, 1] is the distance function defined between models of type u ∈ U = {tok, int, len, char, struct}.
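In code, the profile distance is a one-liner over the model types the two profiles share; the per-type delta functions are defined next in the text. DISTANCES, used below, is a hypothetical mapping from model type to its delta function.

def profile_distance(ci: dict, cj: dict, distances: dict) -> float:
    # Average the per-type distances over the model types common to both
    # profiles (a profile maps model type -> model, e.g. "tok" -> token set).
    common = ci.keys() & cj.keys()
    return sum(distances[u](ci[u], cj[u]) for u in common) / len(common)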

The token model m(tok) is represented as a set of unique tokens observed during the training phase. Therefore, two token models m(tok)i and m(tok)j are considered similar if they contain similar sets of tokens. More formally:

Definition .. (Token models distance) The distance function for token models is defined as the Jaccard distance [Cohen et al., ]:

δtok(m(tok)i, m(tok)j) = 1 − |m(tok)i ∩ m(tok)j| / |m(tok)i ∪ m(tok)j|.    (.)

The integer model m(int) is parametrized by the sample mean µ and variance σ² of the observed integers. Two integer models m(int)i and m(int)j are similar if these parameters are also similar.

Definition .. (Integer models distance) The distance function for integer models is defined as:

δint(m(int)i, m(int)j) = |σ²i/µ²i − σ²j/µ²j| / (σ²i/µ²i + σ²j/µ²j).    (.)

Note .. (Length models distance) As the string length model is internally identical to the integer model, its distance function δlen is defined similarly.

Recall that the character distribution model m(char) learns the frequencies of the individual characters comprising the strings observed during the training phase. These frequencies are then ranked and coalesced into n bins to create an ICD. Two character distribution models m(char)i and m(char)j are considered similar if their ICDs are similar. More formally:

Definition .. (Char. distr. models distance) The character distribution models distance is defined as:

δchar(m(char)i, m(char)j) = Σ_{l=0}^{n} |bi(l) − bj(l)| / max_{k∈{i,j}} bk(l),    (.)

where bi(l) is the value of bin l for m(char)i.

The structural model m(struct) builds an HMM by observing a sequence of character strings. The resulting HMM encodes a probabilistic grammar that can produce a superset of the strings observed during the training phase. The HMM is specified by the tuple ⟨S, O, MS×S, P(S, O), P(S)⟩. Several distance metrics have been proposed to evaluate the similarity between HMMs [Stolcke and Omohundro, a; Lyngsø et al., ; Stolcke and Omohundro, a; Juang and Rabiner, ]. Their time complexity, however, is non-negligible. Therefore, we adopt a less precise, but considerably more efficient, distance metric between two structural models m(struct)i and m(struct)j.

Definition .. (Struct. models distance) The structural models distance is defined as the Jaccard distance between their respective emission sets:

δstruct(m(struct)i, m(struct)j) = 1 − |Oi ∩ Oj| / |Oi ∪ Oj|.    (.)

Figure .: Partitioning of the training set Q(p) for various κ (κ = 1, 2, 4, 8, 16, 32, 64).
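The four delta functions can be sketched as follows, with models reduced to the statistics each distance actually needs (token and emission sets, (µ, σ²) pairs, ICD bin vectors). This is an illustration of the equations above, not the prototype's code.

def jaccard(a: set, b: set) -> float:
    return 1.0 - len(a & b) / len(a | b) if a | b else 0.0

def delta_tok(mi: set, mj: set) -> float:
    return jaccard(mi, mj)                      # token sets

def delta_int(mi: tuple, mj: tuple) -> float:   # mi = (mean, variance)
    ri, rj = mi[1] / mi[0] ** 2, mj[1] / mj[0] ** 2
    return abs(ri - rj) / (ri + rj)

delta_len = delta_int                           # length model is identical

def delta_char(bi: list, bj: list) -> float:    # ICD bin vectors
    return sum(abs(x - y) / max(x, y) for x, y in zip(bi, bj) if max(x, y) > 0)

def delta_struct(Oi: set, Oj: set) -> float:    # HMM emission sets
    return jaccard(Oi, Oj)

DISTANCES = {"tok": delta_tok, "int": delta_int, "len": delta_len,
             "char": delta_char, "struct": delta_struct}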

Undertrained profiles Once the global knowledge base C has been built, the next step is to construct a knowledge base of undertrained models, CI. This knowledge base is composed of profiles that have been deliberately undertrained on κ ≪ κstable samples for each of the web applications ai. Because each member of CI uniquely corresponds to a member of C, and undertrained profiles are built for each well-trained profile in C, a bijective mapping f′ : CI → C exists between CI and C. Therefore, when a web application parameter is identified as likely to be undertrained, the corresponding undertrained profile c′ can be compared to a similar undertrained profile in CI, which is then used to select a corresponding stable profile from C. This operation requires the knowledge base CI to be indexed. The construction of the indices relies on the notion of model stability, described in Section ....

The undertrained profiles that comprise CI are constructed using the following procedure. Let Q(p) = {q(p)1, q(p)2, . . .} denote a sequence of client requests containing parameter p for a given web application. Over this sequence of requests, profiles are deliberately undertrained on randomly sampled κ-sequences. Each of the resulting profiles is then added to the set CI. Figure . depicts this procedure for various κ. Note that the random sub-sampling is performed with the goal of inducing undertraining, to show that clustering is feasible and leads to the desired grouping even, and especially, in the presence of undertraining.
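A sketch of this sub-sampling step follows. train_profile stands for whatever training routine the detector uses and is hypothetical; we sample contiguous κ-slices here, though random subsets would serve the same purpose.

import random

def build_undertrained(requests: list, kappa: int, repetitions: int,
                       train_profile) -> list:
    # Train throwaway profiles on randomly positioned kappa-length slices
    # of the request sequence Q(p) and collect them into the index C^I.
    profiles = []
    for _ in range(repetitions):
        start = random.randrange(max(1, len(requests) - kappa + 1))
        profiles.append(train_profile(requests[start:start + kappa]))
    return profiles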

Recall that κstable is the number of training samples required for a profile to achieve stability. A profile is then considered undertrained when the number of training samples it has observed, κ, is significantly less than the number required to achieve stability. Therefore, an undertrained knowledge base CI is a set of profiles that have observed κ samples, with κ ≪ κstable.

The selection of an appropriate value for κ is central to both the efficiency and the accuracy of this process. Clearly, it is desirable to minimize κ in order to be able to index into C as quickly as possible once a parameter has been identified as subject to undertraining at runtime. On the other hand, setting κ too low is problematic, as Figure . indicates. For low values of κ, profiles are distributed with relatively high uniformity within CI, such that clusters in CI are significantly different from the clusters of well-trained profiles in C. Therefore, slight differences in the state of the individual models can cause profiles that are close in CI to map to radically different profiles in C. As κ → κstable, however, profiles tend to form meaningful clusters, and tend to approximate those found in C. Therefore, as κ increases, profiles that are close in CI become close in C under f; in other words, f becomes robust with respect to model semantics.

Note .. (Robustness) Our use of the term "robustness" is related, but not necessarily equivalent, to the definition of robustness in statistics (i.e., the property of a model to perform well even in the presence of small changes in the underlying assumptions).

Mapping undertrained profiles to well-trained profiles A principled criterion is required for balancing quick indexing against a robust profile mapping. Accordingly, we first construct a candidate knowledge base CIκ for a given κ, as in the case of the global knowledge base construction. Then, we define a robustness metric as follows. Let HI = ∪i hIi be the set of clusters in CI, and H = ∪i hi be the set of clusters in C. Let g : HI → ℕ be a mapping from an undertrained cluster to the maximum number of elements in that cluster that map to the same cluster in C. The robustness metric ρ : {CI} → [0, 1] is then defined as

ρ(CI) = (1 / |CI|) Σi g(hIi).    (.)

With this metric, an appropriate value for κ can now be chosen as

κmin = min {κ : ρ(CIκ) ≥ ρmin},    (.)

where ρmin is a minimum robustness threshold.

With the construction of a global knowledge base C and an undertrained knowledge base CI, online querying of C can then be performed.

Figure .: Procedure for building global knowledge base indices. Undertrained knowledge bases CI1, CI2, CI4, . . . , CIκmin are built from profiles trained on κ = 1, 2, 4, . . . , κmin samples, alongside the well-trained knowledge base C, for which κ = κstable ≫ κmin.
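The robustness metric and the choice of κmin can be sketched as follows. Here undertrained_clusters maps each cluster of CI to its member profiles, and well_trained_cluster_of is a hypothetical lookup returning, for an undertrained profile, the cluster of C that contains its fully trained counterpart.

from collections import Counter

def robustness(undertrained_clusters: dict, well_trained_cluster_of) -> float:
    # For each undertrained cluster h^I, g(h^I) is the size of the largest
    # group of its members whose counterparts share one cluster in C; rho
    # averages g over all profiles in C^I, yielding a value in [0, 1].
    total = sum(len(members) for members in undertrained_clusters.values())
    g_sum = sum(
        Counter(well_trained_cluster_of(p) for p in members).most_common(1)[0][1]
        for members in undertrained_clusters.values())
    return g_sum / total

def choose_kappa_min(candidate_indices: dict, rho_min: float) -> int:
    # candidate_indices: kappa -> (clusters, lookup); pick the smallest
    # kappa whose candidate index is robust enough.
    return min(k for k, (cl, f) in candidate_indices.items()
               if robustness(cl, f) >= rho_min)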

... Third Phase: Querying and Substitution

Given an undertrained profile c′ from an anomaly detector deployed over a web application ai, the mapping f : {CI} × Cai → C is defined as follows. A nearest-neighbor match is performed between c′ and the previously constructed clusters HI of CI to discover the most similar cluster of undertrained profiles. This is done to avoid a full scan of the entire knowledge base, which would be prohibitively expensive due to the cardinality of CI.

Then, using the same distance metric defined in Equation (.), a nearest-neighbor match is performed between c′ and the members of the matched cluster to discover the most similar undertrained profile cI. Finally, the global, well-trained profile f′(cI) = c is substituted for c′ for the web application ai.
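A sketch of this online lookup, reusing profile_distance and DISTANCES from the previous sections, is shown below. The medoid-based centroid and the f_prime lookup (the bijection CI → C) are illustrative assumptions.

def centroid(cluster):
    # Hypothetical cluster representative: the medoid under profile distance.
    return min(cluster, key=lambda c: sum(
        profile_distance(c, other, DISTANCES) for other in cluster))

def query_global_kb(c_prime, undertrained_clusters: list, f_prime):
    # 1) nearest undertrained cluster (avoids a full scan of C^I)...
    best = min(undertrained_clusters,
               key=lambda h: profile_distance(centroid(h), c_prime, DISTANCES))
    # 2) ...then the nearest undertrained profile c^I within it...
    c_I = min(best, key=lambda c: profile_distance(c, c_prime, DISTANCES))
    # 3) ...and substitute its well-trained counterpart f'(c^I) = c.
    return f_prime(c_I)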

To make explicit how global profiles can be used to address a scarcity of training data, consider the example introduced in Section ... Since the resource path /account/password has received only a small number of requests (see Table .: /account/password receives 0.02 of the , total requests), the profiles for each of its parameters {id, oldpwd, newpwd} are undertrained, as determined by each profile's aggregate stability measure. In the absence of a global knowledge base, the anomaly detector would provide no protection against attacks manifesting themselves in the values passed to any of these parameters.

If, however, a global knowledge base and index are available, the situation is considerably improved. Given C and CI, the anomaly detector can simply apply f to each of the undertrained parameters to find a well-trained profile from the global knowledge base that accurately models a parameter with similar characteristics from another web application. Then, these profiles can be substituted for the undertrained profiles of each of {id, oldpwd, newpwd}. As will be demonstrated in the following section, the substitution of global profiles provides an acceptable detection accuracy for what would otherwise be an unprotected component (i.e., without a global profile, none of the attacks against that component would be detected).

Note .. (HTTP parameters "semantics") It is important to underline that the goal of this approach is not to understand the types of the fields, nor their semantics. However, if the system administrator and the security officer had enough time and skills, the use of manual tools such as live debuggers or symbolic execution engines might help to infer the types of the fields; once available, this information could be exploited as a lookup criterion. Unfortunately, this would be a human-intensive task, thus far from being automatic.

Instead, our goal is to find groups of similar models, regardless of what this may imply in terms of types. According to our experiments, similar models tend to be associated with similar types. Thus, in principle, our method cannot be used to infer the parameters' types, but it can certainly give insights about parameters with similar features.

.. Experimental Results

The goal of this evaluation is three-fold. First, we investigate the effects of profile clustering, and support the hypothesis of parameter types by examining global knowledge base clusters. Then, we discuss how the quality of the mapping between undertrained profiles and well-trained profiles improves as the training slice length κ is increased. Finally, we present results regarding the accuracy of a web application anomaly detection system incorporating the application of a global knowledge base to address training data scarcity.


... Profile clustering quality

To evaluate the accuracy of the clustering phase, we first built a global knowledge base C from a collection of well-trained profiles. The profiles were trained on a subset of the data mentioned in Section ..; this subset was composed of web applications, , unique resource paths, , unique parameters, and ,, HTTP requests. The clustering algorithm described in Section ... was then applied to group profiles according to their parameter type. Sample results from this clustering are shown in Figure .b. Each leaf node corresponds to a profile and displays the parameter name and a few representative sample values of the parameter.

As the partial dendrogram indicates, the resulting clusters in C accurately group profiles by parameter type. For instance, date parameters are grouped into a single hierarchy, while unstructured text strings are grouped into a separate hierarchy.

The following experiment investigates how κ affects the quality of the final clustering.

... Profile mapping robustness

Recall that, in order to balance the robustness of the mapping f between undertrained profiles and global profiles against the speed with which undertraining can be addressed, it is necessary to select an appropriate value for κ. To this end, we generated undertrained knowledge bases for increasing values of κ = 1, 2, 4, 8, 16, 32, 64 from the same dataset used to generate C, following the procedure outlined in Section .... Partial dendrograms for various κ are presented in Figures .c, .d, and .a.

At low values of κ (e.g., Figure .c), the clustering process exhibits non-negligible systemic errors. For instance, the parameter stat clearly should be clustered as a token set of states, but is instead grouped with unstructured strings such as cities and addresses. A more accurate clustering would have dissociated the token and string profiles into well-separated sub-hierarchies.

As shown in Figure .d, larger values of κ lead to more meaningful groupings. Some inaccuracies are still noticeable, but the clustering of the sub-hierarchy is significantly better than the one obtained at κ = 8. A further improvement in the clusters is shown in Figure .a. At κ = 64, the separation between dates and unstructured strings is sharper; except for one outlier, the two types are recognized as similar and grouped together in the early stages of the clustering process.

Figure .: Clustering of C, (a) κ = 64 and (b) κstable ≃ 10³, and of CI, (c) κ = 8 and (d) κ = 32. Each leaf (a profile) is labeled with the parameter name and sample values observed during training. As κ increases, profiles are clustered more accurately.

Figure . plots the profile mapping robustness ρ(·) against κ for different cuts of the dendrogram, indicated by Dmax. Dmax is a threshold representing the maximum distance between two clusters. Basically, for low Dmax, the "cut" will generate many clusters, each with few elements; on the other hand, for high values of Dmax the algorithm will tend to form fewer clusters, each having a larger number of elements. Note that this parameter is known to have two different possible interpretations: it could indicate either a threshold on the real distance between clusters, or a "cut level" in the dendrogram constructed by the clustering algorithm. Although both are valid, we prefer the former.

Figure . shows two important properties of our technique. First, it demonstrates that the robustness is fairly insensitive to Dmax. Second, the robustness of the mapping increases with κ until it saturates at 32 ≤ κ ≤ 64. This not only confirms the soundness of the mapping function, but also provides insights on the appropriate choice of κmin to minimize the delay of global profile lookup while maximizing the robustness of the mapping.

... Detection accuracy

Having studied the effects of profile clustering and of varying values of κ upon the robustness of the profile mapping f, a separate experiment was conducted in order to evaluate the detection accuracy of a web application anomaly detector incorporating C, the global knowledge base constructed in the previous experiments. In particular, the goal of this experiment is to demonstrate that an anomaly detector equipped with a global knowledge base exhibits improved detection accuracy in the presence of training data scarcity.

The data used in this experiment was a subset of the full dataset described above, and was completely disjoint from the one used to construct the global knowledge base and its indices. It consisted of unique real-world web applications, , unique resource paths, , distinct parameters, and ,, HTTP requests.

The intended threat model is that of an attacker attempting to compromise the confidentiality or integrity of data exposed by a web application by injecting malicious code in request parameters.

Figure .: Plot of profile mapping robustness for varying κ, for Dmax ranging from 0.10 to 0.50.

Note .. (Threat model) Although the anomaly detector used in this study is capable of detecting more complex session-level anomalies, we restrict the threat model to request parameter manipulation because we do not address session profile clustering.

To establish a worst-case bound on the detection accuracy of the system, profiles for each observed request parameter were deliberately undertrained to artificially induce a scarcity of training data for all parameters. That is, for each value of κ = 1, 2, 4, 8, 16, 32, 64, the anomaly detector prematurely terminated profile training after κ samples, and then used the undertrained profiles to query C. The resulting global profiles were then substituted for the undertrained profiles and evaluated against the rest of the dataset. The sensitivity of the system was varied over the interval [0, 1], and the resulting ROC curves for each κ are plotted in Figure ..

As one can clearly see, low values of κ result in the selection of global profiles that do not accurately model the behavior of the undertrained parameters. As κ increases, however, the quality of the global profiles returned by the querying process increases as well. In particular, this increase in quality closely follows the mapping robustness plot presented in Figure .. As predicted, setting κ = 32, 64 leads to fairly accurate global profile selection, with the resulting ROC curves approaching those of fully-trained profiles. This means that even if a component of a web application has received only a few requests (i.e., ), by leveraging a global knowledge base it is possible to achieve effective attack detection. As a consequence, our approach can improve the effectiveness of real-world web application firewalls and web application anomaly detection systems.

Figure .: Global profile ROC curves (true positive rate vs. false positive rate) for varying κ. In the presence of severe undertraining (κ ≪ κstable), the system is not able to recognize most attacks and also reports several false positives. However, as κ increases, detection accuracy improves, and approaches that of the well-trained case (κ = κstable).

Clearly, the detection accuracy will improve as more training samples (e.g., ,) become available. However, the goal of this experiment was to evaluate such an improvement with a very limited training set, rather than to show the maximum detection accuracy achievable. From these results, we conclude that, for appropriate values of κ, the use of a global knowledge base can provide reasonably accurate detection performance even in the presence of training data scarcity.

One concern regarding the substitution of global profiles for local request parameters is that a global profile trained on another web application may not detect valid attacks against the undertrained parameter. Recall, however, that without this technique a learning-based web application anomaly detector would have no effective model whatsoever, and therefore the undertrained parameter would be unprotected by the detection system (i.e., zero true positive rate). Furthermore, the ROC curves demonstrate that while global profiles are in general not as precise as locally-trained models, they do provide a significant level of detection accuracy.

Note .. If global profiles were found to be as accurate as local profiles, this would constitute an argument against anomaly detection itself.

More precisely, with κ = 1 (undertraining, system off), only . of the attacks are detected overall, with around  of false positives. On the other hand, with κ = 64 (undertraining, system on), more than  of the attacks are detected with less than . of false positives (vs. . of false positives in the case of no undertraining, system off). Therefore, we conclude that, assuming no mistrust among the parties that share the global knowledge base, our approach is a useful technique to apply in the presence of undertrained models and, in general, in the case of training data scarcity. Note that the last assumption is often satisfied because one physical deployment (e.g., one server protected by our system) typically hosts several web applications under the control of the same system administrator or institution.


Note .. (Configuration parameters) Our methods provide explicit guidance, based on the training data of the particular deployment, for choosing the value of the only parameter required to trade off accuracy against the length of the training phase. In addition, the "goodness" of the approach with respect to different choices of this parameter is evaluated on real-world data in Section ..

In particular, the ROC curves show how the sampling size κ affects the detection accuracy, thus offering some guidance to the user.

In addition, Figure . shows that the system stabilizes for κ > 32, thus giving some more guidance about what constitutes a good value.


. Addressing Changes in Web Applications

In this section, we describe our proposal [Maggi et al., c] to cope with FPs due to changes in the modeled web applications. Recall that detection is performed under the assumption that attacks cause significant changes (i.e., anomalies) in the application behavior. Thus, any activity that does not fit the expected, learned models is flagged as malicious. This is true regardless of the type of system activity; thus, it holds for types of IDSs other than web-based ones.

In particular, one issue that has not been well studied is the difficulty of adapting to changes in the behavior of the protected applications. This is an important problem because today's web applications are user-centric: the demand for new services causes continuous updates to an application's logic and its interfaces.

Our analysis, described in Section .., reveals that significant changes in the behavior of web applications are frequent. We refer to this phenomenon as web application concept drift. In the context of anomaly-based detection, this means that legitimate behavior might be misclassified as an attack after an update of the application, causing the generation of false positives. Normally, whenever a new version of an application is deployed in a production environment, a coordinated effort involving application maintainers, deployment administrators, and security experts is required. That is, developers have to inform administrators about the changes that are rolled out, and the administrators have to update or re-train the anomaly models accordingly; otherwise, the amount of FPs will increase significantly. In [Maggi et al., c] we describe a solution that makes these tedious tasks unnecessary. Our technique examines the responses (HTML pages) sent by a web application. More precisely, we check the forms and links in these pages to determine when new elements are added or old ones removed. This information is leveraged to recognize when anomalous inputs (i.e., HTTP requests) are due to previous, legitimate updates (changes) in a web application. In such cases, FPs are suppressed by automatically and selectively re-training models. Moreover, when possible, model parameters can be automatically updated without requiring any re-training.
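To make the response-based change detection concrete, the following sketch extracts the interface exposed by an HTML response (link targets and form field names) with Python's standard html.parser; diffing the interfaces of successive responses reveals added or removed elements. This is a simplified illustration; an actual implementation would need to be richer (e.g., tracking forms per action URL).

from html.parser import HTMLParser

class InterfaceParser(HTMLParser):
    # Collect link targets and form field names from one HTML response.
    def __init__(self):
        super().__init__()
        self.links: set = set()
        self.fields: set = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("href"):
            self.links.add(a["href"])
        elif tag in ("input", "select", "textarea") and a.get("name"):
            self.fields.add(a["name"])

def interface_of(html: str):
    p = InterfaceParser()
    p.feed(html)
    return p.links, p.fields

old_links, old_fields = interface_of("<form><input name='datetime'/></form>")
new_links, new_fields = interface_of(
    "<form><input name='datetime'/><input name='tz'/><input name='lang'/></form>")
print(new_fields - old_fields)  # {'tz', 'lang'}: legitimate change, retrain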

Note .. (Re-training) Often, a complete re-training of all the models is expensive in terms of time; typically, it requires O(P) operations, where P represents the number of HTTP messages required to train a model. More importantly, such re-training is not always feasible, since new, attack-free training data is unlikely to be available immediately after the application has changed. In fact, to collect a sufficient amount of data, the new version of the application must be executed and real, legitimate clients have to interact with it in a controlled environment. Clearly, this task requires time and effort. More importantly, the parts of the application that have changed must be known in advance.

Our technique focuses on the fundamental problem of detecting those parts of the application that have changed and that will cause FPs if no re-training is performed. Therefore, the technique is agnostic with respect to the specific training procedure, which is IDS-specific and can be different from the one we propose.

The core of this contribution is the exploitation of HTTP responses, which we show to contain important insights that can be effectively leveraged to update previously learned models so as to take changes into account. The results of applying our technique to real-world data demonstrate that learning-based anomaly detectors can automatically adapt to changes and, by doing so, are able to reduce their FPR without decreasing their DR significantly. Note that, in general, relying on HTTP responses may lead to the issue of poisoning of the models used to characterize them; this limitation is discussed in detail in Section ..., in comparison with the advantages provided by our technique.

In Section .., the problem of concept drift in the context of web applications is detailed. In addition, we provide evidence that it occurs in practice, motivating why it is a significant problem for deploying learning-based anomaly detectors in the real world. The core of our contribution is described in Section .., where we detail the technique, based on HTTP response models, that can be used to distinguish between legitimate changes in web applications and web-based attacks. A version of webanomaly incorporating these techniques has been evaluated over an extensive real-world dataset, demonstrating its ability to deal with web application concept drift and to reliably detect attacks with a low false positive rate. The results of this evaluation are discussed in Section ...


.. Web Application Concept Drift

In this section, the notion of web application concept drift is defined. We rely upon the generalized model of learning-based anomaly detectors of web attacks described in Section ... Secondly, evidence that concept drift is a problem that exists in the real world is provided, to motivate why it should be addressed.

... Changes in Web Applications' Behavior

In machine learning, changes in the modeled behavior are known as concept drift [Schlimmer and Granger, ]. Intuitively, the concept is the modeled phenomenon; in the context of anomaly detection, it may be the structure of requests to a web server, the recurring patterns in the payload of network packets, and so forth. Thus, variations in the main features of the phenomena under consideration result in changes, or drifts, in the concept. In some sense, the concept corresponds to the normal web application behavior (see also Definition ..).

Although the generalization and abstraction capabilities of modern learning-based anomaly detectors are resilient to noise (i.e., small, legitimate variations in the modeled behavior), concept drift is difficult to detect and to cope with [Kolter and Maloof, ]. The reason is that the parameters of the models may stabilize to different values. For example, the string length model described in Section ... (and also used in webanomaly) learns the sample mean and variance of the string lengths observed during training. In webanomaly, during detection, the Chebyshev inequality is used to detect strings with lengths that significantly deviate from the mean, taking into account the observed variance. As shown in Section .., the variance allows the model to accept small differences in the lengths of the strings that will be considered normal.
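Reusing the LengthModel sketch shown earlier (Section ..), the effect of such a drift is easy to reproduce: after a legitimate update lengthens the values of a parameter (anticipating Example .. below), the old model flags the new, benign values as anomalous.

m = LengthModel()
for _ in range(1000):
    m.train("3/12/2009")             # lengths observed before the update
print(m.anomaly_score("3/12/2009"))  # 0.0: normal
print(m.anomaly_score("3/12/2009 3:00 PM GMT-08"))  # 1.0: benign, but flagged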

On the other hand, the mean and variance of the string lengths can completely change because of legitimate and permanent modifications in the web application. In this case, the normal mean and variance will stabilize, or drift, to different values. If appropriate re-training or manual updates are not performed, the model will classify benign, new strings as anomalous. These examples allow us to better define the web application concept drift.

Definition .. (Web Application Concept Drift) The web application concept drift is a permanent modification of the normal web application behavior.

Since the behavior of a web application is derived from the system activity I during normal operation, a permanent change in I causes the concept drift. Changes in web applications can manifest themselves in several ways. In the context of learning-based detection of web attacks, those changes can be categorized into three groups: request changes, session changes, and response changes.

Request changes Changes in requests occur when an application is upgraded to handle different HTTP requests. These changes can be further divided into two groups: parameter value changes and request structure changes. The former involve modifications of the actual value of the parameters, while the latter occur when parameters are added or removed. Parameter renaming is the result of removal plus addition.

Example .. (Request change) A new version of a web forum introduces internationalization (I18N) and localization (L10N). Besides handling different languages, I18N and L10N allow several types of strings to be parsed as valid dates and times. For instance, valid strings for the datetime parameter are '3 May 2009 3:00', '3/12/2009', '3/12/2009 3:00 PM GMT-08', 'now'. In the previous version, valid date-time strings had to conform to the regular expression '[0-9]{1,2}/[0-9]{2}/[0-9]{4}'. A model with good generalization properties would learn that the field datetime is composed of numbers and slashes, with no spaces. Thus, other strings such as 'now' or '3/12/2009 3:00 PM GMT-08' would be flagged as anomalous. Also, in our example, tz and lang parameters have been added to take into account time zones and languages. To summarize, the new version introduces two classes of changes: first, the parameter domain of datetime is modified; secondly, new parameters are added.
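The mismatch can be verified directly; the following snippet checks the example values against the old regular expression.

    import re

    # Regular expression accepted by the old version of the forum.
    OLD_DATETIME = re.compile(r"[0-9]{1,2}/[0-9]{2}/[0-9]{4}$")

    for value in ("3/12/2009", "now", "3/12/2009 3:00 PM GMT-08"):
        print(value, "->", bool(OLD_DATETIME.match(value)))
    # Only '3/12/2009' matches; the other (legitimate) values would be
    # flagged as anomalous by a model trained on the old format.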

Changes in HTTP requests directly affect the request models:

1. Parameter value changes affect any models that rely on the parameters' values to extract features. For instance, consider two of the models used in the system described in [Kruegel et al., ]: m(char) and m(struct). The former models the strings' character distribution by storing the frequency of all the symbols found in the strings during training, while the latter models the strings' structure as a stochastic grammar, using a HMM. In the aforementioned example, the I18N and L10N introduce new, legitimate values in the parameters; thus, the frequency of numbers in m(char) changes and new symbols (e.g., '-', '[a-zA-Z]') have to be taken into account. It is straightforward to note that m(struct) is affected in terms of new transitions introduced in the HMM by the new strings (a simplified sketch of such a character distribution model follows this list).

2. Request structure changes may affect any type of request model, regardless of the specific characteristics. For instance, if a model for a new parameter is missing, requests that contain that parameter might be flagged as anomalous.
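The sketch below illustrates the first point with a simplified character distribution model in the spirit of m(char); the scoring function here is an assumption (a plain L1 distance between distributions), as [Kruegel et al.] use a different statistical test.

    from collections import Counter

    class CharDistributionModel:
        """Simplified character distribution model: stores the frequency
        of all symbols observed during training and scores a new string
        by how far its character distribution deviates from the learned
        one."""

        def __init__(self):
            self.freq = Counter()
            self.total = 0

        def train(self, value: str) -> None:
            self.freq.update(value)
            self.total += len(value)

        def score(self, value: str) -> float:
            # L1 distance between distributions: 0 = identical, 2 = disjoint.
            # New symbols (e.g., '-' in I18N dates) increase the distance.
            observed = Counter(value)
            symbols = set(self.freq) | set(observed)
            return sum(abs(self.freq[s] / self.total - observed[s] / len(value))
                       for s in symbols)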

Session changes Changes in sessions occur whenever resource path sequences are reordered, inserted, or removed. Adding or removing application modules introduces changes in the session models. Also, modifications in the application logic are reflected in the session models as reorderings of the resources invoked.

Example .. (Session change) A new version of a web-based community software grants read-only access to anonymous users (i.e., without authentication), allowing them to display contents previously available to subscribed users only. In the old version, legitimate sequences were ⟨/site, /auth, /blog⟩ or ⟨/site, /auth, /files⟩, where /site indicates the server-side resource that handles the public site, /auth is the authentication resource, and /blog and /files were formerly private resources. Initially, the probability of observing /auth before /blog or /files is close to one (since users need to authenticate before accessing private material). This is no longer true in the new version, however, where /files, /blog, and /auth are all possible after /site.

Changes in sessions impact all models that rely on the sequence of resources that are invoked during the normal operation of an application. For instance, consider the model m(sess) described in [Kruegel et al., ], which builds a probabilistic finite state automaton that captures sequences of resource paths. New arcs must be added to take into account the changes mentioned in the above example. These types of models are sensitive to strong changes in the session structure and should be updated accordingly when they occur.

Response changes Changes in responses occur whenever an application is upgraded to produce different responses. Interface redesigns and feature addition or removal are example causes of changes in the responses. Response changes are common and frequent, since page updates or redesigns often occur in modern websites.

Example .. (Response change) A new version of a video sharing application introduces Web 2.0 features into the user interface, allowing for the modification of user interface elements without refreshing the entire page. In the old version, relatively few nodes of documents generated by the application contained client-side code. In the new version, however, many nodes of the document contain event handlers to trigger asynchronous requests to the application in response to user events. Thus, if a response model is not updated to reflect the new structure of such documents, a large number of false positives will be generated due to legitimate changes in the characteristics of the web application responses.

... Concept Drift in the Real World

To understand whether concept drift is a relevant issue for real-world websites, we performed three experiments. For the first experiment, we monitored , public websites, including the Alexa Top 's and other sites collected by querying Google with popular terms extracted from the Alexa Top 's. The goal was to identify and quantify the changes in the forms and input fields of popular websites at large. This provides an indication of the frequency with which real-world applications are updated or altered.

First Experiment: Frequency of Changes Once every hour, we visited one page for each of the , websites. In total, we collected ,, pages, comprising more than , snapshots for each website, between January and April , . One tenth of the pages were manually selected to have a significant number of forms, input fields, and hyperlinks with GET parameters. By doing this, we gathered a considerable amount of information regarding the HTTP messages generated by some applications. Examples of these pages are registration pages, data submission pages, or contact form pages. For the remaining websites, we simply used their home pages. Note that the pages we selected may not be fully representative of the whole website. To overcome this limitation and to further confirm our intuition, we performed a more detailed experiment — described in Section ... — on the source code of large, real-world web applications.

For each website w, each page sample crawled at time t is associated with a tuple ⟨|F|(w)_t, |I|(w)_t⟩, the cardinalities of the sets of forms and input fields, respectively. By doing this, we collected samples of the variables |F|(w) = |F|(w)_t1, . . . , |F|(w)_tn and |I|(w) = |I|(w)_t1, . . . , |I|(w)_tn, with 0 < n < 1,390. Figure . shows the relative frequency of the variables

XI = stdev(|I|(w1)), . . . , stdev(|I|(wk)),
XF = stdev(|F|(w1)), . . . , stdev(|F|(wk)).

This demonstrates that a significant amount of websites exhibit variability in the response models, in terms of elements modified in the pages, as well as in the request models, in terms of new forms and parameters. In addition, we estimated the expected time between changes of forms and input fields, E[TF] and E[TI], respectively. In terms of forms, . of the websites drifted during the observation period. More precisely, out of , websites have a finite E[TF]. Similarly, . of the websites exhibited drifts in the number of input fields, i.e., E[TI] < +∞ for websites. Figure . shows the relative frequency of (b) E[TF] and (d) E[TI]. This confirms that a non-negligible portion of the websites exhibits significantly frequent changes in the responses.
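The per-website statistics above reduce to a few lines of computation; the following sketch, over hypothetical hourly form counts, shows how stdev(|F|) and the mean time between changes could be derived.

    import statistics

    def drift_stats(samples, period_hours=1.0):
        """samples: hourly counts of forms |F| (or input fields |I|) for
        one website. Returns (stdev, mean time between changes in hours);
        the latter is None when the count never changes (E[T] = +inf)."""
        sd = statistics.pstdev(samples)
        change_times = [i * period_hours for i in range(1, len(samples))
                        if samples[i] != samples[i - 1]]
        if len(change_times) < 2:
            return sd, None
        gaps = [b - a for a, b in zip(change_times, change_times[1:])]
        return sd, statistics.mean(gaps)

    # Hypothetical samples: the number of forms changes twice over 8 hours.
    print(drift_stats([3, 3, 4, 4, 4, 3, 3, 3]))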

Second Experiment: Type of Changes We monitored in depth three large, data-centric web applications over several months: Yahoo! Mail, YouTube, and MySpace. We dumped HTTP responses captured by emulating user interaction using a custom, scriptable web browser implemented with HtmlUnit. Examples of these interactions are as follows: visit the home page, login, browse the inbox, send messages, return to the home page, click links, log out.


Figure .: Relative frequency of the standard deviation of the number of forms (a) and input fields (c), and the distribution of the expected time between changes of forms (b) and input fields (d). A non-negligible portion of the websites exhibits changes in the responses. No differences have been noticed between home pages and manually-selected pages.

Manual inspection revealed some major changes in Yahoo! Mail. For instance, the most evident change consisted of a set of new features added to the search engine (e.g., local search, refined address field in maps search), which manifested themselves as new parameters found in the web search page (e.g., to take into account the country or the ZIP code). User pages of YouTube were significantly updated with new functionalities between and .


For instance, the new version allows users to rearrange widgets in their personal pages. To account for the position of each element, new parameters are added to the profile pages and submitted asynchronously whenever the user drags widgets within the layout. The analysis on MySpace did not reveal any significant change. The results of these two experiments show that changes in server-side applications are common. More importantly, these modifications often involve the way user data is represented, handled, and manipulated.

Third Experiment: Abundance of Code Change We analyzed changes in the requests and sessions by inspecting the code repositories of three of the largest, most popular open-source web applications: WordPress, MovableType, and PhpBB. The goal was to understand whether upgrading a web application to a newer release results in significant changes in the features that are used to determine its behavior. In this analysis, we examined changes in the source code that affect the manipulation of HTTP responses, requests, and session data. We used StatSVN, an open-source tool for tracking and visualizing the activity of SubVersioN (SVN) repositories (e.g., the number of lines changed or the most active developers). We modified StatSVN to incorporate a set of heuristics to compute approximate counts of the lines of code that, directly or indirectly, manipulate HTTP session, request, or response data. In the case of PHP Hypertext Preprocessor (PHP), examples representative of such lines include, but are not limited to, $_REQUEST|$_SESSION|$_POST|$_GET|session_[a-z]+|http_|strip_tags|addslashes. In order to take into account data manipulation performed through library functions (e.g., WordPress' custom Http class), we also generated application-specific code patterns by manually inspecting and filtering the core libraries. Figure . shows, over time, the lines of code in the repositories of PhpBB, WordPress, and MovableType that manipulate HTTP responses, requests, and sessions. These results show the presence of significant modifications in the web application in terms of relevant lines of code added or removed. More importantly, such modifications affect the way HTTP data is manipulated and, thus, impact request, response, or session models.
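A minimal version of this heuristic is sketched below; the regular expression mirrors the PHP patterns listed above, while the SVN-walking and per-revision bookkeeping of the modified StatSVN are omitted.

    import re

    # PHP constructs that, directly or indirectly, manipulate HTTP request,
    # response, or session data; application-specific patterns (e.g., for
    # WordPress' Http class) would be appended to this expression.
    HTTP_DATA = re.compile(
        r"\$_(REQUEST|SESSION|POST|GET)\b|session_[a-z]+|http_"
        r"|strip_tags|addslashes")

    def count_http_loc(php_source: str) -> int:
        """Approximate count of lines of code that touch HTTP data."""
        return sum(1 for line in php_source.splitlines()
                   if HTTP_DATA.search(line))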

Figure .: Lines of code in the repositories of PhpBB, WordPress, and MovableType, over time. Counts include only the code that manipulates HTTP responses, requests, and sessions.

The aforementioned experiments confirm that the class of changes we described in Section ... is common in real-world web applications. Therefore, anomaly detectors for web applications must incorporate procedures to prevent false alerts due to concept drift. In particular, a mechanism is needed to discriminate between legitimate and malicious changes, and to respond accordingly. Note that this somehow coincides with the original definition of ID (i.e., distinguishing between normal and malicious activity); however, our focus is to recognize changes in the application activity that could lead to concept drift, as opposed to spotting changes in the application behavior (i.e., ID).

.. Addressing concept drift

In this section, a technique is presented to distinguish between legitimate changes in web application behavior and evidence of malicious behavior.

... Exploiting HTTP responses

An HTTP response is composed of a header and a body. Under the hypothesis of content-type text/html, the body of HTTP responses contains a set of links Li and forms Fi that refer to a set of target resources. Each form also includes a set of input fields Ii. In addition, each link li,j ∈ Li and form fi,j ∈ Fi has an associated set of parameters.

A request qi to a resource ri returns a response respi. From respi the client follows a link li,j or submits a form fi,j. Either of these actions generates a new HTTP request to the web application with a set of parameter key-value pairs, resulting in the return of a new HTTP response to the client, ri+1, the body of which contains a set of links Li+1 and forms Fi+1. According to the specifications of HTTP, this process continues until the session has ended (i.e., either the user has explicitly logged out, or a timeout has occurred). We then define:

Definition .. (Candidate Resource Set) Given a resource r within a web application, the candidate resource set of r is defined as:


candidates(r) := Lr ∪ Fr = {l1, l2, . . . , lN} ∪ {f1, f2, . . . , fM},

where:

• l(·) := resp.a(·).href,

• f(·) := resp.form(·).action,

• resp.body is the plain text body of the response resp,

• resp.<element>(·).<attribute> is the content of the attribute <attribute> of the HTML <element>(·).

Our key observation is that, at each step of a web application session, a subset of the potential target resources is given exactly by the content of the current resource. That is, given ri, the associated sets of links Li and forms Fi directly encode a significant subset of the possible ri+1. Furthermore, each link li,j and form fi,j indicates a precise set of expected parameters and, in some cases, the set of legitimate values for those parameters that can be provided by a client.

Consider a hypothetical banking web application, where the current resource ri = /account presented to a client is an account overview. The response respi may contain:

<body class="account">
  /* ... */
  <form action="/search" method="get">
    <input name="term" type="text" value="Insert term" />
    <input type="submit" value="Search!" />
  </form>
  <h1><a href="/account/transfer">Transfer funds</a></h1>
  <a href="/account/transfer/submit">Submit transfer</a>
  /* ... */
  <tr><td><a href="/account/history?aid=328849660322">See details</a></td></tr>
  <tr><td><a href="/account/history?aid=446825759916">See details</a></td></tr>
  /* ... */
  <a href="/logout">Logout</a>
  /* ... */
  <h2>Feedback on this page?</h2>
  <form action="/feedback" method="post" class="feedback-form">
    <select name="subject">
      <option>General</option>
      <option>User interface</option>
      <option>Functionality</option>
    </select>
    <textarea name="message" />
    <input type="submit" />
  </form>
  /* ... */
</body>

containing a set of links Li, represented by their URLs, for instance:

Li = { /account/history?aid=328849660322,
       /account/history?aid=446825759916,
       /account/transfer/submit,
       /account/transfer,
       /logout }.

Forms are represented by their target action. For instance: Fi = {/feedback, /search}.

From Li and Fi, we can deduce the set of legal candidate resources for the next request ri+1. Any other resource would, by definition, be a deviation from a legal session flow through the web application as specified by the application itself. For instance, it would not be expected behavior for a client to directly access /account/transfer/submit (i.e., a resource intended to submit an account funds transfer) from ri. Furthermore, for the resource /account/history, it is clear that the web application expects to receive a single parameter aid with an account number as an identifier.

In the case of the form with target /feedback, let the associated input elements be:

<select name="subject">
  <option>General</option>
  <option>User interface</option>
  <option>Functionality</option>
</select>
<textarea name="message" />

It immediately follows that any invocation of the /feedback resource from ri should include the parameters subject and message. In addition, the legal set of values for the parameter subject is given by enumerating the enclosed <option /> tags. Valid values for the new tz and datetime parameters mentioned in the example of Section ... can be inferred using the same algorithm. Any deviation from these specifications could be considered evidence of malicious behavior.

In this section we described why responses generated by a web application constitute a specification of the intended behavior of clients and the expected inputs to an application's resources. As a consequence, when a change occurs in the interface presented by a web application, this will be reflected in the content of its responses. Therefore, as detailed in the following section, an anomaly detection system can take advantage of response modeling to detect and adapt to changes in monitored web applications.

... Adaptive response modeling

In order to detect changes in web application interfaces, the response modeling of webanomaly has been augmented with the ability to build Li and Fi from the HTML documents returned to a client. The approach is divided into two phases.

Extraction and parsing The anomaly detector parses each HTML document contained in a response issued by the web application to a client. For each <a /> tag encountered, the content of the href attribute is extracted and analyzed. The link is decomposed into tokens representing the protocol (e.g., http, https, javascript, mailto), target host, port, path, parameter sequence, and anchor. Paths are subject to additional processing; for instance, relative paths are normalized to obtain a canonical representation. This information is stored as part of an abstract document model for later processing.

A similar process occurs for forms. When a <form /> tag is encountered, the action attribute is extracted and analyzed as in the case of the link href attribute. Furthermore, any <input />, <textarea />, or <select /> and <option /> tags enclosed by a particular <form /> tag are parsed as parameters to the corresponding form invocation. For <input /> tags, the type, name, and value attributes are extracted. For <textarea /> tags, the name attribute is extracted. Finally, for <select /> tags, the name attribute is extracted, as well as the content of any enclosed <option /> tags. The target of the form and its parameters are recorded in the abstract document model as in the case for links.
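A stripped-down version of this extraction phase could look as follows, using Python's standard html.parser and urllib.parse; path normalization and the abstract document model are omitted, and <option /> contents are not captured, to keep the sketch short.

    from html.parser import HTMLParser
    from urllib.parse import urlsplit, parse_qsl

    class CandidateExtractor(HTMLParser):
        """Collects link targets (L_i) and form targets with their
        parameter names (F_i) from an HTML response body."""

        def __init__(self):
            super().__init__()
            self.links, self.forms = [], []
            self._form = None

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "a" and "href" in attrs:
                url = urlsplit(attrs["href"])
                # Decompose into protocol, host, port, path, and parameters.
                self.links.append({"scheme": url.scheme, "host": url.hostname,
                                   "port": url.port, "path": url.path,
                                   "params": dict(parse_qsl(url.query))})
            elif tag == "form":
                self._form = {"action": attrs.get("action"), "params": []}
                self.forms.append(self._form)
            elif tag in ("input", "textarea", "select") and self._form is not None:
                if "name" in attrs:
                    self._form["params"].append(attrs["name"])

        def handle_endtag(self, tag):
            if tag == "form":
                self._form = None

    extractor = CandidateExtractor()
    extractor.feed('<a href="/account/history?aid=328849660322">See details</a>')
    print(extractor.links)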

Analysis and modeling The set of links and forms contained in a response is processed by the anomaly engine. For each link and form, the corresponding target resource is compared to the existing known set of resources. If the resource has not been observed before, a new model is created for that resource. The session model is also updated to account for a potential transition from the resource associated with the parsed document to the target resource by training on the observed session request sequence.

For each of the parameters parsed from links or forms contained in a response, a comparison with the existing set of known parameters is performed. If a parameter has not already been observed (e.g., the new tz parameter), a profile is created and associated with the target resource model.

Any values contained in the response for a given parameter are processed as training samples for the associated models. In cases where the total set of legal parameter values is specified (e.g., <select /> and <option /> tags), the parameter profile is updated to reflect this. Otherwise, the profile is trained on subsequent requests to the associated resource.
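The analysis phase can be summarized by the following sketch; profiles and session_model are hypothetical stand-ins for the detector's internal structures, with the session model reduced to a set of allowed transitions.

    def update_models(current_resource, targets, profiles, session_model):
        """targets: (path, params) pairs parsed from a response, where
        params maps parameter names to a statically known value or None.
        profiles: {resource: {parameter: training samples}};
        session_model: set of allowed (resource, resource) transitions."""
        for path, params in targets:
            # Unseen resource: create a new model for it.
            profiles.setdefault(path, {})
            # Account for the potential transition to the target resource.
            session_model.add((current_resource, path))
            for name, value in params.items():
                # Unseen parameter (e.g., the new tz parameter): new profile.
                samples = profiles[path].setdefault(name, [])
                if value is not None:
                    # Values statically contained in the response (e.g.,
                    # <option> contents) are processed as training samples.
                    samples.append(value)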

As a result of this analysis, the anomaly detector is able to adapt to changes in session structure resulting from the introduction of new resources. In addition, the anomaly detector is able to adapt to changes in request structure resulting from the introduction of new parameters and, in a limited sense, to changes in parameter values.

Figure .: A representation of the interaction between the client and the web application server, monitored by a learning-based anomaly detector. After request qi is processed, the corresponding response respi is intercepted and links Li and forms Fi are parsed to update the request models. This knowledge is exploited as a change detection criterion for the subsequent request qi+1.

... Advantages and limitations

Due to the response modeling algorithm described in the previous section, an anomaly detector is able to automatically adapt to many common changes observed in web applications as modifications are made to the interface presented to clients. Both changes in session and request structure, such as those described in Section ..., can be accounted for in an automated fashion.

Example .. (I18N and L10N) The aforementioned modification is correctly handled, as it consists of an addition of the tz parameter and a modification of the datetime parameter.

Furthermore, we claim that web application anomaly detectors that do not perform response modeling cannot reliably distinguish between anomalies caused by legitimate changes in web applications and those caused by malicious behavior. Therefore, as will be shown in Section .., any such detector that solely monitors requests is more prone to false positives in the real world.

Clearly, the technique relies upon the assumption that the web application has not been compromised. Since the web application, and in particular the documents it generates, is treated as an oracle for whether a change has occurred, if an attacker were to compromise the application in order to introduce a malicious change, the malicious behavior would be learned as normal by the detector. Of course, in this case, the attacker would already have access to the web application. However, modern anomaly detectors like webanomaly observe all requests and responses to and from untrusted clients; therefore, any attack that would compromise response modeling would be detected and blocked.

Example .. An attacker could attempt to evade the anomaly detector by introducing a malicious change in the HTTP responses and then exploiting the change detection technique, which would interpret the new malicious request as a legitimate change.

For instance, the attacker could incorporate a link that contains a parameter used to inject the attack vector. To this end, the attacker would have to gain control of the server by leveraging an existing vulnerability of the web application (e.g., a buffer overflow, a SQL injection). However, the HTTP requests used by the attacker to exploit the vulnerability will trigger several models (e.g., the string length model, in the case of a buffer overflow) and, thus, will be flagged as anomalous.

In fact, our technique does not alter the ability of the anomaly detector to detect attacks. On the other hand, it avoids many false positives, as demonstrated in Section ....

Besides the aforementioned assumption, three limitations are important to note.

• The set of target resources may not always be statically derivable from a given resource. For instance, this can occur when client-side scripts are used to dynamically generate page content, including links and forms. Accounting for dynamic behavior would require the inclusion of script interpretation. This, however, has a high overhead, is complex to perform accurately, and introduces the potential for denial of service attacks against the anomaly detection system. For these reasons, such a component is not implemented in the current version of webanomaly, although further research is planned to deal with dynamic behavior.

The threat model assumes that the attacker can interact with the web application only by sending HTTP requests.


• The technique does not fully address changes in the behavior of individual request parameters in its current form. In cases where legitimate parameter values are statically encoded as part of an HTML document, response modeling can directly account for changes in the legal set of parameter values. Unfortunately, in the absence of any other discernible changes in the response, changes in parameter values provided by clients cannot be detected. However, heuristics such as detecting when all clients switch to a new observable behavior in parameter values (i.e., all clients generate anomalies against a set of models in a similar way) could serve as an indication that a change in legitimate parameter behavior has occurred.

• The technique cannot handle the case where a resource is the result of a parametrized query and the previous response has not been observed by the anomaly detector. In our experience, however, this does not occur frequently in practice, especially for sensitive resources.

.. Experimental Results

In this section, we show that our techniques reliably distinguish between legitimate changes and evidence of malicious behavior, and present the resulting improvement in terms of detection accuracy.

The goal of this evaluation is twofold. We first show that concept drift in modeled behavior caused by changes in web applications results in lower detection accuracy. Second, we demonstrate that our technique based on HTTP responses effectively mitigates the effects of concept drift. To this end, we adopted the training dataset described in Section ...

... Effects of concept drift

In the first experiment, we demonstrate that concept drift as observed in real-world web applications results in a significant negative impact on false positive rates.

1. webanomaly was trained on an unmodified, filtered data set. Then, the detector analyzed a test data set Q to obtain a baseline ROC curve.


2. After the baseline curve had been obtained, the test data set was processed to introduce new behaviors corresponding to the effects of web application changes, such as upgrades or source code refactoring, obtaining Qdrift. In this manner, the set of changes in web application behavior was explicitly known. In particular, as detailed in Table .:

sessions: , new session flows were created by introducing requests for new resources and creating request sequences for both new and known resources that had not previously been observed;

parameters: new parameter sets were created by introducing , new parameters to existing requests;

values: the behavior of modeled features of parameter values was changed by introducing , mutations of observed values in client requests.

For example, each sequence of resources ⟨/login, /index, /article⟩ might be transformed to ⟨/login, /article⟩. Similarly, each request like /categories found in the traffic might be replaced with /foobar. For new parameters, a set of link or form parameters might be updated by changing a parameter name and updating requests accordingly. It must be noted that, in all cases, responses generated by the web application were modified to reflect changes in client behavior. To this end, references to new resources were inserted in documents generated by the web application, and both links and forms contained in documents were updated to reflect new parameters.

3. webanomaly, without the HTTP response modeling technique enabled, was then run over Qdrift to determine the effects of concept drift upon detector accuracy.


Figure .: Detection and false positive rates measured on Q and Qdrift, with HTTP response modeling disabled in (a) and enabled in (b).

The resulting ROC curves are shown in Figure .a. The consequences of web application change are clearly reflected in the increase in false positive rate for Qdrift versus that for Q. Each new session flow and parameter manifests as an alert, since the detector is unable to distinguish between anomalies due to malicious behavior and those due to legitimate change in the web application.

... Change detection

The second experiment quantifies the improvement in the detection accuracy of webanomaly in the presence of web application change. As before, the detector was trained over an unmodified filtered data set, and the resulting profiles were evaluated over both Q and Qdrift. In this experiment, however, the HTTP response modeling technique was enabled.

Figure .b presents the results of analyzing HTTP responses on detection accuracy. Since many changes in the behavior of the web application and its clients can be discovered using our response modeling technique, the false positive rate for Qdrift is greatly reduced over that shown in Figure .a, and approaches that of Q, where no changes have been introduced. The small observed increase in false positive rate can be attributed to the effects of changes in parameter values. This occurs because a change has been introduced into a parameter value submitted by a client to the web application, and no indication of this change was detected on the preceding document returned to the client (e.g., because no <select /> tags were found).

Change                 Anomalies    FPs    FP reduction
New session flows      ,                   .
New parameters         ,                   .
Modified parameters    ,            ,      .
Total                  ,            ,      .

Table .: Reduction in the false positive rate due to HTTP response modeling for various types of changes.

Table . displays the individual contributions to the reduction of the false positive rate due to the response modeling technique. Specifically, the total number of anomalies caused by each type of change, the number of anomalies erroneously reported as alerts, and the corresponding reduction in the false positive rate are shown. The results displayed were generated from a run using the optimal operating point (., .) indicated by the knee of the ROC curve in Figure .b. For changes in session flows and parameter sets, the detector was able to identify an anomaly as being caused by a change in web application behavior in all cases. This resulted in a large net decrease in the false positive rate of the detector with response modeling enabled. The modification of parameters is more problematic, though; as discussed in Section ..., it is not always apparent that a change has occurred when that change is limited to the type of behavior a parameter's value exhibits.

From the overall improvement in false positive rates, we conclude that HTTP response modeling is an effective technique for distinguishing between anomalies due to legitimate changes in web applications and those caused by malicious behavior. Furthermore, any anomaly detector that does not do so is prone to generating a large number of false positives when changes do occur in the modeled application. Finally, as has been shown in Section .., web applications exhibit significant long-term change in practice, and, therefore, concept drift is a critical aspect of web application anomaly detection that must be addressed.


. Concluding Remarks

In this chapter we described in detail two approaches to mitigate training issues. In particular, we presented a technique to reduce FPs due to scarce training data and a mechanism to automatically recognize changes in the monitored applications, such as code updates, and re-train the models without any human intervention.

First, we have described our efforts to cope with an issue that must be addressed by web application anomaly detection systems in order to provide a reasonable level of security. The impact of this issue is particularly relevant for commercial web-based anomaly detection systems and web application firewalls, which have to operate in real-world environments where sufficient training data might be unavailable within a reasonable time frame.

We have proposed the use of global knowledge bases of well-trained, stable profiles to remediate a local scarcity of training data by exploiting global similarities in web application parameters. We found that although using global profiles does result in a small reduction in detection accuracy, the resulting system, when given appropriate parameters, does provide reasonably precise modeling of otherwise unprotected web application parameters.

A possible future extension of the described approach is to investigate the use of other types of models, particularly HTTP response and session models. An additional line of future work is the application of different clustering methods in order to improve the efficiency of the querying procedure for high-cardinality knowledge bases.

Secondly, we have identified the natural dynamicity of web applications as an issue that must be addressed by modern anomaly-based web application detectors in order to prevent increases in the false positive rate whenever the monitored web application is changed. We named this frequent phenomenon the web application concept drift.

In [Maggi et al., c] we proposed the use of novel HTTP response modeling techniques to discriminate between legitimate changes and anomalous behaviors in web applications. More precisely, responses are analyzed to find new and previously unmodeled parameters. This information is extracted from anchor and form elements, and then leveraged to update request and session models. We have evaluated the effectiveness of our approach over an extensive real-world data set of web application traffic. The results show that the resulting system can detect anomalies and avoid false alerts in the presence of concept drift.

We plan to investigate the potential benefits of modeling the behavior of JavaScript code, which is becoming increasingly prevalent in modern web applications. Also, additional, richer, and media-dependent response models must be studied to account for interactive client-side components, such as Adobe Flash and Microsoft Silverlight applications.

Network and Host Alert Correlation

Either as part of an IDS or as a post-processing step, the alert correlation task is meant to recognize relations among alerts. For instance, a simple correlation task is to suppress all the alerts regarding unsuccessful attacks. A more complex example envisions a large enterprise network monitored by multiple IDSs, both host- and network-based. Whenever a misbehaving application is detected on a certain machine, the HIDS interacts with the NIDS that is monitoring the network segment of that machine to, say, validate the actual detection result. A more formal definition of alert correlation and alert correlation systems can be found in Section ...

In this section we focus on practical aspects of alert correlation by first presenting the model we used in our prototype tools. Based on this model we concentrate on two challenging phases, namely aggregation, which has to do with grouping similar alerts together, and correlation, which is the actual correlation phase. In particular, in Section . we investigate the use of fuzzy metrics for time-based aggregation, which has been proposed recently as a simple but effective aggregation criterion. Unfortunately, given a threshold of, say, 10s, this approach will fail if two alerts are, say, 10.09s apart. In Section . we focus on the actual correlation task. We exploit non-parametric statistical tests to create a simple but robust correlation system that is capable of recognizing series of related events without relying on previous knowledge. Last, in Section . we propose a methodology to compare alert correlation systems, which are as difficult as IDSs to evaluate.


Figure .: Simplified version of the correlation approach proposed in [Valeur et al., ]: alerts A1, . . . , An produced by the IDSs go through Normalization, Prioritization, Aggregation, Correlation, and Verification phases, resulting in the reduced alert stream A′.

. Fuzzy Models and Measures for Alert Fusion

In this section we focus on the aggregation of IDS alerts, an important component of the alert correlation process, as shown in Figure .. We describe our approach [Maggi et al., b], which exploits fuzzy measures and fuzzy sets to design simple and robust alert aggregation algorithms. Exploiting fuzzy sets, we are able to robustly state whether or not two alerts are "close in time", dealing with noisy and delayed detections. A performance metric for the evaluation of fusion systems is also proposed. Finally, we evaluate the fusion method with alert streams from anomaly-based IDSs.

Alert aggregation becomes more complex when taking into account anomaly detection systems, because no information on the type or classification of the observed attack is available to any of the fusion algorithms. Most of the algorithms proposed in the current literature on correlation make use of the information regarding the matching attack provided by misuse detectors; therefore, such methods are inapplicable to purely anomaly-based intrusion detection systems. Although some approaches, e.g., [Portokalidis and Bos, ], incorporate techniques to correlate automatically-generated signatures, such techniques are aimed at creating more general signatures in order to augment the DR. Instead, our focus is that of reducing the FPR by fusing more information sources. However, as described in Section ., since anomaly and misuse detection are symmetric, it is reasonable and noteworthy to try to integrate different approaches through an alert fusion process.


Toward such a goal, we explore the use of fuzzy measures [Wang and Klir, ] and fuzzy sets [Klir and Folger, ] to design simple, but robust aggregation algorithms. In particular, we contribute to one of the key issues, that is, how to state whether or not two alerts are "close in time". In addition, uncertainties on both timestamp measurements and threshold setting make this process even more difficult; the use of fuzzy sets allows us to precisely define a time-distance criterion which "embeds" unavoidable errors (e.g., delayed detections).

.. Time-based alert correlation

As proposed in most of the reviewed literature, a first, naive approach consists in exploiting the time distance between alerts for aggregation; the idea is to aggregate "near" alerts. In this section we focus on this point, starting from the definition of "near".

Time-distance aggregation might be able to catch simple scenarios like remote attacks against remote application vulnerabilities (e.g., web servers). For instance, consider the scenario where, at time t0, an attacker violates a host by exploiting a vulnerability of a server application. An IDS monitoring the system recognizes anomalous activity at time tn = t0 + τn. Meanwhile, the attacker might escalate privileges and break through another application; the IDS would detect another anomaly at th = t0 + τh. In the most general case tn is "close" to th (with tn < th), so if th − tn ≤ Tnear the two alerts belong to the same attack.

The idea is to fuse alerts if they are both close in time, raised from the same IDS, and refer to the same source and destination. This intuition obviously needs to be detailed. First, the concept of "near" is not precisely defined; secondly, errors in timestamping are not taken into account; and, finally, a crisp time distance measure is not robust. For instance, if |th − tn| = 2.451 and Tnear = 2.450 the two alerts are obviously near, but not aggregated, because the above condition is not satisfied. To overcome such limitations, we propose a fuzzy approach to time-based aggregation.
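In code, the naive criterion is a one-liner, which makes its brittleness evident (the threshold value is the one from the example above):

    def crisp_near(t_n: float, t_h: float, t_near: float = 2.450) -> bool:
        """Naive crisp criterion: aggregate iff the time distance does
        not exceed T_near; a distance of 2.451 barely misses the cut."""
        return abs(t_h - t_n) <= t_near

    print(crisp_near(0.0, 2.451))  # False, despite the alerts being obviously near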

In the following, we will use the well-known dot notation, as in object-oriented programming languages, to access a specific alert attribute: e.g., a.start_ts indicates the value of the attribute start_ts of the alert a. Moreover, we will use T(·) to indicate a threshold variable: for instance, Tnear is the threshold variable called (or regarding to) "near".


Figure .: Comparison of crisp (a) and fuzzy (b) time-windows. In both graphs, one alert is fired at t = 0 and another alert occurs at t = 0.6. Using a crisp time-window and instantaneous alerts (a), the distance measurement is robust neither to delays nor to erroneous settings of the time-window size. Using fuzzy-shaped functions (b) provides more robustness and allows capturing the concept of "closeness", as implemented with the T-norm depicted in (b). Distances in time are normalized in [0,1] (w.r.t. the origin).


We rely on fuzzy sets for modeling both the uncertainty on the timestamps of alerts and the time distance in a robust manner. Regarding the uncertainty on measurements, we focus on delayed detections by using triangle-shaped fuzzy sets to model the occurrence of an alert. Since the measured timestamp may be affected by errors or delays, we extend the singleton shown in Figure . (a) with a triangle, as depicted in Figure . (b). We also take into account uncertainty on the dimension of the aggregation window: instead of using a crisp window (as in Figure . (a)), we extend it to a trapezoidal fuzzy set, resulting in a more robust distance measurement. In both graphs, one alert is fired at t = 0 and another alert occurs at t = 0.6. Using a crisp time-window and instantaneous alerts (Figure . (a)), the distance measurement is robust neither to delays nor to erroneous settings of the time-window size. Using fuzzy-shaped functions (Figure . (b)) provides more robustness and allows capturing the concept of "closeness", as implemented with the T-norm depicted in Figure . (b). In Figure . distances in time are normalized in [0,1] (w.r.t. the origin).

Note that Figure . compares two possible manners to model uncertainty on alert timestamps: in Figure . (a) the alert is recorded at . seconds but the measurement may have both positive (in the future) and negative (in the past) errors. Figure . (b) is more realistic, because positive errors are not likely to happen (i.e., we cannot "anticipate" detections), while events are often delayed, especially in network environments. In comparison to our proposal of using "fuzzy timestamps", the IDMEF describes event occurrences using three timestamps (Create-, Detect-, and Analyzer-Time): this is obviously more generic and allows the reconstruction of the entire event path from the attack to the analyzer that reports the alert. However, all timestamps are not always known: for example, the IDS might not have such a feature, thus the IDMEF Alerts cannot be completely filled.

As stated above, the main goal is to measure the time distance between two alerts, a1 and a2 (note that, in the following, a2 occurs after a1). We first exemplify the case of instantaneous alerts, that is, ai.start_ts = ai.end_ts = ai.ts for i ∈ {1, 2}. To state whether or not a1 is close to a2 we use a Tnear-sized time window: in other words, an interval spreading from a1.ts to a1.ts + Tnear (Figure . (a) shows the described situation for a1.ts = 0, Tnear = 0.4, and a2.ts = 0.6: values have been normalized to place the a1 alert in the origin).

Figure .: Comparing two possible models of uncertainty on timestamps of single alerts.

Extending the example shown in Figure . (a) to uncertainty in measurements is straightforward; suppose we have an average uncertainty of . seconds on the measured (normalized w.r.t. the origin) value of a2.ts: we model it as a triangle-shaped fuzzy set like the one drawn in Figure . (b).

In the second place, our method takes into account uncertainty regarding the thresholding of the distance between two events, modeled in Figure . (b) by the trapezoidal fuzzy window: the smoothing factor, . seconds, represents potential errors (i.e., the time values for which the membership function is neither 0 nor 1). Given these premises, the fuzzy variable "near" is defined by a T-norm [Klir and Folger, ] as shown in Figure . (b): the resulting triangle represents the alerts' overlapping in time. In the example we used min(·) as the T-norm.
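A minimal numeric sketch of this construction follows; the shapes and smoothing factors mirror the illustrative values used in the figures and are configuration choices, not fixed by the method.

    def triangle(x, peak, spread):
        """Membership of a (possibly delayed) alert detected around `peak`."""
        return max(0.0, 1.0 - abs(x - peak) / spread)

    def trapezoid(x, full_until, zero_at):
        """Fuzzy window anchored at the first alert (t = 0): membership
        is 1 up to `full_until`, then decreases linearly to 0 at `zero_at`."""
        if x <= full_until:
            return 1.0
        if x >= zero_at:
            return 0.0
        return (zero_at - x) / (zero_at - full_until)

    def closeness(t2, full_until=0.4, zero_at=0.7, spread=0.2):
        """Degree to which an alert at normalized time t2 is 'near' an
        alert fired at t = 0: the peak of the min(.) T-norm of the two
        fuzzy sets, evaluated over a grid."""
        grid = (i / 1000.0 for i in range(1001))
        return max(min(triangle(x, t2, spread),
                       trapezoid(x, full_until, zero_at)) for x in grid)

    print(closeness(0.6))  # > 0: the fuzzy sets overlap, the alerts are 'near'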

In the above examples, simple triangles and trapezoids have been used, but more accurate/complex membership functions could be used as well. However, we remark here that trapezoid-like sets are conceptually different from triangles, as the former have a membership value of 1 over an interval; this means certainty about the observed/measured phenomenon (i.e., closeness), while lower values mean uncertainty. Trapezoid-like functions should be used whenever a 100%-certainty interval is known: for instance, in Figure . (b), if two alerts occur within . seconds they are near at 100%; between . and . seconds, such certainty decreases accordingly.

Note .. (Extended crisp window) It is important to underline that, from a purely practical point of view, one may achieve the same effect on alert fusion by building a more "flexible" crisp time window (e.g., by increasing its size by, say, ). However, from a theoretical point of view, it must be noted that our approach explicitly models the errors that occur during the alert generation process. On the other hand, increasing a time window by a certain percentage would only give more flexibility, without giving any semantic definition of "near alerts". There is a profound difference between saying " larger" and assigning a precise membership function to a certain variable and defining T-norms for aggregation.


Figure .: Non-zero-length alert: uncertainty on the measured timestamps is modeled (real start timestamp vs. measured start and end timestamps).

The approach can be easily generalized to take into account non-instantaneous events, i.e., ai.start_ts < ai.end_ts. In this case, alert delays have to be represented by a trapezoidal fuzzy set. Note that smoothing parameters are configurable values, as well as fuzzy set shapes, in order to be as generic as possible w.r.t. the particular network environment; in the most general case, such parameters should be estimated before the system deployment. In Figure . the measured value of ai.start_ts is 0.4, while ai.end_ts = 0.8; moreover, a smoothing factor of . seconds is added as a negative error, allowing the real ai.start_ts to be ..

... Alert pruning

As suggested by the IDMEF Confidence attribute, both the IDS described in Chapter and [Zanero and Savaresi, ; Zanero, b] — which we also use in the following experiments — feature an attack belief value. For a generic alert a, the attack belief represents the deviation of the anomaly score, a.score, from the threshold, T. The concept of anomaly score is typical of anomaly detectors and, even if there is no agreement about its definition, it can intuitively be interpreted as an internal, absolute indicator of abnormality. To complete our approach we consider the attack belief attribute for alert pruning, after the aggregation phase.

Many IDSs, including the ones that we proposed, rely on probabilistic scores and thresholds in order to isolate anomalies from normal activity; thus, we implemented a first, naive belief measure:

Bel(a) = |a.score − Tanomaly| ∈ [0, 1] (.)

We remark that a.score ∈ [0, 1] ∧ Tanomaly ∈ [0, 1] ⇒ Bel(a) ∈ [0, 1]. Also note that in belief theory [Wang and Klir, ; Klir and Folger, ] Bel(B) indicates the belief associated to the hypothesis (or proposition) B; with an abbreviated notation, we indicate Bel(a) meaning the belief of the proposition "the alert a is associated to a real anomaly". In this case, the domain of discourse is the set of all the alerts, which contains both the alerts that are real anomalies and the alerts that are not.

Belief theory has been used to implement complex decision support systems, such as [Shortliffe, ], in which a more comprehensive belief model has been formalized taking into account both belief and misbelief. The event (mis)belief basically represents the amount of evidence available to support the (non-)occurrence of that event. In a similar vein, we exploit both the belief and the a-priori misbelief, the FPR. The FPR tells how much the anomaly detector is likely to report false alerts (i.e., the a-priori probability of reporting false alerts, or the so-called type I error), thus it is a good candidate for modeling the misbelief of the alert. The more the FPR increases, the more the belief associated to an alert should decrease. In the first place, such intuition can be formalized using a linear-scaling function:

Bellin(a) = (1 − FPR) · Bel(a), (.)

where FPR plays the role of the misbelief.

However, such a scaling function (see the dashed-line plot in Figure .) makes the belief decrease too fast w.r.t. the misbelief. As Figure . shows, a smoother rescaling function is the following:

Belexp(a) = (1 − FPR) · Bel(a) · e^FPR (.)


Figure .: A sample plot of Belexp(a) (with |a.score − Tanomaly|, FPR ∈ [0, 1]) compared to the linear scaling function Bellin(a) (dashed line).

The above-defined attack belief is exploited to further reduce the resulting set of alerts. Alerts are reported only if

ai.attack_belief > Tbel,

where Tbel must be set to reasonable values according to the particular environment (in our experiments we used .). The attack belief variable may be modeled using fuzzy variables with, for example, three membership functions labeled as "Low", "Medium", and "High" (IDMEF uses the same three-value assignment for each alert Confidence attribute, and we used crisp numbers for it), but this has not been implemented yet in our working prototype.
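Putting the pieces together, the pruning step reduces to the following sketch; the formula is Equation (.) above, and the parameter values (Tanomaly, Tbel, and the per-IDS FPR) are illustrative configuration inputs, not the ones used in our experiments.

    import math

    def bel_exp(score: float, t_anomaly: float, fpr: float) -> float:
        """Exponentially rescaled attack belief: the naive belief
        |score - T_anomaly| discounted by the a-priori misbelief (FPR)."""
        return (1.0 - fpr) * abs(score - t_anomaly) * math.exp(fpr)

    def prune(alerts, t_anomaly=0.5, t_bel=0.3, fpr=0.05):
        """Report only aggregated alerts whose belief exceeds T_bel."""
        return [a for a in alerts if bel_exp(a["score"], t_anomaly, fpr) > t_bel]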

Let us now summarize the incremental approaches: the first, "naive time-distance", uses a crisp window (as in Figure . (a)) to group alerts w.r.t. their timestamps. The second approach, called "fuzzy time-distance", extends the first method: it relies on fuzzy sets (as shown in Figure . (b)) to take into account uncertainty on timestamping and window sizing. The last version, "fuzzy time-distance with belief-based alert pruning", includes the last explained criterion. In the following we will show the results of the proposed approaches on a realistic test bench we have developed using alerts generated by our IDS on the complete IDEVAL testing dataset (both host and network data).


. Mitigating Evaluation Issues

This section investigates the problem of testing an alert fusion engine. In order to better understand the data we used, the experimental setup is here detailed and, before reporting the results obtained with the fuzzy approach introduced above, we also briefly sketch the theoretical foundations, the accuracy, and the peculiarities of the overall implementation of the IDSs used for gathering the alerts. At the end of the section, we compare the results of the aggregation phase with an analysis of the accuracy of an alternative fusion approach [Qin and Lee, ].

.. A common alert representation

The output format of our prototypes has been designed with IDMEF in mind, to make the normalization process — the first step in Figure . — as straightforward as possible. Being more precise, alerts are reported by our tools as a modified version of the original Snort [Roesch, ] alert format, in order to take into account both the peculiarities of the anomaly detection approach, namely the lack of attack names, and suggestions from the IDMEF data model. In this way, our testing tools can easily implement conversion procedures from and to IDMEF.

Basically, we represent an alert as a tuple with the following attributes:

(ids_id, src, dst, ids_type, start_ts, end_ts, attack_class, attack_bel, target_resource)

The ids_id identifies the IDS, of type ids_type (host or network), that has reported the alert; src and dst are the IP source and destination addresses, respectively; start_ts and end_ts are the detection Network Time Protocol (NTP) [Mills, ] timestamps: they are both mandatory, even if the alert duration is unknown (i.e., start_ts equals end_ts). The discrete attribute attack_class (or name) indicates the name of the attack (if known); anomaly detectors cannot provide such information because recognized attack names are not known, by definition. Instead, anomaly detectors are able to quantify "how anomalous an activity is", so we exploited this characteristic and defined the attack belief. Finally, the target_resource attribute stores the TCP port (or service name), if known, or the program being attacked. An example instance is the following one:

(127.0.0.1-z7012gf, 127.0.0.1, 127.0.0.1, H, 0xbc723b45.0xef449129, 0xbc723b45.0xff449130, -, 0.141044, fdformat(2351))

The example reports an attack detected by a host IDS (ids_type = H), identified by 127.0.0.1-z7012gf: the attack against fdformat started at NTP-second 0xbc723b45 (picosecond 0xef449129) and finished at NTP-second 0xbc723b45 (picosecond 0xff449130); the analyzer detected the activity as an attack with a belief of 0.141044. An equivalent example for the network IDS (ids_type = N) anomaly detector would be something like:

(172.16.87.101/24-a032j11, 172.16.87.103/24, 172.16.87.100/24, N, 0xbc723b45.0xef449129, 0xbc723b45.0xff449130, -, 0.290937, smtp)

Here the attack is against the smtp resource (i.e., protocol) and the analyzer believes there is an attack with a belief of 0.290937.

.. Proposed Evaluation Metrics

Alert fusion is a relatively new problem, thus evaluation techniques are limited to a few approaches [Haines et al., ]. The development of solid testing methodologies is needed from both the theoretical and the practical points of view (a problem shared with IDS testing).

The main goal of fusion systems is to reduce the amount of alerts the security analyst has to check. In doing this, the global DR should ideally not decrease, while the global FPR should be reduced as much as possible. This suggests that the first sub-goal of fusion systems is the reduction of the global FPR (for instance, by reducing the total number of alerts fired by source IDSs through a rejection mechanism). Moreover, the second sub-goal is to limit the decrease of the global DR. Let us now formalize the concepts we just introduced.

We indicate with A_i the alert set fired by the i-th IDS. An alert stream is actually a structure ⟨A_i, ≺⟩ over A_i, where the binary relation ≺ means “occurs before”. More formally:

∀a, a′ ∈ A_i : a ≺ a′ ⇒ a.start_ts < a′.start_ts,

with a.start_ts, a′.start_ts ∈ R+. We assume that common operations between sets, such as the union ∪, preserve the order or, otherwise, that the order can be recomputed; hence, given the union of two alert sets A_k = A_i ∪ A_j, the alert stream (i.e., the ordered set) can always be reconstructed.

In the second place, we define two functions d : A × O → [0, 1] and f : A × O → [0, 1], representing the computation of the DR and the FPR, respectively, that is: DR_i = d(A_i, Ō) and FPR_i = f(A_i, Ō). The set O contains each and every true alert; Ō is the particular instance of O containing the alerts occurring in the system under consideration (i.e., those listed in the truth file). We can now define the global DR, DR_A, and the global FPR, FPR_A, as:

A = ∪_{i∈I} A_i    (.)

DR_A = d(A, Ō)    (.)

FPR_A = f(A, Ō)    (.)

The output of a generic alert fusion system is another alert stream A′ with |A′| ≤ |A|; of course, |A′| = |A| is a degenerative case. To measure the “impact” of an alert fusion system we define the Alert Reduction Rate (ARR), which quantifies:

ARR = |A′| / |A|.    (.)

The aforementioned sub-goals can now be formalized into two performance metrics. The ideal correlation system would maximize both ARR and DR_A′/DR_A; meanwhile, it would minimize FPR_A′/FPR_A. Of course, ARR = 1 and ARR = 0 are degenerative cases. A low (i.e., close to 0.0) FPR_A′/FPR_A means that the fusion system has significantly decreased the original FPR; in a similar vein, a high (i.e., close to 1.0) DR_A′/DR_A means that, while reducing false positive alerts, the loss of detection capabilities is minimal.

Therefore, a fusion system is better than another if it has both a greater DR_A′/DR_A and a lower FPR_A′/FPR_A than the latter. This criterion by no means has to be considered complete or exhaustive.
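The following minimal sketch makes these metrics concrete (Python; toy alert identifiers stand in for real alert sets, and the truth file is modeled as a set of true alerts):

    def d(alerts, truth):
        """Detection rate: fraction of true alerts that appear in the stream."""
        return len(alerts & truth) / len(truth) if truth else 0.0

    def f(alerts, truth):
        """False positive rate: fraction of the stream not matching any true alert."""
        return len(alerts - truth) / len(alerts) if alerts else 0.0

    streams = [{"a1", "a2", "fp1"}, {"a2", "a3", "fp2"}]   # A_i, toy data
    truth = {"a1", "a2", "a3", "a4"}                       # alerts in the truth file

    A = set().union(*streams)            # A = union of the A_i
    A_prime = {"a1", "a2", "a3"}         # output of a hypothetical fusion system

    arr = len(A_prime) / len(A)                    # ARR = |A'| / |A|
    dr_ratio = d(A_prime, truth) / d(A, truth)     # should stay close to 1.0
    fpr_ratio = f(A_prime, truth) / f(A, truth)    # should drop toward 0.0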


[Figure: hypothetical plots of (a) Global DR vs. Alert Reduction Rate and (b) Global FPR vs. Alert Reduction Rate, with curves for best, average, and poor performances.]

Figure .: Hypothetical plots of (a) the global DR, DR_A′, vs. ARR and (b) the global FPR, FPR_A′, vs. ARR.


In addition, it is useful to compare the DR_A′ and FPR_A′ plots, vs. ARR, of different correlation systems, obtaining diagrams like the one exemplified in Figure .; this gives a graphical idea of which correlation algorithm is better. For instance, Figure . shows that the algorithm labeled as “Best performances” is better than the others, because it shows a higher FPR_A′ reduction while DR_A′ does not significantly decrease.

Note .. (Configuration parameters) One limitation of our approach is that it relies on the attack belief threshold. Currently, there is no rigorous procedure for deriving it in practice. In our experiments, we ran the aggregation algorithm several times on the same dataset under the same conditions until acceptable values of DR and FPR were reached.

Similarly, the current version of our algorithm relies on alpha-cut values for deciding whether or not two alerts can be fused. Note, however, that this is different from setting crisp thresholds on time, e.g., fusing alerts that are delayed by at most a fixed number of seconds. Such a threshold would have a direct impact on the fusion procedure; instead, different values of the alpha cuts and shapes of the trapezoidal fuzzy sets have a much smoother effect on the results. In addition, a user of our system may not be aware of what it means for two alerts to be, say, a few seconds apart; it might be easier for the user to just configure the system to fuse alerts that are “close”.
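As an illustration, the following is a minimal sketch of such a fuzzy closeness check with an alpha cut (the parameter values are hypothetical, chosen in the spirit of the T_near fuzzy set used in our experiments):

    def closeness(distance, window, smoothing):
        """Trapezoidal membership of a time distance in the fuzzy set "near":
        1 within the window, decaying linearly to 0 over the smoothing margin."""
        if distance <= window:
            return 1.0
        if distance >= window + smoothing:
            return 0.0
        return (window + smoothing - distance) / smoothing

    def can_fuse(ts1, ts2, window=1.0, smoothing=1.5, alpha=0.5):
        """Fuse two alerts only if their fuzzified closeness clears the alpha cut."""
        return closeness(abs(ts1 - ts2), window, smoothing) >= alpha

    print(can_fuse(10.0, 10.8))   # well within the window  -> True
    print(can_fuse(10.0, 12.3))   # beyond the smoothed margin -> False

Small changes in window or smoothing shift the decision boundary gradually, which is exactly the smoother behavior discussed above.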

.. Experimental Results

In these experiments we used two prototypes for network- and host-based anomaly detection. In particular, we used the system described in Chapter  (host-based) and the one in [Zanero and Savaresi, ; Zanero, b] (network-based). These tools were used to generate alert streams for testing the three variants of the fuzzy aggregation algorithm we proposed. The same tools were also used to generate alert data for testing our correlation algorithm detailed in Section ..

Because of the well-known issues of IDEVAL and to avoid biased interpretations of the results, in our testing we used the IDEVAL dataset with the following simplification: we just tried to fuse the streams of alerts coming from a HIDS sensor and a NIDS sensor monitoring the whole network. To this end, we ran


the two prototypes on the whole IDEVAL testing dataset, using the network dumps and the host-based logs from pascal. We ran the NIDS prototype on tcpdump data and collected alerts for attacks against the host pascal.eyrie.af.mil. The NIDS also generated alerts related to other hosts. Using the HIDS prototype we generated alerts from the dumps of the host pascal.eyrie.af.mil. The NIDS was capable of detecting almost  of the attacks with less than .  of false positives; the HIDS performs even better, with a DR of  and .  of false positives. However, in this experiment we focus on the aggregated results rather than on the detection capabilities of the single systems.

In the following, we use this shorthand notation: Net is the stream of all the alerts generated by the NIDS. HostP is the substream of all the alerts generated by the HIDS installed on pascal.eyrie.af.mil, while NetP regards all the alerts (with pascal as a target) generated by the NIDS; finally, NetO = Net \ NetP indicates all the alerts (with all but pascal as a target) generated by the NIDS.

Using the data generated as described above, and the metrics proposed in Section .., we compared three different versions of the alert aggregation algorithm described in Section .. In particular, we compare the use of crisp time-distance aggregation; the use of a simple fuzzy time-distance aggregation; and, finally, the use of the attack belief for alert pruning.

Numerical results are plotted in Figure . for different values of ARR. As discussed in Section ., Figure . (a) refers to the reduction of DR while Figure . (b) focuses on FPR. DR_A′ and FPR_A′ were calculated using the complete alert stream, network and host, at different values of ARR. The values of ARR are obtained by changing the parameter values: in particular, we set T_bel = 0.66, the alpha cut of T_near to 0.5, and the window size to 1.0 seconds, and we varied the smoothing of the trapezoid between 1.0 and 1.5 seconds and the alert delay between 0 and 1.5 seconds. It is not useful to plot the increase in false negatives, as it can easily be deduced from the decrease in DR.

The last aggregation algorithm, denoted as “Fuzzy (belief)”, shows better performances, since DR_A′ is always higher with respect to the other two aggregation strategies; this algorithm also causes a significant reduction of FPR_A′.



Figure .: Plot of the DR_A′ (a) and FPR_A′ (b) vs. ARR. “Crisp” refers to the use of the crisp time-distance aggregation; “Fuzzy” and “Fuzzy (belief)” indicate the simple fuzzy time-distance aggregation and the use of the attack belief for alert discarding, respectively.


Note that taking into account the attack belief attribute makes the difference, because it prevents true positives from being discarded; on the other hand, real false positives are not reported in the output alert stream because of their low belief.

It is difficult to properly compare our approach with the other fusion approaches proposed in the reviewed literature, because the latter were not specifically tested from the aggregation point of view, separately from the correlation one. Since the prototypes used to produce those results are not released, we are only able to compare our approach with the overall fusion systems presented by others.


. Using Non-parametric Statistical Tests

In this section we analyze the use of different types of statistical tests for the correlation of anomaly detection alerts. We show that the technique proposed in [Qin and Lee, ] and summarized in Section ., one of the few proposals that can be extended to the anomaly detection domain, strongly depends on good choices of a parameter which proves to be both sensitive and difficult to estimate. Rather than using the Granger Causality Test (GCT), we propose an alternative approach [Maggi and Zanero, ] based on a set of simpler statistical tests. We proved that our criteria work well on a simplified correlation task, without requiring complex configuration parameters. More precisely, in [Maggi and Zanero, ] we explored the use of statistical causality tests, which have been proposed for the correlation of IDS alerts and which could be applied to anomaly-based IDSs as well. We redefine the problem in terms of a simpler statistical test, and experimentally validate it.

.. The Granger Causality Test

We tested the sensitivity of the GCT to the choice of two parameters: the sampling time of the time series that represent the alerts, w, and the time lag l, i.e., the order of the AR model being fitted. In [Qin and Lee, ] the sampling time was arbitrarily set to w = 60s, while the choice of l is not documented. However, our experiments show that the choice of these parameters can strongly influence the results of the test. As explained in [Maggi et al., b], other values of l lead to awkward results.

We used the same dataset described in Section .., with the following approach: we tested whether NetP has any causal relationship with HostP; we also tested whether HostP ↛ NetP. (In the following, we use the symbol “→” to denote “Granger-causes”: for instance, “A → B” reads “A Granger-causes B”; the symbol “↛” indicates the negated relationship, i.e., “does not Granger-cause”.) Since only a reduced number of alerts was available, and since it was impossible to aggregate alerts along the “attack name” attribute (unavailable on anomaly detectors), we preferred to construct only two large time series: one from NetP (denoted as NetP(k)) and one from HostP (denoted as HostP(k)). In the experiment reported in [Qin and Lee, ] the sampling time, w, was fixed at 60 seconds,


although we tested different values from  to  seconds. In our simple experiment, the expected result is that NetP → HostP and that HostP ↛ NetP.
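To make the sensitivity experiment concrete, the following sketch shows one way to bin alert timestamps into time series with sampling time w and run a GCT over a range of lags. It is an illustration under stated assumptions, not the code used in our experiments: the timestamps are synthetic stand-ins, and grangercausalitytests is the statsmodels implementation of the test.

    import numpy as np
    from statsmodels.tsa.stattools import grangercausalitytests

    def alerts_to_series(timestamps, w, t_start, t_end):
        """Bin alert timestamps into per-window counts (sampling time w, seconds)."""
        edges = np.arange(t_start, t_end + w, w)
        counts, _ = np.histogram(timestamps, bins=edges)
        return counts

    # Hypothetical alert timestamps over one day (seconds since start)
    netp_ts = np.sort(np.random.uniform(0, 86400, 500))
    hostp_ts = np.sort(netp_ts[:300] + np.random.gamma(3.0, 50.0, 300))

    w = 60.0                       # sampling time, as in [Qin and Lee]
    netp = alerts_to_series(netp_ts, w, 0.0, 86400.0)
    hostp = alerts_to_series(hostp_ts, w, 0.0, 86400.0)

    # grangercausalitytests checks whether the *second* column Granger-causes
    # the first: here, whether NetP(k) -> HostP(k). The p-value varies with l.
    data = np.column_stack([hostp, netp])
    for lag in (1, 3, 5, 10):
        res = grangercausalitytests(data, maxlag=lag, verbose=False)
        p = res[lag][0]["ssr_ftest"][1]
        print(f"l = {lag}: p-value = {p:.3f}")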

Since the first experiment (w = 60) led us to unexpected results (i.e., using a lag, l, of 3 minutes, both NetP(k) ↛ HostP(k) and vice versa), we decided to plot the test results (i.e., p-value and GCI) vs. both l and w. In Figure . (a), l is reported in minutes while the experiment has been performed with w = 60s; the dashed line is the p-value of the test “NetP(k) → HostP(k)”, the solid line that of the opposite one. As one may notice, neither the first nor the second test passed, thus nothing can be concluded: fixing the test significance at α = 0.20 is the only way to reject the null hypothesis (around l ≃ 2.5 minutes) and to conclude both that NetP → HostP and that HostP ↛ NetP; the GCI plotted in Figure . (b) confirms this result. However, for other values of l the situation changes, leading to opposite results.

Figure . shows the results of the test for w = 1800 seconds and l ∈ [50, 300] minutes. If we suppose a test significance of α = 0.05 (dotted line), for l ≃ 120 the result is that HostP → NetP while NetP ↛ HostP, the opposite of the previous one. Moreover, for other values of l the p-value leads to different results.

The last experiment results are shown in Figure .: for l > 230 minutes, one may conclude that NetP → HostP and HostP ↛ NetP, i.e., the expected result, even if the p-value for the second test is close to α = 0.05.

The previous experiments show only that the GCT failed on the dataset we generated, not that it does not work well in general. It may be the case that it is not suitable for “blind” alerts (i.e., without any information about the attack) reported by anomaly detectors: in fact, [Qin and Lee, ] used time series built from hyper-alerts resulting from an aggregation along all attributes; instead, in the anomaly-based case, the lack of attack names does not allow such hyper-alerts to be correctly constructed. In any case, the test result significantly depends on l (i.e., the test is asymptotic); this means that an appropriate choice of l will be required, depending on specific environmental conditions. The optimal l is also likely to change over time in a given environment.

A possible explanation is that the GCT is significant only ifboth the linear regression models are optimal, in order to calculate



Figure .: p-value (a) and GCI (b) vs. l (in minutes) for the first GCT experiment (w = 60.0 seconds): “NetP(k) → HostP(k)” (dashed line), “HostP(k) → NetP(k)” (solid line).



Figure .: p-value (a) and GCI (b) vs. l (in minutes) for the GCT experiment with w = 1800.0 seconds: “NetP(k) → HostP(k)” (dashed line), “HostP(k) → NetP(k)” (solid line).


the correct residuals. If we use the Akaike Information Criterion (AIC) [Akaike, ] to estimate the optimal time lag l over different windows of data, we find out that p wildly varies over time, as shown in Figure .. The fact that there is no stable optimal choice of l, combined with the fact that the test result significantly depends on it, makes us doubt that the Granger causality test is a viable option for general alert correlation. The choice of w seems equally important and even more difficult to perform, except by guessing.
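As an illustration of this instability, one can re-estimate the AIC-optimal AR order over consecutive windows of the series; a minimal sketch, assuming statsmodels is available and that netp is a binned alert series like the one built in the previous sketch:

    from statsmodels.tsa.ar_model import AutoReg

    def aic_optimal_lag(series, max_lag=30):
        """Return the AR order minimizing the Akaike Information Criterion."""
        aics = {p: AutoReg(series, lags=p).fit().aic for p in range(1, max_lag + 1)}
        return min(aics, key=aics.get)

    # Estimate the optimal order over consecutive windows of the series;
    # on our data, the chosen order varied widely from window to window.
    window = 500
    for start in range(0, len(netp) - window, window):
        chunk = netp[start:start + window]
        print(start, aic_optimal_lag(chunk))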

Of course, our testing is not conclusive: the IDEVAL alert sets may simply not be adequate for showing causal relationships. Another, albeit more unlikely, explanation is that the GCT may not be suitable for anomaly detection alerts: in fact, in [Qin and Lee, ] it was tested on misuse alerts. But there are also theoretical reasons to doubt that the application of the Granger test can lead to stable, good results. First, the test is asymptotic w.r.t. l, meaning that the reliability of the results decreases as l increases because of the loss of degrees of freedom. Second, it is based on the strong assumption of linearity in the auto-regressive model fitting step, which strongly depends on the observed phenomenon. In the same way, the stationarity assumption of the model does not always hold.

.. Modeling Alerts as Stochastic Processes

Instead of interpreting alert streams as time series (as proposed by the GCT-based approach), we propose to change the point of view by using a stochastic model in which alerts are modeled as (random) events in time. This proposal can be seen as a formalized extension of the approach introduced in [Valeur et al., ] (see Figure .), which correlates alerts if they are fired by different IDSs within a “negligible” time frame, where “negligible” is defined with a crisp, fixed threshold.

For simplicity, once again we describe our technique in the simple case of a single HIDS and a single NIDS monitoring the whole network. The concepts, however, can be easily generalized to take into account more than two alert streams, by evaluating them couple by couple. For each alert, we have three essential pieces of information: a timestamp, a “target” host (fixed, in the case of the HIDS, to the host itself), and the generating sensor (in our case, a binary value).



Figure .: p-value (a) and GCI (b) vs. l (in minutes) for the GCT experiment with w = 3600.0 seconds: “NetP(k) → HostP(k)” (dashed line), “HostP(k) → NetP(k)” (solid line).


[Figure: the AIC-optimal lag p plotted against the data range (×10³).]

Figure .: The optimal time lag p = l given by the AIC criterion strongly varies over time.

With a self-explanatory notation, we define the following random variables: T_NetP are the arrival times of the network alerts in NetP (T_NetO and T_HostP are similarly defined); ε_NetP (ε_NetO) are the delays (caused by transmission, processing, and different granularity in detection) between a specific network-based alert regarding pascal (respectively, not pascal) and the corresponding host-based one. The actual values of each T_(·) are nothing but the set of timestamps extracted from the corresponding alert stream. We reasonably assume that ε_NetP and T_NetP are stochastically independent (the same is assumed for ε_NetO and T_NetO). Figure . shows how the delays between network and host alerts are calculated.

In an ideal correlation framework with two equally perfect IDSs (i.e., with 100% DR and 0% FPR), if two alert streams are correlated (i.e., they represent independent detections of the same attack occurrences by different IDSs [Valeur et al., ]), they are also “close” in time. NetP and HostP should evidently be an example of such a couple of streams. Obviously, in the real world, some alerts will be missing (because of false negatives, or simply because some of the attacks are detectable only by a specific type of detector), and the distances between related alerts will therefore have a higher variability. In order to account for this, we can “cut off” alerts that are too far away from a corresponding alert in the other time series, presuming them to be singletons. In our case, knowing that single attacks did not last more than 400s in the original dataset, we tentatively set the cutoff threshold at this point.

Under the given working assumptions, the proposed stochasticmodel is used to formalize the correlation problem as a set of twostatistical hypothesis tests:



Figure .: How delays between network and host alerts are calculated.

H0 : T_HostP ≠ T_NetP + ε_NetP    vs.    H1 : T_HostP = T_NetP + ε_NetP    (.)

H0 : T_HostP ≠ T_NetO + ε_NetO    vs.    H1 : T_HostP = T_NetO + ε_NetO    (.)

Let {t_{i,k}} be the observed timestamps of T_i, ∀i ∈ {HostP, NetP, NetO}. The meaning of the first test is straightforward: within a random amount of time, ε_NetP, the occurrence of a host alert, t_{HostP,k}, is preceded by a network alert, t_{NetP,k}. If this does not happen for a statistically significant number of events, the test result is that the alert stream T_NetP is uncorrelated to T_HostP; in this case, we have enough statistical evidence for rejecting H1 and accepting the null hypothesis. Symmetrically, rejecting the null hypothesis of the second test means that the NetO alert stream (regarding all hosts but pascal) is correlated to the alert stream regarding pascal.

Note that the above two tests are strongly related to each other: in an ideal correlation framework, it cannot happen that both “NetP is correlated to HostP” and “NetO is correlated to HostP” hold: this would imply that the network activity regarding all hosts but pascal (which raises NetO) has to do with the host activity of pascal (which raises HostP) with the same order of magnitude as NetP, which is an intuitively contradictory conclusion. Therefore, the second test acts as a sort of “robustness” criterion.

From our alerts, we can compute a sample of ε_NetP by simply picking, for each value in NetP, the value in HostP which is closest but greater (applying the threshold defined above). We can do the same for ε_NetO, using the alerts in NetO and HostP.
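A minimal sketch of this sampling procedure (the helper name is ours; assuming NumPy and the 400s cutoff discussed above):

    import numpy as np

    def delay_sample(net_ts, host_ts, cutoff=400.0):
        """For each network alert, pick the closest host alert that follows it;
        pairs farther apart than the cutoff are presumed singletons and dropped."""
        host = np.sort(np.asarray(host_ts))
        eps = []
        for t in np.asarray(net_ts):
            i = np.searchsorted(host, t, side="right")  # first host alert after t
            if i < len(host) and host[i] - t <= cutoff:
                eps.append(host[i] - t)
        return np.asarray(eps)

    # eps_netp = delay_sample(netp_ts, hostp_ts)   # sample of eps_NetP
    # eps_neto = delay_sample(neto_ts, hostp_ts)   # sample of eps_NetO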


[Figure: histograms of the delay samples e_NetP (HostP − NetP) and e_NetO (HostP − NetO) with the estimated densities, and the corresponding Q-Q plots against the estimators.]

Figure .: Histograms vs. estimated density (red dashes) and Q-Q plots, for both f_O and f_P.

The next step involves the choice of the distributions of the random variables defined above. Typical distributions used for modeling random occurrences of timed events fall into the family of exponential Probability Density Functions (PDFs) [Pestman, ]. In particular, we decided to fit them with Gamma PDFs, because our experiments show that such a distribution is a good choice for both ε_NetP and ε_NetO.

The estimation of the PDF of ε_NetP, f_P := f_{ε_NetP}, and of ε_NetO, f_O := f_{ε_NetO}, is performed using the well-known Maximum Likelihood (ML) technique [Venables and Ripley, ] as implemented in the GNU S package; the results are summarized in Figure ..



Figure .: Histograms vs. estimated density (red dashes) for both f_O and f_P (IDEVAL ).

f_P and f_O are approximated by Gamma(3.0606, 0.0178) and Gamma(1.6301, 0.0105), respectively (the standard errors on the parameters are 0.7080, 0.0045 for f_P and 0.1288, 0.009 for f_O). From now on, the estimator of a given density f will be indicated as f̂.

Figure . shows histograms vs. the estimated densities (red, dashed lines) and quantile-quantile plots (Q-Q plots) for both f_O and f_P. We recall that Q-Q plots are an intuitive graphical “tool” for comparing data distributions by plotting the quantiles of the first distribution against the quantiles of the other one.

Considering that the sample sizes of the ε_(·) are around , the Q-Q plots empirically confirm our intuition: in fact, f̂_O and f̂_P are both able to explain the real data well, within inevitable but negligible estimation errors. Even though f̂_P and f̂_O are both Gamma-shaped, it must be noticed that they significantly differ in their parametrization; this is a very important result, since it allows us to set up a proper criterion to decide whether or not ε_NetP and ε_NetO are generated by the same phenomenon.


Given the above estimators, a more precise and robust hypothesis test can now be designed. Tests . and . can be mapped into two-sided Kolmogorov-Smirnov (KS) tests [Pestman, ], achieving the same result in terms of decisions:

H0 : ε_NetP ∼ f̂_P    vs.    H1 : ε_NetP ≁ f̂_P    (.)

H0 : ε_NetO ∼ f̂_O    vs.    H1 : ε_NetO ≁ f̂_O    (.)

where the symbol ∼ means “has the same distribution as”. Since we do not know the real PDFs, the estimators are used in their stead. We recall that the KS test is a non-parametric test that compares a sample (or a PDF) against a PDF (or a sample) to check how much they differ from each other (or how well they fit). Such tests can be performed, for instance, with ks.test() (a GNU R native procedure): the resulting p-values on IDEVAL  are 0.83 and 0.03, respectively.

Noticeably, there is significant statistical evidence to accept the null hypothesis of Test .. It seems that the ML estimation is capable of correctly fitting a Gamma PDF for f_P (given the ε_NetP samples), which double-checks our intuition about the distribution. The same does not hold for f_O: in fact, it cannot be correctly estimated with a Gamma PDF from ε_NetO. The low p-value for Test . confirms that the distribution of the ε_NetO delays is completely different from that of ε_NetP. Therefore, our criterion does not only recognize noisy delay-based relationships among alert streams if they exist; it is also capable of detecting when such a correlation does not hold.

We also tested our technique on the alerts generated by our NIDS and HIDS running on IDEVAL  (limiting our analysis to the first four days of the first week), in order to cross-validate the above results. We prepared and processed the data with the same procedures we described above for the  dataset. Starting from almost the same proportion of host/net alerts against either pascal or other hosts, the ML estimation computed the two Gamma densities shown in Figure .: f̂_P and f̂_O are approximated by Gamma(3.5127, 0.1478) and Gamma(1.3747, 0.0618), respectively (the standard errors on the estimated parameters are 1.3173, 0.0596 for f̂_P and 0.1265, 0.0068 for f̂_O). These parameters are very similar to the ones we estimated for the IDEVAL  dataset. Furthermore,


         IDEVAL 1999                        IDEVAL 1998
         α                β                 α                β
f̂_P     3.0606 (0.7080)  0.0178 (0.0045)   3.5127 (1.3173)  0.1478 (0.0596)
f̂_O     1.6301 (0.1288)  0.0105 (0.009)    1.3747 (0.1265)  0.0618 (0.0068)

Table .: Estimated parameters (and their estimation standard errors) for the Gamma PDFs on IDEVAL 1999 and 1998.

with p-values of . and ., the two KS tests confirm the samestatistical discrepancies we observed on the dataset.

The above numerical results show that, by interpreting alert streams as random processes, there are several (stochastic) dissimilarities between net-to-host delays belonging to the same net-host attack session and net-to-host delays belonging to different sessions. Exploiting these dissimilarities, we may find the correlation among streams in an unsupervised manner, without the need to predefine any parameter.


. Concluding Remarks

In this chapter we first described and evaluated a technique which uses fuzzy sets and measures to fuse alerts reported by anomaly detectors. After a brief framework description and a precise problem statement, we analyzed the previous literature about alert fusion (i.e., aggregation and correlation), and found that effective techniques have been proposed, but they are not really suitable for anomaly detection, because they require a priori knowledge (e.g., attack names or division into classes) to perform well.

Our proposal defines simple but robust criteria for computing the time distance between alerts, in order to take into account uncertainty in both the measurements and the threshold-distance sizing. In addition, we considered the implementation of a post-aggregation phase to remove non-aggregated alerts according to their belief, a value indicating how much the IDS believes the detected attack to be real. Moreover, we defined and used some simple metrics for the evaluation of alert fusion systems. In particular, we propose to plot both the DR and the FPR vs. the degree of reduction of the output alert stream with respect to the size of the input alert stream.

We performed experiments to validate our proposal. To this end, we used two prototypes we previously developed: a host anomaly detector that exploits the analysis of system call arguments and behavior, and a network anomaly detector based on unsupervised payload clustering and classification techniques, which enables an effective outlier detection algorithm to flag anomalies. During our experiments, we were able to outline many shortcomings of the IDEVAL dataset (the only available IDS benchmark) when used for evaluating alert fusion systems. In addition to known anomalies in network and host data, IDEVAL is outdated both in terms of background traffic and attack scenarios.

Our experiments showed that the proposed fuzzy aggregation approach is able to decrease the FPR at the price of a small reduction of the DR (a necessary consequence). The approach defines the notion of “closeness” in time as the natural extension of the naive, crisp way; to this end, we rely both on fuzzy set theory and on fuzzy measures to semantically ground the concept of “closeness”. By definition, our method is robust because it takes into account major uncertainties on timestamps; this means the choice of the window size is less sensitive to fluctuations in the network delays because of


the smoothing allowed by the fuzziness of the window itself. Of course, if the delays vary too much, a dynamic resizing is still necessary. The biggest challenge with our approach would be its extension to the correlation of distributed alerts: in its current state, our modeling is not complete, but it can potentially be extended in such a way, the main difficulty being the lack of alert features.

Secondly, we analyzed the use of different types of statistical tests for the correlation of anomaly detection alerts, a problem which has little or no solutions available today. One of the few correlation proposals that can be applied to anomaly detection is the use of a GCT. After discussing a possible testing methodology, we observed that the IDEVAL datasets traditionally used for evaluation have various shortcomings, which we partially addressed by using the data for a simpler scenario of correlation, investigating only the link between a stream of host-based alerts for a specific host and the corresponding stream of alerts from a network-based detector.

We examined the usage of a GCT as proposed in earlier works, showing that it relies on the choice of non-obvious configuration parameters which significantly affect the final result. We also showed that one of these parameters (the order of the models) is absolutely critical, but cannot be uniquely estimated for a given system. Instead of the GCT, we proposed a simpler statistical model of alert generation, describing alert streams and timestamps as stochastic variables, and showed that statistical tests can be used to create a reasonable criterion for distinguishing correlated and non-correlated streams. We proved that our criteria work well on the simplified correlation task we used for testing, without requiring complex configuration parameters.

These were exploratory works, and further investigations on longer sequences of data, as well as further refinements of the tests and the criteria we proposed, are surely needed. Another possible extension of this work is the investigation of how these criteria can be used to correlate anomaly-based and misuse-based alerts together, in order to bridge the gap between the existing paradigms of intrusion detection. In particular, another possible correlation criterion could be built upon the observation that a compromised host behaves differently (e.g., its response time increases).

Conclusions

The main question that we attempted to answer with our research work is, basically, to what extent classic ID approaches can be adapted and integrated to mitigate today's Internet threats. Examples of such threats include attacks against web applications like SQL injections, client-side malware that forces the browser to download viruses or connect to botnets, and so forth. Typically, these attempts are labeled as malicious activities. The short answer to the question is the following.

As long as the technologies (e.g., applications, protocols, devices) prevent us from reaching a sophistication level such that malicious activity is seamlessly camouflaged as normal activity, ID techniques will constitute an effective countermeasure. This is because IDSs — in particular those that leverage anomaly-based techniques — are specifically designed to detect unexpected events in a computer infrastructure. The crucial point of anomaly-based techniques is that they are conceived to be generic. In principle, they indeed make no difference between an alteration of a process' control flow caused by an attempt to interpret crafted JavaScript code and one due to a buffer overflow being exploited. Thus, as long as the benign activity of a system is relatively simple to model, ID techniques are the building block of choice to design effective protections.


Like many other research efforts, this work is meant to be a longer and more articulated answer to the aforementioned question. The concluding remarks of Chapters , , and  summarize the results of our contributions to mitigate the issues we identified in host and web IDSs, and in alert correlation systems. A common line of the works presented in this thesis is the problem of false detections, which is certainly one of the most significant barriers to the wide adoption of anomaly-based systems. In fact, the tools that are available to the public already offer superb detection capabilities that can recognize all the known threats and, in theory, are effective also against unknown malicious activities. However, a large share of the alerts fired by these tools are negligible, either because they regard threats that do not apply to the actual scenario (e.g., unsuccessful attacks or vulnerable software version mismatches) or because the recognized anomaly does not reflect an attack at all (e.g., a software is upgraded and a change is confused with a threat). False detections mean time spent by the security officers to investigate the possible causes of the attack; thus, false detections mean costs.

We demonstrated that most of the false detections, especially false positives, can be prevented by carefully designing the models of normal activity. For example, we have been able to suppress many false positives caused by too-strict model constraints. In particular, we substituted the “crisp” checks performed by deterministic relations learned over system call arguments with “smoother” models (e.g., Gaussian distributions instead of simple ranges) that, as we have shown in Section .., preserve the detection capabilities and decrease the rate of false alerts.

We have also shown that another good portion of false positives can be avoided by solving training issues. In particular, our contributions on the detection of attacks against web applications have identified that, in certain cases, training is responsible for more than  of the false positives. We proposed to solve this issue by dynamically updating the models of benign activity while the system is running. Even though our solution may pose new, limited risks in some situations, it is capable of suppressing all the false detections due to incomplete training; and, given the low base rate of attacks we pointed out in Section .., the resulting system offers a good

Recall that a large portion of the systems proposed in the literature have beentested using Snort as a baseline, as we argue in Section .

balance between protection and costs (due to false positives).Another important, desirable feature of an IDS is its capability

of recognizing logically-related alerts. Note, however, that this is nothing but a slightly more sophisticated alert reduction mechanism, which once again means decreasing the effort of the security officer. In fact, as we have shown in the last chapter of this work, alert aggregation techniques can be leveraged to reduce the amount of false detections, not just to compress them into more compact reports. However, alert correlation is a very difficult task and many efforts have been proposed to address it. Our point is that the research on this topic is still very limited and no common directions can be identified, as we highlighted in Section ..

The main future directions of our research have already been delineated in each chapter's conclusions. In addition, there are a few ideas that are not mature enough and thus have not been developed yet, even though they were planned to be included in this work. Regarding client-side protection mechanisms for browsers, we are designing a Firefox extension to learn the structure of benign web pages in terms of the objects of different types they typically contain. Here, the term “structure” refers both to the types of the objects (e.g., images, videos) and to the order these objects are fetched with, and from which domains. We believe that this is a minimal set of features that can be used to design very simple classification features to distinguish between legit and malicious pages. One of our goals is to detect pages that contain an unexpected “redirection pattern” due to <iframe />s leveraged by drive-by downloads.

Regarding server-side protection techniques, we are currently investigating the possibility of mapping some features of an HTTP request, such as the order of the parameters or the characteristics of their content, to the resulting SQL query, if any. If such a mapping is found, many of the existing anomaly detection techniques used to protect the DBs can be coupled with web-based IDSs to offer a double-layer protection mechanism. This idea can be leveraged to distinguish between high-risk requests, which permanently modify a resource, and low-risk ones, which only temporarily affect a web page.

Bibliography

H. Akaike. A new look at the statistical model identification. Automatic Control, IEEE Transactions on, ():–, .

Jesse Alpert and Nissan Hajaj. We knew the web was big... Available online at http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html, Jul .

J. P. Anderson. Computer security threat monitoring and surveillance. Technical report, J. P. Anderson Co., Ft. Washington, Pennsylvania, April .

Thomas Augustine, US Naval Academy, Ronald C. Dodge, and US Military Academy. Cyber defense exercise: Meeting learning objectives thru competition, .

S. Axelsson. Intrusion detection systems: A survey and taxonomy. Technical report, Chalmers University of Technology, Dept. of Computer Engineering, Göteborg, Sweden, March a.

Stefan Axelsson. The Base-Rate Fallacy and the Difficulty of Intrusion Detection. ACM Transactions on Information and System Security, ():–, August b.

Hal Berghel. Hiding data, forensics, and anti-forensics. Communications of the ACM, ():–, .

S. Bhatkar, A. Chaturvedi, and R. Sekar. Dataflow anomaly detection. Security and Privacy, IEEE Symposium on, page , May .


Damiano Bolzoni, Sandro Etalle, Pieter H. Hartel, and Emmanuele Zambon. Poseidon: a 2-tier anomaly-based network intrusion detection system. In IWIA, pages –. IEEE Computer Society, .

Damiano Bolzoni, Bruno Crispo, and Sandro Etalle. Atlantides: an architecture for alert verification in network intrusion detection systems. In LISA ’: Proceedings of the st conference on Large Installation System Administration, pages –, Berkeley, CA, USA, . USENIX Association.

Mariusz Burdach. In-memory forensics tools. Available online at http://forensic.seccure.net/, .

J. B. D. Cabrera, L. Lewis, and R. K. Mehara. Detection and classification of intrusion and faults using sequences of system calls. ACM SIGMOD Record, (), .

Sanghyun Cho and Sungdeok Cha. SAD: web session anomaly detection based on parameter estimation. In Computers & Security, volume , pages –, .

Privacy Rights Clearinghouse. A chronology of data breaches. Technical report, Privacy Rights Clearinghouse, July .

William W. Cohen. Fast effective rule induction. In Armand Prieditis and Stuart Russell, editors, Proceedings of the th International Conference on Machine Learning, pages –, Tahoe City, CA, Jul . Morgan Kaufmann.

W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Proceedings of the IJCAI- Workshop on Information Integration on the Web (IIWeb-), .

Core Security Technologies. CORE Impact. http://www.coresecurity.com/?module=ContentMod&action=item&id=32, .

Manuel Costa, Jon Crowcroft, Miguel Castro, Antony Rowstron, Lidong Zhou, Lintao Zhang, and Paul Barham. Vigilante: end-to-end containment of internet worms. SIGOPS Oper. Syst. Rev., ():–, .

Gabriela F. Cretu, Angelos Stavrou, Michael E. Locasto, Salvatore J. Stolfo, and Angelos D. Keromytis. Casting out demons: Sanitizing training data for anomaly sensors. Security and Privacy, IEEE Symposium on, :–, a.

Gabriela F. Cretu, Angelos Stavrou, Michael E. Locasto, Salvatore J. Stolfo, and Angelos D. Keromytis. Casting out Demons: Sanitizing Training Data for Anomaly Sensors. In Proceedings of the IEEE Symposium on Security and Privacy (S&P ), pages –, Oakland, CA, USA, May b. IEEE Computer Society.

Claudio Criscione, Federico Maggi, Guido Salvaneschi, and Stefano Zanero. Integrated detection of attacks against browsers, web applications and databases. In European Conference on Computer Network Defence - ECND , .

Frederic Cuppens and Alexandre Miage. Alert correlation in a cooperative intrusion detection framework. In Proceedings of the  IEEE Symposium on Security and Privacy, page , Washington, DC, USA, . IEEE Computer Society.

O. Dain and R. Cunningham. Fusing heterogeneous alert streams into scenarios. In Proceedings of the ACM Workshop on Data Mining for Security Applications, pages –, Nov .

H. Debar, M. Dacier, and A. Wespi. Towards a taxonomy of intrusion-detection systems. Comput. Networks, ():–, .

H. Debar, M. Dacier, and A. Wespi. A revised taxonomy for intrusion-detection systems. Annals of Telecommunications, ():–, .

H. Debar, D. Curry, and B. Feinstein. The Intrusion Detection Message Exchange Format. Technical report, France Telecom and Guardian and TNT, Mar. .


Herve Debar and Andreas Wespi. Aggregation and correlation of intrusion-detection alerts. In Proceedings of the th International Symposium on Recent Advances in Intrusion Detection, pages –, London, UK, . Springer-Verlag.

Dorothy E. Denning. An Intrusion-Detection Model. IEEE Transactions on Software Engineering, ():–, .

S. Eckmann, G. Vigna, and R. Kemmerer. STATL: An attack language for state-based intrusion detection. In Proceedings of the ACM Workshop on Intrusion Detection, Athens, Nov. .

A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate Record Detection: A Survey. Knowledge and Data Engineering, IEEE Transactions on, ():–, .

Facebook. Statistics. Available online at http://www.facebook.com/press/info.php?statistics, .

Ferruh Mavituna. SQL Injection Cheat Sheet. http://ferruh.mavituna.com/sql-injection-cheatsheet-oku/, June .

Christof Fetzer and Martin Suesskraut. Switchblade: enforcing dynamic personalized system call models. In Eurosys ’: Proceedings of the rd ACM SIGOPS/EuroSys European Conference on Computer Systems, pages –, New York, NY, USA, . ACM.

Stephanie Forrest, Steven A. Hofmeyr, Anil Somayaji, and Thomas A. Longstaff. A sense of self for Unix processes. In Proceedings of the IEEE Symp. on Security and Privacy, Washington, DC, USA, . IEEE Computer Society.

J. C. Foster and V. Liu. Catch me if you can... In Blackhat Briefings, Las Vegas, NV, August . URL http://www.blackhat.com/presentations/bh-usa-05/bh-us-05-foster-liu-update.pdf.


Vanessa Frias-Martinez, Salvatore J. Stolfo, and Angelos D. Keromytis. Behavior-Profile Clustering for False Alert Reduction in Anomaly Detection Sensors. In Proceedings of the Annual Computer Security Applications Conference (ACSAC ), Anaheim, CA, USA, December .

Alessandro Frossi, Federico Maggi, Gian Luigi Rizzo, and Stefano Zanero. Selecting and Improving System Call Models for Anomaly Detection. In Ulrich Flegel and Michael Meier, editors, DIMVA, Lecture Notes in Computer Science. Springer, .

Simson Garfinkel. Anti-Forensics: Techniques, Detection and Countermeasures. In Proceedings of the nd International Conference on i-Warfare and Security (ICIW), pages –, .

S. L. Garfinkel and A. Shelat. Remembrance of data passed: a study of disk sanitization practices. Security & Privacy Magazine, IEEE, ():–, .

M. Geiger. Evaluating Commercial Counter-Forensic Tools. In Proceedings of the th Annual Digital Forensic Research Workshop, .

Jonathon T. Giffin, David Dagon, Somesh Jha, Wenke Lee, and Barton P. Miller. Environment-sensitive intrusion detection. In Proceedings of the Recent Advances in Intrusion Detection (RAID), pages –, .

Gene H. Golub and Charles F. Van Loan. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, .

Sarah Granger. Social engineering fundamentals, part i: Hacker tactics. Available online at http://www.securityfocus.com/infocus/1527, Dec .

Grugq. The art of defiling: defeating forensic analysis. In Blackhat Briefings , Las Vegas, NV, August . URL http://www.blackhat.com/presentations/bh-usa-05/bh-us-05-grugq.pdf.

P. Haccou and E. Meelis. Statistical analysis of behavioural data. An approach based on time-structured models. Oxford University Press, .


Joshua Haines, Dorene Kewley Ryder, Laura Tinnel, and Stephen Taylor. Validation of sensor alert correlators. IEEE Security and Privacy, ():–, .

J. Han and M. Kamber. Data Mining: concepts and techniques. Morgan-Kauffman, .

Ryan Harris. Arriving at an anti-forensics consensus: Examining how to define and control the anti-forensics problem. In Proceedings of the th Annual Digital Forensic Research Workshop (DFRWS ’), volume  of Digital Investigation, pages –, September . URL http://www.sciencedirect.com/science/article/B7CW4-4KCPVBY-4/2/185856560fb7d3b59798e99c909f7754.

Harry. Exploit for CVE--. Available online at http://www.milw0rm.com/exploits/3578, .

S. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. J. of Computer Security, :–, .

Thorsten Holz. A short visit to the bot zoo. IEEE Security & Privacy, ():–, .

Kenneth L. Ingham, Anil Somayaji, John Burge, and Stephanie Forrest. Learning DFA representations of HTTP for protecting web applications. Computer Networks, ():–, April .

S. Jha, K. Tan, and R. A. Maxion. Markov chains, classifiers, and intrusion detection. In CSFW ’: Proc. of the th IEEE Workshop on Computer Security Foundations, page , Washington, DC, USA, . IEEE Computer Society.

D. N. Joanes and C. A. Gill. Comparing Measures of Sample Skewness and Kurtosis. The Statistician, ():–, .

N. F. Johnson and S. Jajodia. Exploring steganography: Seeing the unseen. Computer, ():–, .

Wen-Hua Ju and Y. Vardi. A hybrid high-order Markov chain model for computer intrusion detection. Journal of Computational and Graphical Statistics, :–, .


B. H. Juang and L. R. Rabiner. A probabilistic distance measure for hidden Markov models. AT&T Bell Laboratories Technical Journal, ():–, .

Klaus Julisch and Marc Dacier. Mining intrusion detection alarms for actionable knowledge. In KDD ’: Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages –, New York, NY, USA, . ACM Press.

M. G. Kendall. A New Measure of Rank Correlation. Biometrika, (–):–, June .

George J. Klir and Tina A. Folger. Fuzzy sets, uncertainty, and information. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, .

Calvin Ko, George Fink, and Karl Levitt. Automated detection of vulnerabilities in privileged programs by execution monitoring. In Proc. of the th Ann. Computer Security Applications Conf., volume XIII, pages –. IEEE Computer Society Press, Los Alamitos, CA, USA, .

T. Kohonen. Self Organizing Maps. Springer-Verlag, .

Teuvo Kohonen and Panu Somervuo. Self-organizing maps of symbol strings. Neurocomputing, (–):–, . URL http://www.sciencedirect.com/science/article/B6V10-3V7S70G-3/2/1439043d6f41db0afa22e46c6f2b809a.

J. Z. Kolter and M. A. Maloof. Dynamic weighted majority: An ensemble method for drifting concepts. The Journal of Machine Learning Research, :–, .

C. Kruegel, D. Mutz, F. Valeur, and G. Vigna. On the detection of anomalous system call arguments. In Proceedings of the  European Symp. on Research in Computer Security, Gjøvik, Norway, Oct. a.

Christopher Kruegel and William Robertson. Alert Verification: Determining the Success of Intrusion Attempts. In Proceedings of the Workshop on the Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA ), Dortmund, North Rhine-Westphalia, GR, July .


Christopher Kruegel, Thomas Toth, and Engin Kirda. Service-Specific Anomaly Detection for Network Intrusion Detection. In Proceedings of the Symposium on Applied Computing (SAC ), Spain, March .

Christopher Kruegel, Darren Mutz, William Robertson, and Fredrik Valeur. Bayesian Event Classification for Intrusion Detection. In Proceedings of the Annual Computer Security Applications Conference (ACSAC ), Las Vegas, NV, USA, December b.

Christopher Kruegel, William Robertson, and Giovanni Vigna. A Multi-model Approach to the Detection of Web-based Attacks. Journal of Computer Networks, ():–, July .

Christopher Kruegel, Darren Mutz, William Robertson, Fredrick Valeur, and Giovanni Vigna. LibAnomaly (website). Available online at http://www.cs.ucsb.edu/~seclab/projects/libanomaly/index.html, .

K. Labib and R. Vemuri. NSOM: A real-time network-based intrusion detection system using self-organizing maps. Technical report, Department of Applied Science, University of California, Davis, .

Sin Yeung Lee, Wai Lup Low, and Pei Yuen Wong. Learning fingerprints for a database intrusion detection system. In ESORICS ’: Proceedings of the th European Symposium on Research in Computer Security, pages –, London, UK, . Springer-Verlag.

Wenke Lee and Salvatore Stolfo. Data mining approaches for intrusion detection. In Proceedings of the th USENIX Security Symp., San Antonio, TX, .

Wenke Lee and Salvatore J. Stolfo. A framework for constructing features and models for intrusion detection systems. ACM Transactions on Information and System Security, ():–, .

Wenke Lee and Dong Xiang. Information-Theoretic Measures for Anomaly Detection. In Proceedings of the IEEE Symposium on Security and Privacy (S&P ), pages –, Oakland, CA, USA, May . IEEE Computer Society.

Richard Lippmann, Robert K. Cunningham, David J. Fried, Isaac Graf, Kris R. Kendall, Seth E. Webster, and Marc A. Zissman. Results of the DARPA  offline intrusion detection evaluation. In Recent Advances in Intrusion Detection, .

Richard Lippmann, Joshua W. Haines, David J. Fried, Jonathan Korba, and Kumar Das. The  DARPA off-line intrusion detection evaluation. Comput. Networks, ():–, .

R. J. A. Little and D. B. Rubin. Statistical analysis with missing data. Wiley, New York, .

Rune B. Lyngsø, Christian N. S. Pedersen, and Henrik Nielsen. Metrics and similarity measures for hidden Markov models. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages –. AAAI Press, .

F. Maggi, S. Zanero, and V. Iozzo. Seeing the invisible - forensic uses of anomaly detection and machine learning. ACM Operating Systems Review, April .

Federico Maggi and Stefano Zanero. On the use of different statistical tests for alert correlation. In Christopher Kruegel, Richard Lippmann, and Andrew Clark, editors, RAID, volume  of Lecture Notes in Computer Science, pages –. Springer, .

Federico Maggi, Matteo Matteucci, and Stefano Zanero. Detecting intrusions through system call sequence and argument analysis (preprint). IEEE Transactions on Dependable and Secure Computing, (), a.

Federico Maggi, Matteo Matteucci, and Stefano Zanero. Reducing False Positives In Anomaly Detectors Through Fuzzy Alert Aggregation. Information Fusion, b.


Federico Maggi, William Robertson, Christopher Kruegel, and Giovanni Vigna. Protecting a moving target: Addressing web application concept drift. In Engin Kirda and Davide Balzarotti, editors, RAID, Lecture Notes in Computer Science. Springer, c.

M. V. Mahoney. Network traffic anomaly detection based on packet bytes. In Proceedings of the th Annual ACM Symposium on Applied Computing, .

M. V. Mahoney and P. K. Chan. An analysis of the  DARPA/Lincoln laboratory evaluation data for network anomaly detection. In Proceedings of the th International Symp. on Recent Advances in Intrusion Detection (RAID ), pages –, Pittsburgh, PA, USA, Sept. a.

Matthew V. Mahoney and Philip K. Chan. Learning rules for anomaly detection of hostile network traffic. In Proceedings of the rd IEEE International Conference on Data Mining, page , b.

M. V. Mahoney and P. K. Chan. Detecting novel attacks by identifying anomalous network packet headers. Technical Report CS--, Florida Institute of Technology, .

John McHugh. Testing Intrusion Detection Systems: A Critique of the  and  DARPA Intrusion Detection System Evaluations as Performed By Lincoln Laboratory. ACM Transactions on Information and System Security, ():–, November .

N. Merhav, M. Gutman, and J. Ziv. On the estimation of the order of a Markov chain and universal data compression. IEEE Trans. Inform. Theory, :–, Sep .

Metasploit. MAFIA: Metasploit anti-forensics investigation arsenal. Available online at http://www.metasploit.org/research/projects/antiforensics/, .

C. C. Michael and Anup Ghosh. Simple, state-based approaches to program-based anomaly detection. ACM Transactions on Information and System Security, ():–, .


David L. Mills. Network time protocol (version ). Technical report, University of Delaware, .

Miniwatts Marketing Grp. World Internet Usage Statistics. http://www.internetworldstats.com/stats.htm, January .

Kevin Mitnick. The art of deception. Wiley, .

George M. Mohay, Alison Anderson, Byron Collie, Rodney D. McKemmish, and Olivier de Vel. Computer and Intrusion Forensics. Artech House, Inc., Norwood, MA, USA, .

Benjamin Morin, Ludovic Me, Herve Debar, and Mireille Ducasse. M2D2: A formal data model for IDS alert correlation. In Wespi et al. [], pages –.

D. Mutz, F. Valeur, C. Kruegel, and G. Vigna. Anomalous System Call Detection. ACM Transactions on Information and System Security, ():–, February .

Darren Mutz, William Robertson, Giovanni Vigna, and Richard A. Kemmerer. Exploiting execution context for the detection of anomalous system calls. In Proceedings of the International Symposium on Recent Advances in Intrusion Detection (RAID ), Gold Coast, Queensland, AU, September . Springer.

National Vulnerability Database. Exploit for CVE--. Available online at http://nvd.nist.gov/nvd.cfm?cvename=CVE-2007-1719, a.

National Vulnerability Database. Exploit for CVE--. Available online at http://nvd.nist.gov/nvd.cfm?cvename=CVE-2007-3641, b.

J. Newsome and D. Song. Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software. In Network and Distributed System Security Symposium (NDSS). Citeseer, .


Nick L. Petroni, Jr., AAron Walters, Timothy Fraser, and William A. Arbaugh. FATKit: A framework for the extraction and analysis of digital forensic data from volatile system memory. Digital Investigation, ():–, December . URL http://www.sciencedirect.com/science/article/B7CW4-4MD9G8V-1/2/7de54623f0dc5e1ec1306bfc96686085.

Peng Ning, Yun Cui, Douglas S. Reeves, and Dingbang Xu. Techniques and tools for analyzing intrusion alerts. ACM Trans. Inf. Syst. Secur., ():–, .

Ofer Shezaf and Jeremiah Grossman and Robert Auger. Web Hacking Incidents Database. http://www.xiom.com/whid-about, January .

Dirk Ourston, Sara Matzner, William Stump, and Bryan Hopkins. Applications of Hidden Markov Models to detecting multi-stage network attacks. In Proceedings of the th Annual Hawaiian International Conference on System Sciences, page , .

Wiebe R. Pestman. Mathematical Statistics: An Introduction. Walter de Gruyter, .

S. Piper, M. Davis, G. Manes, and S. Shenoi. Detecting Hidden Data in Ext2/Ext3 File Systems, volume  of IFIP International Federation for Information Processing, chapter , pages –. Springer, Boston, .

Pluf and Ripe. Advanced antiforensics self. Available online at http://www.phrack.org/issues.html?issue=63&id=11&mode=txt, .

Mark Pollitt. Computer forensics: an approach to evidence in cyberspace. In Proceedings of the National Information Systems Security Conference, volume II, pages –, Baltimore, Maryland, .

Phillip A. Porras, Martin W. Fong, and Alfonso Valdes. A mission-impact-based approach to infosec alarm correlation. In Wespi et al. [], page .

Georgios Portokalidis and Herbert Bos. SweetBait: Zero-hour worm detection and containment using low- and high-interaction honeypots. Comput. Netw., ():–, .

Georgios Portokalidis, Asia Slowinska, and Herbert Bos. Argos:an emulator for fingerprinting zero-day attacks. In Proc. ACMSIGOPS EUROSYS’, Leuven, Belgium, April .

Bruce Potter. e Shmoo Group Capture the CTF project. http://cctf.shmoo.com, .

Nicholas Puketza, Mandy Chung, Ronald A. Olsson, andBiswanath Mukherjee. A software platform for testing intrusiondetection systems. IEEE Software, ():–, . ISSN-. doi: http://dx.doi.org/./..

Nicholas J. Puketza, Kui Zhang, Mandy Chung, BiswanathMukherjee, and Ronald A. Olsson. A methodology for test-ing intrusion detection systems. IEEE Transactions on SoftwareEngineering, ():–, . ISSN -. doi:http://dx.doi.org/./..

Xinzhou Qin and Wenke Lee. Statistical causality analysis of INFOSEC alert data. In RAID, pages 73–93, 2003.

L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.

M. Ramadas. Detecting anomalous network traffic with self-organizing maps. In Recent Advances in Intrusion Detection, 6th International Symposium, RAID 2003, Pittsburgh, PA, USA, September 8–10, 2003, Proceedings, 2003.

Marcus J. Ranum. The six dumbest ideas in computer security. http://www.ranum.com/security/computer_security/editorials/dumb/, Sept. 2005.

S. Ring and E. Cole. Volatile Memory Computer Forensics to Detect Kernel Level Compromise. In Proceedings of the 6th International Conference on Information And Communications Security (ICICS 2004), Malaga, Spain, October 2004. Springer.

Ivan Ristic. ModSecurity: Open Source Web Application Firewall. http://www.modsecurity.org/.


Robert Hansen. XSS (Cross Site Scripting) Cheat Sheet. http://ha.ckers.org/xss.html.

William Robertson. webanomaly - Web Application Anomaly Detection. http://webanomaly.googlecode.com.

William Robertson, Federico Maggi, Christopher Kruegel, and Giovanni Vigna. Effective anomaly detection with scarce training data. In Annual Network & Distributed System Security Symposium (NDSS), 2010.

Martin Roesch. Snort - lightweight intrusion detection for networks. In Proceedings of USENIX LISA '99, 1999.

Martin Roesch. Snort - Lightweight Intrusion Detection for Networks (website). http://www.snort.org.

S. Hettich and S. D. Bay. KDD Cup '99 Dataset. http://kdd.ics.uci.edu/, 1999.

Benjamin Sangster. http://www.itoc.usma.edu/research/dataset/index.html, April 2009.

Benjamin Sangster, T. J. O'Connor, Thomas Cook, Robert Fanelli, Erik Dean, William J. Adams, Chris Morrell, and Gregory Conti. Toward instrumenting network warfare competitions to generate labeled datasets. In USENIX Security's Workshop. USENIX, August 2009.

Bradley Schatz. BodySnatcher: Towards reliable volatile memory acquisition by software. In Proceedings of the 7th Annual Digital Forensic Research Workshop (DFRWS '07), Digital Investigation, September 2007. URL http://www.sciencedirect.com/science/article/B7CW4-4P06CJD-3/2/081292b9131eca32c02ec2f5599bb807.

J. C. Schlimmer and R. H. Granger. Beyond incremental processing: Tracking concept drift. In Proceedings of the Fifth National Conference on Artificial Intelligence, pages 502–507, 1986.

Secunia. Secunia's 2008 annual report. Available online at http://secunia.com/gfx/Secunia2008Report.pdf, 2008.


R. Sekar, M. Bendre, D. Dhurjati, and P. Bollineni. A fast automaton-based method for detecting anomalous program behaviors. In Proceedings of the IEEE Symposium on Security and Privacy (S&P 2001), pages 144–155, Oakland, CA, USA, May 2001. IEEE Computer Society.

R. Sekar, A. Gupta, J. Frullo, T. Shanbhag, A. Tiwari, H. Yang, and S. Zhou. Specification-based anomaly detection: a new approach for detecting network intrusions. In CCS '02: Proceedings of the 9th ACM Conference on Computer and Communications Security, pages 265–274, New York, NY, USA, 2002. ACM Press.

Monirul I. Sharif, Kapil Singh, Jonathon T. Giffin, and Wenke Lee. Understanding precision in host based intrusion detection. In RAID, pages 21–41, 2007.

Edward Hance Shortliffe. Computer-based medical consultations: MYCIN. Elsevier, 1976.

Adam Singer. Social media, Web 2.0 and internet stats. Available online at http://thefuturebuzz.com/2009/01/12/social-media-web-20-internet-numbers-stats/, Jan 2009.

Sumeet Singh, Cristian Estan, George Varghese, and Stefan Savage. Automated worm fingerprinting. In OSDI '04: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation, pages 45–60, Berkeley, CA, USA, 2004. USENIX Association.

Anil Somayaji and Stephanie Forrest. Automated response using system-call delays. In Proceedings of the 9th USENIX Security Symposium, Denver, CO, Aug. 2000.

Panu J. Somervuo. Online algorithm for the self-organizing map of symbol strings. Neural Networks, 17(8-9):1231–1239, 2004.

Y. Song, S. J. Stolfo, and A. D. Keromytis. Spectrogram: A Mixture-of-Markov-Chains Model for Anomaly Detection in Web Traffic. In Proceedings of the 16th Annual Network and Distributed System Security Symposium (NDSS), 2009.


Sourcefire. Snort Rules. http://www.snort.org/snort-rules.

A. Stolcke and S. Omohundro. Hidden Markov Model Induction by Bayesian Model Merging. Advances in Neural Information Processing Systems, pages 11–18, 1993a.

A. Stolcke and S. Omohundro. Inducing Probabilistic Grammars by Bayesian Model Merging. Lecture Notes in Computer Science, pages 106–118, 1994a.

Andreas Stolcke and Stephen Omohundro. Hidden Markov Model induction by Bayesian model merging. In Advances in Neural Information Processing Systems, volume 5, pages 11–18. Morgan Kaufmann, 1993b.

Andreas Stolcke and Stephen M. Omohundro. Inducing probabilistic grammars by Bayesian model merging. In Proceedings of the 2nd International Colloquium on Grammatical Inference and Applications, pages 106–118, London, UK, 1994b. Springer-Verlag.

Andreas Stolcke and Stephen M. Omohundro. Best-first Model Merging for Hidden Markov Model Induction. Technical Report TR-94-003, ICSI, Berkeley, CA, USA, 1994c.

Brett Stone-Gross, Marco Cova, Lorenzo Cavallaro, Bob Gilbert, Martin Szydlowski, Richard Kemmerer, Christopher Kruegel, and Giovanni Vigna. Your botnet is my botnet: Analysis of a botnet takeover. In CCS '09, Chicago, IL, USA, November 2009. ACM.

Alon Swartz. The ExploitTree repository. http://www.securityforest.com/wiki/index.php/Category:ExploitTree.

G. Tandon and P. Chan. Learning rules from system call arguments and sequences for anomaly detection. In ICDM Workshop on Data Mining for Computer Security (DMSEC), 2003.

Steven J. Templeton and Karl Levitt. A requires/provides model for computer attacks. In NSPW '00: Proceedings of the 2000 Workshop on New Security Paradigms, pages 31–38, New York, NY, USA, 2000. ACM Press.


The SANS Institute. The twenty most critical internet security vulnerabilities. http://www.sans.org/top20/.

Walter N. Thurman and Mark E. Fisher. Chickens, eggs, and causality, or which came first? Am. J. of Agricultural Economics, 1988.

Dean Turner, Marc Fossi, Eric Johnson, Trevor Mack, Joseph Blackbird, Stephen Entwisle, Mo King Low, David McKinney, and Candid Wueest. Symantec Global Internet Security Threat Report – Trends for 2008. Technical Report XIV, Symantec Corporation, April 2009.

Alfonso Valdes and Keith Skinner. Probabilistic Alert Correlation. In Proceedings of the International Symposium on Recent Advances in Intrusion Detection (RAID 2001), pages 54–68, London, UK, 2001. Springer-Verlag.

F. Valeur, G. Vigna, C. Kruegel, and R. A. Kemmerer. A comprehensive approach to intrusion detection alert correlation. IEEE Transactions on Dependable and Secure Computing, 1(3):146–169, 2004.

W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer, 2002.

Giovanni Vigna and Richard A. Kemmerer. NetSTAT: A Network-based Intrusion Detection System. Journal of Computer Security, 7(1):37–71, 1999.

Giovanni Vigna, William Robertson, and Davide Balzarotti. Testing Network-based Intrusion Detection Systems Using Mutant Exploits. In Proceedings of the ACM Conference on Computer and Communications Security (CCS 2004), Washington, DC, USA, October 2004. ACM.

Jouni Viinikka, Hervé Debar, Ludovic Mé, and Renaud Séguier. Time series modeling for IDS alert management. In ASIACCS '06: Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, New York, NY, USA, 2006. ACM Press.


David Wagner and Drew Dean. Intrusion detection via static analysis. In Proceedings of the IEEE Symposium on Security and Privacy (S&P 2001), Oakland, CA, USA, 2001. IEEE Computer Society.

Ke Wang and Salvatore J. Stolfo. Anomalous payload-based network intrusion detection. In Proceedings of the International Symposium on Recent Advances in Intrusion Detection (RAID 2004). Springer-Verlag, September 2004.

Ke Wang, Gabriela Cretu, and Salvatore J. Stolfo. Anomalous Payload-based Worm Detection and Signature Generation. In Proceedings of the International Symposium on Recent Advances in Intrusion Detection (RAID 2005), Seattle, WA, USA, September 2005. Springer-Verlag.

Ke Wang, Janak J. Parekh, and Salvatore J. Stolfo. Anagram: A content anomaly detector resistant to mimicry attack. In Proceedings of the International Symposium on Recent Advances in Intrusion Detection (RAID 2006), Hamburg, Germany, September 2006. Springer-Verlag.

Zhenyuan Wang and George J. Klir. Fuzzy Measure Theory. Kluwer Academic Publishers, Norwell, MA, USA, 1992.

Robert Watson. OpenBSM. http://www.openbsm.org.

Robert N. M. Watson and Wayne Salamon. The FreeBSD audit system. In UKUUG LISA Conference, Durham, UK, Mar. 2006.

Andreas Wespi, Giovanni Vigna, and Luca Deri, editors. Recent Advances in Intrusion Detection, 5th International Symposium, RAID 2002, Zurich, Switzerland, October 16–18, 2002, Proceedings, volume 2516 of Lecture Notes in Computer Science. Springer, 2002.

R. Xu and D. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645–678, 2005.

K. Yamanishi, J.-I. Takeuchi, G. J. Williams, and P. Milne. On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. Data Mining and Knowledge Discovery, 8(3):275–300, 2004.


Dit-Yan Yeung and Yuxin Ding. Host-based intrusion detection using dynamic and static behavioral models. Pattern Recognition, 36:229–243, Jan. 2003.

Robert H’obbes’ Zakon. Hobbes’ internet timeline v.. Availableonline at http://www.zakon.org/robert/internet/timeline/, Nov.

S. Zanero. Improving self organizing map performance for network intrusion detection. In SDM 2005 Workshop on "Clustering High Dimensional Data and its Applications", 2005a.

Stefano Zanero. Behavioral intrusion detection. In Cevdet Aykanat, Tugrul Dayar, and Ibrahim Korpeoglu, editors, 19th Int'l Symp. on Computer and Information Sciences (ISCIS 2004), volume 3280 of Lecture Notes in Computer Science, pages 657–666, Kemer-Antalya, Turkey, Oct. 2004. Springer.

Stefano Zanero. Analyzing TCP traffic patterns using self organizing maps. In Fabio Roli and Sergio Vitulano, editors, Proceedings of the 13th International Conference on Image Analysis and Processing (ICIAP 2005), volume 3617 of Lecture Notes in Computer Science, Cagliari, Italy, Sept. 2005b. Springer.

Stefano Zanero and Sergio M. Savaresi. Unsupervised learning techniques for an intrusion detection system. In Proceedings of the 2004 ACM Symposium on Applied Computing, pages 412–419. ACM Press, 2004.

Index

0-day, anonymization, ANSI, AR, ARMAX, ASCII, backdoors, BIC, BMU, botnets, bsdtar, BSM, CDF, CDX, Chebyshev, CPU, CTF, DARPA, DB, DNS, DOM, DR, ELF, execve, fdformat, FN, FNR, FP, FPR, FreeBSD, FSA, FTP, GCI, GCT, HIDS, HMM, HTML, HTTP, ICD, IDEVAL, IDMEF, IDS, IP, IRC, Jaccard, JavaScript, JSON, KS, LERAD, LibAnomaly, malware, Masibty, mcweject, MDL, ML, NIDS, NTP, overfitting, PC, PDF, phishing, PHP, PID, ROC, shellcode, Silverlight, SOM, SQL, S2A2DE, SSH, SVN, SyscallAnomaly, syscalls, TCP, tcpdump, telnetd, TN, TP, TTL, undertraining, URL, userland, userspace, VPN, webanomaly, XML, XSS

List of Acronyms

AIC Akaike Information Criterion
ARMAX Auto Regressive Moving Average eXogenous
AR Auto Regressive
ARR Alert Reduction Rate
ANSI American National Standards Institute
ASCII American Standard Code for Information Interchange
BIC Bayesian Information Criterion
BMU Best Matching Unit
BSM Basic Security Module
CDF Cumulative Distribution Function
CDX Cyber Defense eXercise
CPU Central Processing Unit
CTF Capture The Flag
DARPA Defense Advanced Research Projects Agency
DB DataBase
DNS Domain Name System
DOM Document Object Model
DoS Denial of Service
DR Detection Rate
ELF Executable and Linkable Format
FN False Negative
FNR False Negative Rate
FPR False Positive Rate
FP False Positive
FSA Finite State Automaton
FTP File Transfer Protocol
GCI Granger Causality Index
GCT Granger Causality Test
HIDS Host-based Intrusion Detection System
HMM Hidden Markov Model
HTML HyperText Markup Language
HTTP HyperText Transfer Protocol
ICD Idealized Character Distribution
IDEVAL Intrusion Detection eVALuation
IDMEF Intrusion Detection Message Exchange Format
IDS Intrusion Detection System
ID Intrusion Detection
IETF Internet Engineering Task Force
ISP Internet Service Provider
IP Internet Protocol
IRC Internet Relay Chat
JSON JavaScript Object Notation
KS Kolmogorov-Smirnov
LERAD Learning Rules for Anomaly Detection
MDL Minimum Description Length
ML Maximum Likelihood
NIDS Network-based Intrusion Detection System
NTP Network Time Protocol
PC Program Counter
PDF Probability Density Function
PHP PHP Hypertext Preprocessor
PID Process IDentifier
ROC Receiver Operating Characteristic
SOM Self Organizing Map
SQL Structured Query Language
S2A2DE Syscall Sequence Arguments Anomaly Detection Engine
SSH Secure SHell
SVN SubVersioN
TCP Transmission Control Protocol
TN True Negative
TP True Positive
TTL Time To Live
ULISSE Unsupervised Learning IDS with 2-Stages Engine
URL Uniform Resource Locator
VPN Virtual Private Network
XML eXtensible Markup Language
XSS Cross-Site Scripting

List of Figures

Illustration taken from [Holz] (© IEEE; authorized license limited to Politecnico di Milano).
Generic and comprehensive deployment scenario for IDSs.
Abstract I/O model of an IDS.
The ROC space.
Abstract I/O model of an IDS with an alert correlation system.
A data flow example with both unary and binary relations.
telnetd: distribution of the number of other system calls between two execve system calls (i.e., distance between two consecutive execve).
Probabilistic tree example.
Example of Markov model.
Measured sequence probability (log) vs. sequence length, comparing the original calculation and the second variant of scaling.
The high-level structure of our prototype.
Comparison of the effect on detection of different probability scaling functions.
Sample estimated Gaussian intervals for string length.
Two different estimations of the edge frequency distribution, shown as Beta densities with their respective thresholds (left and right).
Distance computation example.
An illustration of the in-memory execution technique.
Overview of web application model construction.
The logical structure of Masibty.
Web client resource path invocation distributions from a selection of real-world web applications.
Overall procedure. Profiles, both undertrained and well-trained, are collected from a set of web applications. These profiles are processed offline to generate the global knowledge base C and index CI. C can then be queried to find similar global profiles.
Partitioning of the training set Q for various κ.
Procedure for building global knowledge base indices.
Clustering of C (a-b) and CI (c-d). Each leaf (a profile) is labeled with the parameter name and sample values observed during training. As κ increases, profiles are clustered more accurately.
Plot of profile mapping robustness for varying κ.
Global profile ROC curves for varying κ. In the presence of severe undertraining (κ ≪ κstable), the system is not able to recognize most attacks and also reports several false positives. However, as κ increases, detection accuracy improves and approaches that of the well-trained case (κ = κstable).
Relative frequency of the standard deviation of the number of forms (a) and input fields (c), along with the distributions of the expected time between changes of forms (b) and input fields (d). A non-negligible portion of the websites exhibits changes in the responses. No differences have been noticed between home pages and manually-selected pages.
Lines of code in the repositories of PhpBB, WordPress, and MovableType, over time. Counts include only the code that manipulates HTTP responses, requests and sessions.
A representation of the interaction between the client and the web application server, monitored by a learning-based anomaly detector. After request qi is processed, the corresponding response respi is intercepted and links Li and forms Fi are parsed to update the request models. This knowledge is exploited as a change detection criterion for the subsequent request qi+1.
Detection and false positive rates measured on Q and Qdrift, with HTTP response modeling enabled in (b).
Simplified version of the correlation approach proposed in [Valeur et al., 2004].
Comparison of crisp (a) and fuzzy (b) time-windows. In both graphs, one alert is fired at t = 0 and another alert occurs at t = 0.6. Using a crisp time-window and instantaneous alerts (a), the distance measurement is robust to neither delays nor erroneous settings of the time-window size. Using fuzzy-shaped functions (b) provides more robustness and captures the concept of "closeness", as implemented with the T-norm depicted in (b). Distances in time are normalized in [0, 1] (w.r.t. the origin).
Comparing two possible models of uncertainty on the timestamps of single alerts.
Non-zero-length alert: uncertainty on the measured timestamps is modeled.
A sample plot of Belexp(a) (with |a.score − Tanomaly|, FPR ∈ [0, 1]) compared to a linear scaling function, i.e., Bellin(a) (dashed line).
Hypothetical plots of (a) the global DR, DRA′, vs. ARR and (b) the global FPR, FPRA′, vs. ARR.
Plot of DRA′ (a) and FPRA′ (b) vs. ARR. "Crisp" refers to the crisp time-distance aggregation; "Fuzzy" and "Fuzzy (belief)" indicate the simple fuzzy time-distance aggregation and the use of the attack belief for alert discarding, respectively.
p-value (a) and GCI (b) vs. l (in minutes) for the first GCT experiment (w = 60.0 seconds): "NetP(k) → HostP(k)" (dashed line), "HostP(k) → NetP(k)" (solid line).
p-value (a) and GCI (b) vs. l (in minutes) for the Granger causality test experiment with w = 1800.0 seconds: "NetP(k) → HostP(k)" (dashed line), "HostP(k) → NetP(k)" (solid line).
p-value (a) and GCI (b) vs. l (in minutes) for the Granger causality test experiment with w = 3600.0 seconds: "NetP(k) → HostP(k)" (dashed line), "HostP(k) → NetP(k)" (solid line).
The optimal time lag p = l given by the AIC criterion strongly varies over time.
How delays between network and host alerts are calculated.
Histograms vs. estimated density (red dashes) and Q-Q plots, for both fO and fP.
Histograms vs. estimated density (red dashes) for both fO and fP (IDEVAL).

List of Tables

Duality between misuse- and anomaly-based intrusion detection techniques. Note that an anomaly-based IDS can detect "any" threat, under the assumption that an attack always generates a deviation in the modeled activity.
Taxonomy of the selected state-of-the-art approaches for network-based anomaly detection.
Taxonomy of the selected state-of-the-art approaches for host-based anomaly detection. Our contributions are highlighted.
Taxonomy of the selected state-of-the-art approaches for web-based anomaly detection. Our contributions are highlighted.
Taxonomy of the selected state-of-the-art approaches for alert correlation. Our contributions are highlighted.
Comparison of SyscallAnomaly (results taken from [Kruegel et al., 2003a]) and S2A2DE in terms of the number of FPs on the IDEVAL dataset. For S2A2DE, the number of system calls flagged as anomalous is reported between brackets.
A true positive and a FP on eject.
True positive on fdformat while opening a localization file.
Behavior of SyscallAnomaly with and without the Structural Inference Model.
Percentage of open syscalls and number of executions (per program) in the IDEVAL dataset.
Execution times with and without the heuristic (and, in parentheses, values obtained by performance tweaks).
Association of models to system call arguments in our prototype.
Cluster validation process.
fdformat: attack and consequences.
DRs and FPRs on two test programs, with (Y) and without (N) Markov models.
Training and detection throughput X. On the same dataset, the average processing time with S2A2DE disabled is around 0.08741084; thus, in the worst case, S2A2DE introduces a 0.1476 overhead (estimated).
Parameters used to train the IDSs. Values include the number of traces used, the amount of paths encountered and the number of paths per cycle.
Comparison of the FPR of S2A2DE vs. FSA-DF vs. Hybrid IDS. Values include the number of traces used. An accurate description of the impact of each individual model is given in the corresponding section.
Parameters used to train the IDSs. Values include the number of traces used, the amount of paths encountered and the number of paths per cycle.
Comparison of the FPR of S2A2DE vs. SOM-S2A2DE. Values include the number of traces used.
Detection performance measured in "seconds per system call". The average speed is measured in system calls per second (last column).
Experimental results with a regular shellcode and with our userland exec implementation.
Client access distribution on a real-world web application (based on the number of requests per day).
Reduction in the false positive rate due to HTTP response modeling for various types of changes.
Estimated parameters (and their estimation standard error) for the Gamma PDFs on the IDEVAL datasets.

Colophon

This document was typeset using the XeTeX typesetting system created by the Non-Roman Script Initiative and the memoir class created by Peter Wilson. The body text is set in Adobe Caslon Pro. Other fonts include Envy Code R, Optima Regular and Commodore 64 Pixeled. Most of the drawings are typeset using the TikZ/PGF packages by Till Tantau.