



LEARNING FROM SEQUENTIAL DATA FOR ANOMALY

DETECTION

A Dissertation Presented

by

Esra Nergis Yolaçan

to

The Department of Electrical and Computer Engineering

in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy

in

Electrical and Computer Engineering

in the field of

Computer Engineering

Northeastern University

Boston, Massachusetts

October 2014


© Copyright 2015 by Esra Nergis Yolaçan

All Rights Reserved


Abstract

Anomaly detection has been used in a wide range of real world problems and has

received significant attention in a number of research fields over the last decades.

Anomaly detection attempts to identify events, activities, or observations that are measurably different from an expected behavior or pattern present in a dataset. This

thesis focuses on a specific set of techniques targeting the detection of anomalous

behavior in a discrete, symbolic, and sequential dataset. Since profiling complex

sequential data is still an open problem in anomaly detection, and given that the rate

of production of sequential data in fields ranging from finance to homeland security

is exploding, there is a pressing need to develop effective detection algorithms that

can handle patterns in sequential information flows.

In this thesis, we address context-aware multi-class anomaly detection as applied

to discrete sequences and develop a context learning approach using an unsupervised

learning paradigm. We begin the anomaly detection process by applying our approach

to differentiate normal behavior classes (contexts) before attempting to model normal

behavior. This approach leads to stronger learning on each class by taking advantage

of the power of advanced models to identify normal behavior of the sequence classes.

We evaluate our discrete sequence-based anomaly detection framework using two

illustrative applications: 1) System call intrusion detection and 2) Crowd anomaly


detection. We also evaluate how clustering can guide our context-aware methodology

to positively impact the anomaly detection rate.

In this thesis, we utilize a Hidden Markov Model (HMM) to perform anomaly detection. An HMM, the simplest dynamic Bayesian network, is a Markov model that can be used when the states are not observable, but the observed data depends on these hidden states. While there has been a large amount of prior work utilizing HMMs for anomaly detection, the proposed models became overly complex when attempting to improve the detection rate while reducing the false detection rate.

We apply HMMs to perform anomaly detection on discrete sequential data. We

utilize multiple HMMs, one for each context class. We demonstrate our multi-HMM

approach to system call anomalies in cyber security and provide results in the presence

of anomalies. By applying process trace analysis with multiple HMMs, our system call anomaly detection achieves better results with better-tuned model settings and a less complex model structure.

To evaluate the extensibility of our approach, we consider a second application,

crowd behavior analytics. We attempt to classify crowd behavior and treat this as an

anomaly detection problem on sequential data. We convert crowd video data into a

discrete/symbolic sequence of data. We apply computer vision techniques to generate

features from objects, and use these features for frame-based representations to model

the behavior of the crowd in a video stream. We attempt to identify anomalous

behavior of a crowd in a scene by applying machine learning techniques to understand

what it means for a video stream to be identified as “normal”. The results of applying

our context-aware multi-HMMs approach to crowd analytics show the generality of

our anomaly detection approach, and the power of our context-learning approach.


Acknowledgements

In the name of God, the Most Gracious, the Most Merciful.


I dedicate this thesis to my beloved husband, Riza, from the depths of my heart

and soul. You have supported me through everything; I could not have accomplished this without you. Thank you for your remarkable patience and unwavering

love during this doctoral journey. To my loving parents, Nermin and Feridun. You

both have made me the person I am becoming by instilling the importance of hard work and higher education. Thank you for being my inspirations and wonderful role models. To my precious brother, Emre. You have always been there

cheering me up and standing by me through the good times and bad. Thank you for your never-ending motivation and for always believing in me. I am grateful to all four of you

for your presence, and I love you more than you will ever know. Thank you for your endless love, support, and encouragement.

I would like to express my gratitude to my advisor, Prof. Dr. David R. Kaeli,

for his guidance, understanding, and patience during the five years of my dissertation. Thank you for being so supportive, giving advice, providing persistent help, and encouraging me to complete this task. I would also like to thank my committee members, Prof. Dr. Jennifer G. Dy and Dr. Fatemeh Azmandian, for their


precious time and guidance throughout this dissertation. Your thoughtful comments

were invaluable. I thank all my dear colleagues in the NUCAR group for valuable discussions, suggestions, and, most importantly, their friendship during my studies

at Northeastern University. I would like to thank Ayse Yilmazer for being with me

and sharing her experiences during my research. Additionally, I would like to thank

our graduate coordinator Faith Crisley for her advice and assistance during the years

of my study.


Contents

Abstract ii

Acknowledgements iv

1 Introduction 1

1.1 Anomaly Detection on Sequential Data . . . . . . . . . . . . . . . . . 2

1.2 Challenges of Working with Sequential Data . . . . . . . . . . . . . . 9

1.3 Contributions of the Work . . . . . . . . . . . . . . . . . . . . . . . . 14

1.4 Organization of Dissertation . . . . . . . . . . . . . . . . . . . . . . . 16

2 Background 18

2.1 Intrusion Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.1.1 Data Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.1.2 Detection Method . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.1.3 Response Type . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.2 Crowd Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.2.1 People Counting/Density Estimation . . . . . . . . . . . . . . 24

2.2.2 People Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.2.3 Behavior Learning . . . . . . . . . . . . . . . . . . . . . . . . 26


2.3 Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.3.1 Anomaly Detection Algorithms . . . . . . . . . . . . . . . . . 29

2.3.2 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.3.3 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . 31

2.3.4 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . 34

3 Related Work 40

3.1 Related Work in System Call Analysis . . . . . . . . . . . . . . . . . 40

3.1.1 Data Representation in System Call Analysis . . . . . . . . . . 41

3.1.2 HMM in System Call Analysis . . . . . . . . . . . . . . . . . . 43

3.2 Related Work in Crowd Analysis . . . . . . . . . . . . . . . . . . . . 46

3.2.1 Data Representation in Crowd Analysis . . . . . . . . . . . . . 46

3.2.2 HMM in Crowd Analysis . . . . . . . . . . . . . . . . . . . . . 49

3.3 Related Work in Context-aware Systems . . . . . . . . . . . . . . . . 51

3.3.1 Context-aware Applications . . . . . . . . . . . . . . . . . . . 52

3.3.2 Context Inference . . . . . . . . . . . . . . . . . . . . . . . . . 54

4 Context Learning 56

4.1 Context in a Symbolic Sequential Data . . . . . . . . . . . . . . . . . 56

4.2 Clustering for Context Learning . . . . . . . . . . . . . . . . . . . . . 57

4.3 Parameter Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.3.1 Number of Clusters . . . . . . . . . . . . . . . . . . . . . . . . 60

4.3.2 Required Length (Number of Symbols) . . . . . . . . . . . . . 63

4.4 Summary of Context Learning . . . . . . . . . . . . . . . . . . . . . . 68

5 System Call Anomaly Detection 70

5.1 System Call Trace Dataset . . . . . . . . . . . . . . . . . . . . . . . . 72


5.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.2.1 Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.2.2 Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.2.3 Clustering for Context Learning . . . . . . . . . . . . . . . . . 75

5.2.4 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . 75

5.3 Behavior Learning -Training . . . . . . . . . . . . . . . . . . . . . . . 79

5.4 Test and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.5 Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.6 Summary of System Call Trace Analysis . . . . . . . . . . . . . . . . 85

6 Crowd Anomaly Detection 88

6.1 Event Recognition Video Dataset . . . . . . . . . . . . . . . . . . . . 89

6.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

6.2.1 Feature Extraction from a Video . . . . . . . . . . . . . . . . 93

6.2.2 Clustering for Context Learning . . . . . . . . . . . . . . . . . 98

6.2.3 Windowing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

6.3 Behavior Learning and Anomaly Detection . . . . . . . . . . . . . . . 103

6.4 Testing and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 104

6.5 Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

6.6 Summary of Crowd Analysis . . . . . . . . . . . . . . . . . . . . . . . 110

7 Summary and Conclusion 114

7.1 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . . . . . 115

7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

Bibliography 120


List of Figures

1.1 A general scheme of anomaly detection. . . . . . . . . . . . . . . . . . 3

1.2 Example of four different symbolic discrete sequences. . . . . . . . . . 6

1.3 An example of transitional probability-based features - two-state Markov Chains. . . . . . . . . . . 8

2.1 Hidden Markov Model . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.2 ROC curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.3 K-fold cross validation . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.1 Design of our system call anomaly detection framework. . . . . . . . . 71

5.2 BIC values for various number of Hidden States . . . . . . . . . . . . 80

5.3 ROC curve for Set 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

5.4 ROC curve for Set 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.5 ROC curve for Set 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.6 ROC curve for one HMM . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.7 Time-series plot for process traces. . . . . . . . . . . . . . . . . . . . 85

5.8 Process-based evaluation results. . . . . . . . . . . . . . . . . . . . . . 86

6.1 Design of our crowd anomaly detection framework. . . . . . . . . . . 89

6.2 Dataset-S3 Events, frames 50 and 150 (left-to-right) [52]. . . . . . . . 91


6.3 Feature extraction from a video sequence. . . . . . . . . . . . . . . . 94

6.4 Top-down view of the surveillance area and camera position. . . . . . 95

6.5 Projective transformation (Homography) result of an original image frame from PETS09 event recognition dataset. . . . . . . . . . . 96

6.6 ROC curve for the velocity-based symbolic sequences, training and testing with only a single model. . . . . . . . . . . 106

6.7 ROC curves for velocity-based symbolic sequences, training and testing on a per-context basis. . . . . . . . . . . 107

6.8 ROC curve for direction-based symbolic sequences, training and testing using only a single model. . . . . . . . . . . 108

6.9 ROC curves for direction-based symbolic sequences, training and testing on a per-context basis. . . . . . . . . . . 109

6.10 ROC curve for distance-based symbolic sequences, training and testing with only a single model. . . . . . . . . . . 110

6.11 ROC curves for distance-based symbolic sequences, training and testing on a per-context basis. . . . . . . . . . . 111

6.12 High-level design of crowd anomaly detection framework. . . . . . . . 112


List of Tables

1.1 An example of frequency vector based features . . . . . . . . . . . . . 7

1.2 An example of distance based features. . . . . . . . . . . . . . . . . . 7

1.3 An example of user command sequences . . . . . . . . . . . . . . . . 11

1.4 Unique subsequences: (normal dataset) . . . . . . . . . . . . . . . . . 12

1.5 Subsequences extracted from test sequence . . . . . . . . . . . . . . 12

1.6 Clustered sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.7 Unique subsequences from the clustered sequences: (normal dataset). 13

2.1 Categorization of Intrusion Detection Systems . . . . . . . . . . . . . 19

2.2 Video components . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.3 Confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5.1 A sample of the UNM program trace. . . . . . . . . . . . . . . . . . . 72

5.2 A sample of extracted process traces. . . . . . . . . . . . . . . . . . . 72

5.3 System call sequence for PID:552. . . . . . . . . . . . . . . . . . . . . 73

5.4 The UNM sendmail trace dataset. . . . . . . . . . . . . . . . . . . . . 74

5.5 Clustering results of UNM sendmail processes . . . . . . . . . . . . . 76

6.1 Video sequences in the PETS09 Dataset-S3:High Level. . . . . . . . . 92

6.2 Time intervals of the events in the PETS09 Dataset-S3:High Level. . 92


6.3 Non-overlapping sequence partitions generated from the PETS09 Dataset-S3: High Level. . . . . . . . . . . 100

6.4 Normal sequence partitions for velocity-based events. . . . . . . . . . 102


Chapter 1

Introduction

The goal of anomaly detection is to identify anomalous behavior, events or items based

on deviations from expected normal cases. Anomaly detection is a research area that

has been studied extensively for a range of application domains, such as computer

and network monitoring for intrusion detection, video and image processing for crowd

analytics, activity monitoring for fraud detection, DNA analysis for mutation and

disease detection, bio-surveillance for disease outbreak detection, and sensor data

analysis for fault diagnosis. We can select a particular method to address these

problems, though the same method can be applied to other domains given a similar

data representation, problem formulation and nature of the anomalies. Particular

anomaly detection processes include outlier detection, novelty detection, deviation

detection and exception mining. The processes differ based on the application domain

and the employed detection approaches [65]. In our work, we have used the term

anomaly detection to describe the process of differentiating abnormal behavior from

normal behavior in a problem-relevant dataset.


1.1 Anomaly Detection on Sequential Data

Sequential data is a valuable source of information available in many aspects of our lives, including weather prediction [126, 102], frequency analysis for phonetic

speaker recognition [22], unusual human action detection in a video [80], pattern

discovery for product placement in supermarkets [6, 56], and detection of mutations

in a gene sequence [79, 91].

Sequences can be discrete or continuous in terms of the value they take for each

uniform time interval. A continuous sequence, also known as a time series, is a sequence

of data points which are obtained by measuring a variable at discrete time points [29].

A data point in a continuous sequence may take on any value within a certain range

for the measured variable, such as the height of a tree year after year or the daily air

temperature of a city. To perform a learning task in continuous sequences, researchers

may prefer to apply dimensionality reduction and discretization methods to obtain a

discrete representation [77, 90, 56].

A discrete sequence is an ordered series of symbols, which can be characters, numbers, or words [147]. A data point in a discrete sequence may only take values from a finite alphabet; for example, a gene is a sequence of DNA nucleotides, and a program execution trace is a sequence of system calls. In symbolic sequences, the value of a data point is typically not meaningful individually, but it provides valuable information when considered with the other symbols in the sequence.

In this thesis, we demonstrate our approach using discrete (symbolic) sequences to

perform the anomaly detection task in two different application domains: 1) intrusion

detection for cyber security and 2) crowd behavior detection for physical security. In

anomaly detection-based intrusion detection, we use a sequence of system calls as

instances in a discrete (symbolic) sequence to detect abnormal program behavior.


In crowd anomaly detection, we extract features frame-by-frame from a crowd video

dataset and use those features as a discrete sequence of symbols to detect crowd

behavior.

We present a general scheme of the anomaly detection process in Figure 1.1. The

nature of the anomaly detection process requires a well-defined profile to learn the

normal behavior. To address this issue, machine learning [134], data mining [48]

and statistical methods [113] are some of the techniques used to profile the normal

behavior during anomaly detection. The extracted profile can be anything that can

distinguish normal and abnormal behaviors, such as pattern sets, rule sets, probability distributions, and statistical models.

A general definition for describing the anomaly detection task on a discrete sequence of symbols can be stated as:

Definition 1.1 Given a single sequence S = {s1, s2, s3, ..., sn}, where each si is a symbol from a finite alphabet Σ, n is the number of symbols in S, and i = 1, 2, ..., n; anomaly detection is the task of deciding whether S is normal or abnormal with respect to learned normal behavior.

Anomaly detection techniques generally compare a score against a threshold value to decide whether to raise an alarm. Typically, there are two different approaches

Figure 1.1: A general scheme of anomaly detection.


for the evaluation of anomalies in a discrete sequence of data. The first approach is

based on assigning an anomaly score to the entire sequence. If the anomaly score

is higher than a predefined threshold, then the sequence is labeled as abnormal [64].

In this approach, normal sequences are expected to have a lower anomaly score than

the ones that include an anomaly. An anomaly detection technique in this category

needs to apply normalization, compensating for the sequence length to provide a

fair evaluation. Although this normalization approach eliminates the impact of False

Positives (FP) in a normal sequence, it may also eliminate the True Positives (TP)

in an abnormal sequence. For example, if an abnormal sequence is too long, the

anomaly score may never reach the threshold value. Therefore, the success of this

kind of evaluation depends on the density of abnormal events in the entire sequence, and it is difficult to determine where the anomaly starts in a sequence of data.
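The whole-sequence approach can be sketched as follows. This is a minimal illustration rather than the framework developed in later chapters; the per-transition scoring function is a hypothetical stand-in for a learned normal-behavior model.

```python
def sequence_anomaly_score(sequence, transition_score, normalize=True):
    """Assign an anomaly score to an entire sequence by summing a
    per-transition score; transition_score is a hypothetical function
    supplied by a learned normal-behavior model."""
    total = sum(transition_score(cur, nxt)
                for cur, nxt in zip(sequence, sequence[1:]))
    if normalize:
        # length normalization gives a fair comparison across sequence
        # lengths, but can dilute a short anomalous region in a long trace
        total /= max(len(sequence) - 1, 1)
    return total

# toy model: only the transition A->A is considered anomalous
toy_score = lambda cur, nxt: 1.0 if (cur, nxt) == ("A", "A") else 0.0

short_trace = "AAB"             # one anomalous transition out of 2
long_trace = "AAB" + "AB" * 20  # the same anomaly diluted among 42
```

With a threshold of, say, 0.1, the short trace is flagged while the long trace slips under it, illustrating why the success of whole-sequence scoring depends on the density of abnormal events in the sequence.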

The second approach is based on performing an evaluation on regions of the se-

quence to compute an anomaly score [63]. In this approach, abnormal portions of

a sequence can be detected when the anomaly score of the evaluated region reaches

some predefined threshold. If desired, the anomaly score for the entire sequence can

be obtained by combining all anomaly scores assigned to the regions. Three of the

advantages of this approach (which also motivate our work) are listed as follows:

• First, since region-based analysis simplifies the data, it allows a wider array of

techniques to be applied.

• Second, since this approach works on only a small portion of the data, it enables

us to detect local anomalies which would be missed in the first approach.

• Third, since this analysis approach computes anomaly scores in regions, it per-

forms anomaly detection task without needing to examine the entire sequence.


Typically, anomaly detection is a time-sensitive task, especially when it is applied for security purposes. When an anomaly is detected, we need to terminate it, eliminate or minimize its effects, and investigate its cause. Region-based evaluation that works on a partition of the sequence is well-suited for real-time anomaly detection applications, but detection still needs to be implemented using efficient algorithms.

In this thesis, we present an approach that performs a region-based evaluation

by using subsequences generated from discrete sequences via a fixed-size windowing

technique. We use the term sequence to refer to the entire symbol sequence in a sequence dataset and the term subsequence to refer to a shorter sequence comprising consecutive symbols.
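The fixed-size windowing step can be sketched as below; the window width and step are illustrative parameters, not the values tuned in later chapters.

```python
def fixed_windows(sequence, width, step=1):
    """Generate fixed-size subsequences of consecutive symbols from a
    discrete sequence. step=1 yields overlapping (sliding) windows;
    step=width yields non-overlapping partitions."""
    return [sequence[i:i + width]
            for i in range(0, len(sequence) - width + 1, step)]

# each subsequence becomes a unit for region-based evaluation
windows = fixed_windows("ABBAB", width=3)  # ['ABB', 'BBA', 'BAB']
```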

The aim of our work is to provide a generic and effective anomaly detection

framework for discrete sequences:

• The technique should be generic because it must be applicable to discrete sequence

data collected from various domains. We evaluate our discrete sequence-based

anomaly detection framework by using two illustrative application problems:

1) Intrusion detection and 2) Crowd anomaly detection.

• The technique should be effective because a sequence-based anomaly detection

system must learn normal behavior from the given sequences in the dataset in order to detect anomalies in the same manner as a human expert would.

The effectiveness of an anomaly detection process relies on how well the model is

designed. Therefore, the main challenge in anomaly detection is extracting beneficial

information from the given sequences. To begin to pursue the anomaly detection


problem by analyzing the ordered sequence of symbols, we need to use methods

specifically designed for this purpose.

Figure 1.2: Example of four different symbolic discrete sequences.

To highlight the importance of the methods selected, we start with an illustrative

example. In Figure 1.2, we have four equal-length discrete sequences generated from a two-symbol alphabet Σ = {A, B}. Even though it seems obvious that each sequence has different characteristics, it is essential to select an effective method to differentiate between these sequences. In this example, we contrast frequency vector-based, distance-based, and relation-based features to differentiate the sequences.

For vector-based feature extraction, we count the number of occurrences of each sym-

bol in the sequence. For distance-based feature extraction, we use a basic similarity

measure to compute the distance between the given sequences. For relation-based feature extraction, we use the transition information between the symbols.

First, a frequency-based feature vector is calculated and presented for each sequence in Table 1.1. Although the orders of symbols in the given sequences differ from each other, the extracted feature vectors are identical, and thus we would


be unable to differentiate between these sequences.

Table 1.1: An example of frequency vector based features

Sequences A B

Sequence 1 8 8

Sequence 2 8 8

Sequence 3 8 8

Sequence 4 8 8
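Since Figure 1.2 is not reproduced here, the snippet below uses four illustrative stand-in sequences over Σ = {A, B}, each containing eight of each symbol, to show why frequency vectors fail to separate such sequences.

```python
from collections import Counter

# hypothetical sequences in the spirit of Figure 1.2: identical symbol
# counts, clearly different orderings
sequences = [
    "ABABABABABABABAB",
    "AABBAABBAABBAABB",
    "AAAABBBBAAAABBBB",
    "AAAAAAAABBBBBBBB",
]

# frequency-based feature vector: (count of A, count of B)
vectors = [(Counter(s)["A"], Counter(s)["B"]) for s in sequences]
# every vector is (8, 8): frequency features cannot distinguish
# the four sequences, exactly as in Table 1.1
```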

Second, we use the Hamming distance to compute the distances between a sequence

and the other sequences. Hamming distance is a simple distance measure that counts

the number of positions where the symbols mismatch. In Table 1.2, each column shows the distances between one sequence and the other sequences. In this example, each sequence has the same distance to each of the others. While there are more sophisticated and domain-specific similarity/distance measures which can be used to evaluate

discrete sequences, they do not consider the transitional probabilities between the

symbols in a sequence. Rather than extracting behavioral information, they are better suited to computing a similarity or a distance between two sequences.

Table 1.2: An example of distance based features.

Sequences Sequence 1 Sequence 2 Sequence 3 Sequence 4

Sequence 1 0 8 8 8

Sequence 2 8 0 8 8

Sequence 3 8 8 0 8

Sequence 4 8 8 8 0
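The Hamming distance used for Table 1.2 is straightforward to compute; the sequences in the usage line below are hypothetical examples, not the ones in Figure 1.2.

```python
def hamming(s, t):
    """Count the positions at which two equal-length symbol
    sequences disagree."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))

d = hamming("ABABABAB", "AABBAABB")  # 4 mismatching positions
```

As the text observes, the measure compares positions independently and is therefore blind to the transition structure of the sequences.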


Figure 1.3: An example of transitional probability-based features - two-state Markov Chains.

Third, we use the relationship between symbols. To extract behavioral information

from a sequence, we need to consider the ordering of symbols in a sequence. A

fundamental way to perform behavior extraction is to calculate transitional probabilities between the symbols in a sequence. To model relational features based on the transitions between symbols, we use a Markov chain. We use a first-order Markov process, the simplest Markov model, to represent the transitional probabilities from one state (symbol) to another [20]. In Figure 1.3, we show the two-state Markov chain used to compute the probability of the next symbol, which depends only on the current symbol. Using the ordering information present in a symbolic sequence, we will be


able to differentiate between sequences and learn the system behavior.
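Estimating first-order transition probabilities from a symbolic sequence can be sketched as follows; the two example sequences are illustrative, with identical symbol frequencies but different transition structure.

```python
from collections import defaultdict

def transition_probabilities(sequence):
    """Estimate first-order Markov transition probabilities
    P(next symbol | current symbol) from observed transitions."""
    counts = defaultdict(lambda: defaultdict(int))
    for cur, nxt in zip(sequence, sequence[1:]):
        counts[cur][nxt] += 1
    return {cur: {nxt: n / sum(nxts.values()) for nxt, n in nxts.items()}
            for cur, nxts in counts.items()}

# same (8, 8) symbol frequencies, different transition behavior
p = transition_probabilities("ABABABABABABABAB")  # P(B|A) = 1.0
q = transition_probabilities("AABBAABBAABBAABB")  # P(A|A) = 0.5
```

Unlike the frequency vectors and Hamming distances above, these transition features separate sequences that differ only in symbol ordering.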

A Hidden Markov Model (HMM), the simplest dynamic Bayesian network, is a

Markov model which can be used when states in a process are not observable, but

observed data is dependent on these hidden states. HMMs rely on two properties:

1) the observation at time t was generated by some process whose state Ht is hid-

den from the observer, and 2) the state of the hidden process satisfies the Markov

property. In this thesis, we apply HMMs to generate the normal behavior model of

subsequences for two different application domains: cyber-security intrusion detec-

tion and crowd anomaly detection. Since we collect only symbolic sequences to detect

anomalies, the states that generate these symbols are hidden in our case. An HMM

is well-suited to our problem under these circumstances, because the sequential data

we use in these domains can be considered as observation data which is used to learn

and detect actual underlying program and crowd behaviors.
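For concreteness, the likelihood of an observation sequence under an HMM can be computed with the forward algorithm. The sketch below is a minimal NumPy implementation with per-step rescaling; the model parameters in the usage example are hypothetical, not those learned for the application domains above.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood of a symbol sequence under an HMM, computed with
    the forward algorithm and per-step rescaling for numerical stability.
    obs: list of symbol indices
    pi:  initial state distribution, shape (n_states,)
    A:   state transition matrix,    shape (n_states, n_states)
    B:   emission matrix,            shape (n_states, n_symbols)
    """
    alpha = pi * B[:, obs[0]]          # joint prob. of state and first symbol
    log_lik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate states, weight by emission
        log_lik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return log_lik

# hypothetical 2-state, 2-symbol model; a low log-likelihood relative
# to a threshold would flag a subsequence as anomalous
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.8, 0.2], [0.2, 0.8]])
ll = forward_log_likelihood([0, 0, 1, 0], pi, A, B)
```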

1.2 Challenges of Working with Sequential Data

Next, we focus on the challenges of using discrete (symbolic) sequential data for

anomaly detection. Although there are many algorithms and approaches to use in

the anomaly detection task, they are often not directly applicable to working with

sequential data, since they accept input data only as a vector of features. It is possible

to generate a feature vector for sequence data instances, but this is undesirable for

two reasons: First, when we consider a feature vector extraction by transforming

the data (e.g., extracting the frequency distribution of the symbols in a sequence),

the dimension of the vector is dependent on the alphabet size, and this can increase

the computation cost. Second, sequential data includes transition-based features for


the sequence. When we transform the data into a vector of features, it is no longer

possible to take advantage of the probabilistic structure of sequences. To extract the

behavior of a data sequence, we need to apply machine learning techniques without

losing the transitional features in the symbol sequences.
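A small example makes this loss concrete: two toy sequences with entirely different transition structure can yield identical frequency vectors:

```python
from collections import Counter

a = "AAABBB"   # one run of A's, then B's
b = "ABABAB"   # strictly alternating
# Identical frequency vectors...
print(Counter(a) == Counter(b))                    # True
# ...but entirely different transition structure.
print(set(zip(a, a[1:])) == set(zip(b, b[1:])))    # False
```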

Another challenge encountered when using sequential data can be seen when com-

puting a similarity or dissimilarity measure between two sequences. Although there

are various similarity/dissimilarity metrics in the literature, most of them are not

applicable directly to sequential data or cannot satisfy the mandatory condition of

providing a real distance metric [61]. Also, there is typically a need to perform preprocessing before computing a similarity/dissimilarity measure, since the sequences in a dataset almost always differ in length, or are too long for such measures to be computed directly.
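For instance, edit (Levenshtein) distance is a true distance metric on discrete sequences, but its quadratic cost and its sensitivity to length differences typically force preprocessing such as truncation. A generic sketch (not the measure ultimately chosen in this thesis):

```python
def levenshtein(s, t):
    """Edit distance between two sequences (a true distance metric)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

# Sequences of different lengths can be truncated to a common prefix length
# before comparison (one simple preprocessing choice among many):
l = 8
d = levenshtein("BACBABCB"[:l], "AABBAABBAABBAACCBB"[:l])
print(d)
```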

From the dataset perspective, the labeling process is often challenging, even for

vector instances. When we consider sequential data, it is even more difficult to label

abnormal segments in the sequence, since the distinction between normal and abnormal data is imprecise, and abnormal segments could be interspersed throughout the whole sequence.

All of these issues make sequential data analysis a complex and challenging pro-

cess. In this thesis, we address an additional challenge when working with sequential

data, one which affects our ability to detect anomalies in discrete sequences. In previous work on

anomaly detection with sequential data, researchers typically assumed that there is

only a single normal behavior if the data is from a single data source [32]. But in

our analysis, we found that data sequences can include multiple normal behaviors.

For example, a sequence of computer user commands would be used to learn normal

behavior of a user and the learned behavior would be used later for anomaly detec-

tion [31]. The computer user would have different tasks to complete on different days,


therefore the distribution of user commands would change according to the schedule of

the work that needs to be done on each successive day. In this case, learning a normal

behavior for each similar work day should lead to improved detection accuracy versus

learning only a single normal behavior for all days.

We present an example to illustrate this problem. Let's assume that we have some normal discrete sequences generated from an alphabet of user commands (Σ = {A, B, C})

as displayed in Table 1.3. In order to generate a dictionary-based model (i.e., a set of

patterns) from these normal sequences as shown in Table 1.4, we extract subsequences

using a sliding window of length 3.

Table 1.3: An example of user command sequences

1. BACBABCB

2. ABACBABCBABACBABCBA

3. CBABABCBAC

4. ACBABACBABABABCBA

5. AABBAABBAABBAACCBB

6. BBCCAACCAABBAA

7. CCAABBAACCAABBAACC

...

n. ...

Since we extract normal subsequences, we can detect anomalies in a test sequence

by following the same preprocessing steps and then comparing the test subsequences

with the normal data dictionary. For a given test sequence ‘CCBBAABBAABBAACCBBACBAC’, the generated subsequences are presented in Table 1.5. If we

compare the subsequences with the dictionary, we will find that there are no abnormal


Table 1.4: Unique subsequences: (normal dataset)

ABA, BAC, ACB, CBA, BAB, ABC, BCB,

AAB, ABB, BBA, BAA, AAC, ACC,

CCB, CBB, BBC, BCC, CCA, CAA

subsequences, since all of the subsequences appear in the normal dataset.

Table 1.5: Subsequences extracted from test sequence

CCB, CBB, BBA, BAA, AAB, ABB, BBA,

BAA, AAB, ABB, BBA, BAA, AAC, ACC,

CCB, CBB, BBA, BAC, ACB, CBA, BAC
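The dictionary-based procedure can be sketched as follows, using the sequences of Table 1.3 and the test sequence above; as the text notes, a single global dictionary flags nothing:

```python
def windows(seq, w=3):
    """All overlapping subsequences of length w (sliding window)."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

normal = ["BACBABCB", "ABACBABCBABACBABCBA", "CBABABCBAC",
          "ACBABACBABABABCBA", "AABBAABBAABBAACCBB",
          "BBCCAACCAABBAA", "CCAABBAACCAABBAACC"]

# Dictionary-based model: every length-3 pattern seen in the normal data.
dictionary = {p for seq in normal for p in windows(seq)}

test = "CCBBAABBAABBAACCBBACBAC"
anomalous = [p for p in windows(test) if p not in dictionary]
print(anomalous)   # [] -- every test pattern occurs somewhere in normal data
```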

If we examine the normal sequences in detail, we can see the differences between them and the test sequence: the given test sequence starts with the repeated-symbol structure, but includes the other structure as well. This kind of anomaly can be detected only if the sequences are clustered to

train a model for each cluster. The results of clustering and the dictionary for each

cluster are presented in Tables 1.6 and 1.7, respectively.

If we classify the test sequence in order to evaluate its subsequences against the dictionary of the relevant cluster, it falls into Cluster 2, since it is more similar to that cluster than to the other. Because the test sequence is classified into the second cluster, we can evaluate it using the normal dictionary patterns from Cluster 2. The subsequences BAC, ACB, CBA, and BAC in the test sequence do not appear in Cluster 2's normal pattern set; therefore, we can detect those subsequences, since they are not expected under this model.
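A sketch of the cluster-aware variant, using the clusters of Table 1.6; classifying the test sequence by dictionary coverage (an illustrative choice of classifier) exposes the four unexpected subsequences:

```python
def windows(seq, w=3):
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

clusters = {
    1: ["BACBABCB", "ABACBABCBABACBABCBA",
        "CBABABCBAC", "ACBABACBABABABCBA"],
    2: ["AABBAABBAABBAACCBB", "BBCCAACCAABBAA", "CCAABBAACCAABBAACC"],
}
# One pattern dictionary per cluster (as in Table 1.7).
dicts = {c: {p for s in seqs for p in windows(s)}
         for c, seqs in clusters.items()}

test = "CCBBAABBAABBAACCBBACBAC"
# Classify: pick the cluster whose dictionary covers the most test patterns.
best = max(dicts, key=lambda c: sum(p in dicts[c] for p in windows(test)))
anomalous = [p for p in windows(test) if p not in dicts[best]]
print(best, anomalous)   # 2 ['BAC', 'ACB', 'CBA', 'BAC']
```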


Table 1.6: Clustered sequences

Cluster 1 Cluster 2

BACBABCB AABBAABBAABBAACCBB

ABACBABCBABACBABCBA BBCCAACCAABBAA

CBABABCBAC CCAABBAACCAABBAACC

ACBABACBABABABCBA ...

...

Table 1.7: Unique subsequences from the clustered sequences: (normal dataset).

Cluster 1: ABA, BAC, ACB, CBA, BAB, ABC, BCB
Cluster 2: AAB, ABB, BBA, BAA, AAC, ACC, CCB, CBB, BBC, BCC, CCA, CAA

We presented a simple example to explain the motivation of our context learn-

ing (clustering) approach. Even when there are only slight differences between the sequences collected from real data sources, a dedicated clustering technique may be used to categorize the sequences.

Note that the given problem is different from typical context-based anomaly detection in two ways: 1) defining a context is not straightforward, since we work with discrete sequences that include only the symbols, and there is no other context

attribute, and 2) a context-based anomaly detection on sequential data generally

depends on surprise detection with respect to the previous data in the same sequence.

On the other hand, our contextual anomaly detection approach takes into account

other similar sequences in the same category, instead of relying solely on the behavior

of the current sequence.


In this thesis, we address the context-based multi-class anomaly detection prob-

lem on discrete sequences. We accomplish this by applying a clustering approach to

differentiate normal behavior classes (contexts) before implementing an anomaly de-

tection task. This approach provides a better learning model for each class by taking

advantage of the power of HMMs to capture the normal behavior of the sequence

classes.

This thesis enhances the current state-of-the-art and makes key contributions to

the following areas:

• Machine learning/ Data mining approach: Devising a novel, effective and generic

context-aware anomaly detection framework for discrete sequential data by

adapting sophisticated machine learning algorithms.

• Application domains: Applying the presented anomaly detection approach to

address two different application domains, each of which is critical for security.

Next, we discuss the major contributions of this work and describe the organiza-

tion of the remainder of the thesis.

1.3 Contributions of the Work

The thesis consists of two parts: Part one focuses on the basic research issues arising

in sequential data anomaly detection and presents a generic framework for anomaly

detection on discrete sequences. Part two evaluates the proposed sequential anomaly

detection approach to build real applications in multiple domains: cyber-security and

physical security.

The contributions of the thesis are:


We address potential issues in behavioral learning where the data is sequential.

Most existing classification-based anomaly detection techniques cannot be applied directly to sequence-based anomaly detection because of several limitations. In

this thesis, we investigate the challenges of working with sequential data to understand

these limitations and to provide an anomaly detection method better suited for this

task.

We present a novel framework for context-based sequence anomaly detection, where

defining a context is not a straightforward task. Typically, a context-aware analysis

should include a context attribute to differentiate contexts and to learn a behavior

for each context. For example, the daily temperature of an area over the last few years could

be analyzed in the context of seasons or months. In our case, discrete sequences do

not include any defining attributes to identify a context in a straightforward fashion.

We fully explore the idea of context learning for incomplete discrete sequences

where the data is in streaming form. Most existing sequence clustering techniques fail

to account for incomplete sequences, and are limited because of length and alignment

issues. We provide a fundamental methodology for context learning by clustering sequences based on the similarity of their first l symbols, assuming that sequences generated from a common distribution will start with similar symbol orderings. Here, l is a parameter that selects the subsequence length for the similarity computation, and it will differ according to the sequence domain. We should select an l long enough to capture similarities between sequences, yet small enough to permit online clustering for context learning on streaming data.
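A minimal sketch of this idea (the Jaccard-style similarity, threshold, and greedy assignment below are illustrative choices, not the exact method of Chapter 4), clustering a few of the example sequences of Table 1.3 by the length-3 patterns in their first l symbols:

```python
def prefix_similarity(s, t, l=14):
    """Jaccard similarity of length-3 patterns within the first l symbols."""
    ws = lambda x: {x[i:i + 3] for i in range(min(l, len(x)) - 2)}
    a, b = ws(s), ws(t)
    return len(a & b) / max(len(a | b), 1)

def cluster(stream, threshold=0.2, l=14):
    """Greedy online clustering: assign each arriving sequence to the first
    sufficiently similar representative, otherwise open a new cluster."""
    reps, labels = [], []
    for seq in stream:
        for k, rep in enumerate(reps):
            if prefix_similarity(seq, rep, l) >= threshold:
                labels.append(k)
                break
        else:
            reps.append(seq)
            labels.append(len(reps) - 1)
    return labels

seqs = ["BACBABCB", "ABACBABCBABACBABCBA",
        "AABBAABBAABBAACCBB", "BBCCAACCAABBAA"]
print(cluster(seqs))   # [0, 0, 1, 1] -- the two contexts are recovered
```

Because only a prefix of each sequence is inspected, incomplete (streaming) sequences can be assigned to a context as soon as their first l symbols arrive.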

We develop an anomaly detection-based intrusion detection method where the data

is symbolic sequences of system calls gathered from program execution traces. We

evaluated our context-based anomaly detection approach by experimenting on a well-


known benchmark dataset taken from system call traces of Unix privileged programs.

This work has been published in the proceedings of the 2014 Software Security and

Reliability Conference (SERE) [155].

We illustrate the generality of our approach by applying it to the problem of crowd

anomaly detection. We review competing techniques and approaches for crowd anal-

ysis. Our proposed method learns crowd behavior from symbolic sequences in order

to detect abnormal crowd activities. To obtain a symbolic sequence from a video, we

represent global events occurring in a scene by using symbolic data for each video

frame.

1.4 Organization of Dissertation

In this chapter, we introduce the problem of anomaly detection on discrete/symbolic

sequential data. We discuss the challenges of working on sequential data and also

provide the motivation behind using HMMs for system call and crowd event anomaly

detection. The rest of the dissertation is organized as follows:

In Chapter 2, we provide background material to this thesis. This includes a

survey of intrusion detection and crowd analysis methods, as well as in-depth expla-

nation of the techniques that have been proposed in prior work. We also provide a

background of the anomaly detection algorithms and evaluation techniques used in

this dissertation.

In Chapter 3, we discuss the related research in system call analysis, crowd analysis

and context-aware systems.

In Chapter 4, we present an automated context learning approach. We detail the

design challenges and solutions.


In Chapter 5, we describe the presented framework for system call anomaly

detection-based intrusion detection. We evaluate the accuracy of our implementa-

tion using a commonly used system call dataset.

In Chapter 6, we describe the presented framework for crowd anomaly detection.

We evaluate the accuracy of our implementation using a public benchmark video

dataset.

In Chapter 7, we conclude the thesis and summarize our work. We also suggest

possible topics for future research.


Chapter 2

Background

In this chapter, we present background information on concepts used throughout this

thesis, including intrusion detection, crowd analysis, anomaly detection algorithms

and selected machine learning algorithms.

2.1 Intrusion Detection

Intrusion is defined as “any set of actions that attempt to compromise the integrity,

confidentiality, or availability of a resource” [60]. Intrusion detection is the process of detecting these actions by monitoring and analyzing systems [157]. An intrusion detection system (IDS) is a hardware or software device developed to perform the intrusion detection process [123].

As listed in Table 2.1, IDSs can be classified from several perspectives, such as

the time of analysis, the structure for the detector, the source of the data, the data

analysis method, or the response to the detection [38, 39, 11].

Based on time of analysis, an IDS can be categorized as online (real-time) or offline, depending on its data monitoring and evaluation methods. While online IDSs


provide real-time detection by analyzing a live data feed, offline IDSs evaluate stored data periodically. Another dimension for differentiating IDSs is their structure. Based on the data collection and preprocessing environment, the structure criterion divides IDSs into two classes: centralized and distributed.

Other categorization criteria given as data, analysis and response in Table 2.1 are

also defined as the functional components of IDSs in [12]. These components are

explained in the next subsections.

Table 2.1: Categorization of Intrusion Detection Systems

Time: Online / Offline
Structure: Centralized / Distributed
Data: Host / Network
Analysis: Misuse / Anomaly
Response: Active / Passive

2.1.1 Data Source

IDSs can be classified into two main categories depending on data sources and target

system used: 1) host-based and 2) network-based IDSs.

Host-based intrusion detection system:

A host-based intrusion detection system (HIDS) involves monitoring the host system

activities to see whether a known attack behavior or other unusual behavior occurs. The

main role of host-based detection is protecting the host system by analyzing the

monitored data.


Network-based intrusion detection system:

A network-based intrusion detection system (NIDS) analyzes network-related system activities to detect intrusions such as denial of service (DoS), port scans, remote unauthorized access, etc.

2.1.2 Detection Method

In order to detect an intrusion, there are two different IDS modeling approaches:

1) misuse and 2) anomaly detection.

Misuse detection:

In misuse detection, the IDS is profiled by using the previously known intrusion

behaviors. This profile is called the signature of the attack. These signatures can be

extracted from any data source which can be used to extract an intrusion pattern,

such as user commands, system calls, audit events, network packets, keystrokes, etc.

Misuse detection is based on looking for the known intrusion patterns in the target

data. If any match appears, the system raises an alarm for the detected intrusion. This sometimes allows the detection of different versions of similar kinds of attacks, provided the extracted intrusion profile generalizes well enough to match the new versions. The main advantage of signature-based detection approaches is the ability to detect known intrusions, but they are not capable of detecting novel attacks [89].

Anomaly detection:

In anomaly detection, the IDS is profiled by using the previously known normal be-

havior. Again, profiles are extracted from any data source which can define a normal

behavior. Unlike misuse detection, anomaly detection looks for the mismatches or


deviations from a normal profile. According to predefined thresholds, mismatching

behaviors raise an alarm for an anomaly. The main advantage of this approach is the

ability to detect new types of intrusions, which may result in a deviation from the

normal profile. The main drawback of this approach is the high rate of false positives

which can be caused by undefined normal behavior or by noise that is not related to an attack.

2.1.3 Response Type

In general, an IDS has two different responses: 1) active and 2) passive response.

Active response:

An active response of an IDS means providing an automatic response once an attack is

detected. The idea behind an active response is to protect the target system from further damage and to minimize the effect of the intrusion [24]. These active responses

can be explained by three main categories: collecting additional information to resolve

the attack type and effects, changing the environment by blocking or reconfiguring

the system components, and taking action against the intruder [12]. An IDS with an

active response mechanism needs a real-time detection approach, since the time gap between the start of an attack and its detection leaves the system vulnerable to

exploitation during that period. A major issue of this approach is the possibility of

an inappropriate response, such as blocking normal traffic, because of an incorrect

implementation or configuration [158].


Passive response:

A passive response of an IDS is based on producing an alarm for a detected attack.

This kind of response aims to inform the system user or an administrator rather than

taking actions automatically to prevent an attack [156]. The alarm may come with a

report that includes system logs, potential vulnerabilities and attack types to allow

the administrator to perform a further investigation. A major issue of this approach

is the delay between the intrusion and the human response for critical systems [125].

2.2 Crowd Analysis

Although there is no single definition of a crowd since it changes depending on the

components (such as size, density, time, etc.) that characterize the crowd, a general

definition is given by Chandella et al. [25] as a “sizable number of people gathered at

a specific location for a measurable time period, with common goals and displaying

common behaviors”. In the computer vision domain, this definition can be summa-

rized as a group of individuals in a scene.

Crowd Analysis refers to a collection of methods that focus on studying crowd

related problems. In the last decade, with the growing interest in surveillance of

crowded scenes to provide for increased public safety and the decreasing cost of video

equipments, crowd analysis has become a popular strategy to understand behaviors of

crowd environment automatically. In most previous work, people and their activities

are the main targets for learning, particularly in public places. Crowd analysis can be

used for various kinds of purposes, such as crowd management, public space design,

virtual environments, visual surveillance, and intelligent environments [159].

There are various classes of video components which impact the selection of


method and approach taken to perform crowd analysis [23, 159]. We divide the

video components into three categories: 1) target, 2) environment and 3) sensor-

based components as presented in Table 2.2 (the video components in our case are

shown in italic).

Table 2.2: Video components

Target Based

Number Single Multiple

Density Sparse Dense

Rigidity Rigid Non-Rigid

Occlusions Low High

Environment Based

Background Moving Static

Space Indoor Outdoor

Light Day Night

Sensor Based

Number Single Multiple

Platform Moving Static

Video type Color Gray

Resolution Low High

The main challenge in crowd analysis is working with multiple targets, since com-

puter vision techniques for object analysis are not directly applicable to multiple

objects. Besides this, the density of a crowd scene is another important factor while

choosing a method to analyze the scene [40]. Crowd density can range from sparse

to very dense, depending on the target surveillance data.


Typically, crowd analysis can be categorized into three main areas: 1) people counting, 2) tracking, and 3) behavior understanding [71].

2.2.1 People Counting/Density Estimation

Crowd density estimation and people counting are two fundamental problems for

managing and planning purposes. The challenge of crowd density estimation and

people counting arises when people exhibit limited motion, such as gathering and

waiting [161]. Solutions to these two problems can also be used for object detection

as a feature extraction method for further crowd analysis, such as tracking and be-

havior learning. There are various techniques presented in the literature that can be

categorized into three main areas:

Pixel-based analysis:

Pixel-based analysis mostly relies on background subtraction methods. Background

subtraction methods are based on modeling a background to differentiate the foreground

objects in each incoming frame by analyzing the frame pixel-by-pixel. Although there

are various techniques for background subtraction, applying an approximate median filter

provides better results for density estimation in a crowded scene [35].
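The approximate median filter admits a very compact update rule: move each background pixel one step toward the current frame, so that the background converges to the pixel-wise temporal median. A pure-NumPy sketch on a toy one-dimensional 'scene' (the frame values and threshold are arbitrary):

```python
import numpy as np

def update_background(bg, frame, step=1.0):
    """Approximate median filter: nudge each background pixel one step
    toward the current frame; over time bg converges to the pixel-wise
    temporal median of the video."""
    return bg + step * np.sign(frame - bg)

# Toy 1-D 'scene': static value 100, with an object (value 200) passing
# over pixel 2 for a few frames.
bg = np.zeros(5)
for t in range(300):
    frame = np.full(5, 100.0)
    if 50 <= t < 60:
        frame[2] = 200.0
    bg = update_background(bg, frame)

foreground = np.abs(np.full(5, 100.0) - bg) > 30   # threshold the difference
print(bg)          # background has converged to the static scene
print(foreground)  # the brief passing object left no trace: all False
```

Because the update moves by a fixed step, brief foreground objects barely disturb the model, which is why this filter works well in crowded scenes.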

Texture-based analysis:

The idea behind texture-based analysis is to assume that a crowd with low density

should create coarse-grained textures, while high density crowds should create finer

textures [159]. Interest point detectors are widely used to extract texture information

from images (e.g. Harris, KLT, SIFT).


Object-based analysis:

Object-based analysis aims to detect objects to count the number of targets by ex-

tracting silhouettes, edges or blobs. There are two main categories of object detection

approaches: 1) appearance based, 2) motion based.

• Appearance based: Appearance-based object detection approaches apply image

processing steps to each frame individually to detect objects of interest. The

processing steps commonly include segmentation, point detection (identifying

interest points), templating (generating contours). Segmentation methods par-

tition an image frame by considering similar regions in the image. The texture-

extraction method is based on applying image descriptor algorithms. The contour-based method uses a template as a filter to detect objects.

• Motion based: Motion-based object detection approaches take advantage of

(pixel-wise) differences between frames caused by moving objects, such as back-

ground subtraction, temporal differencing, and optical flow. Background sub-

traction methods rely on modeling a background to differentiate the foreground

objects in each incoming frame. In temporal differencing, the motion is esti-

mated by using consecutive frames in a time interval to find deviations in the

following frames. When compared to background subtraction, it may be thought

of as dynamic background modeling. Optical flow is a pixel-wise short-term mo-

tion computation between consecutive frames, where each vector represents the

direction and amount of motion [107].
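Temporal differencing, for example, reduces to thresholding the pixel-wise difference of consecutive frames. A minimal NumPy sketch with synthetic frames (the threshold and pixel values are arbitrary choices for illustration):

```python
import numpy as np

def temporal_difference(prev, cur, threshold=25):
    """Motion mask from the pixel-wise difference of consecutive frames."""
    return np.abs(cur.astype(int) - prev.astype(int)) > threshold

prev = np.full((4, 4), 50, dtype=np.uint8)   # synthetic grayscale frame
cur = prev.copy()
cur[1:3, 1:3] = 200                          # a 'moving object' appears
mask = temporal_difference(prev, cur)
print(int(mask.sum()))   # 4 -- the four changed pixels are flagged
```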


2.2.2 People Tracking

The tracking task can be defined as a labeling process on detected objects or motion

flows for each frame in a video. A survey of available methods for object tracking

is presented in [154]. We summarize people tracking methods under two main

categories: 1) detection based, and 2) motion flow based tracking.

Tracking by detection:

This approach is based on the association of detected objects between successive image frames. It works better in low-density crowds, because of the detection problems that arise in high-density crowds.

Tracking by motion flow:

This approach is based on extracting motion flow for the objects or points that move

coherently. Texture-level analysis is used to compute regression estimates. This

approach is more suitable for structured, high-density crowds [5].

2.2.3 Behavior Learning

Crowd analysis for behavior learning is used in a wide variety of applications includ-

ing event recognition and anomaly detection. In video analysis, event recognition is

defined as the process of monitoring and analyzing the events occurring in a video

surveillance system in order to detect signs of security problems [71].

Anomaly detection in a crowded scene can be performed for detecting abnormal

events from unusual motions (e.g., fights), sizes (e.g., a vehicle in the pedestrian road),

directions (e.g., walking through a restricted area), and velocities (e.g., an escape from


a threat) or any combination of these anomalies in the targeted scene. Because of

the limitations of acquiring components of the surveillance video, anomaly detection

is a challenging problem in crowd analysis. First of all, the type of anomaly being looked for may depend on various factors, such as the density of the crowd (e.g., low or high density), the surveillance area (e.g., streets, pedestrian paths, indoor places, concert/event areas, train stations, etc.), and the surveillance equipment (e.g., resolution quality, number of cameras, camera platform, etc.). Although it is a difficult

task to define abnormal activity in an explicit manner, these components help to

define the target anomaly and appropriate approaches. It is important to mention

that this thesis is not trying to detect abnormal objects, which have abnormal object

sizes, speeds, or directions, such as a vehicle on a pedestrian road or a person in a restricted area [95, 15]. Instead, we aim to define behavioral structures of people

and recognize the events in the video to detect abnormal events, such as a sudden

dispersion, running, or merging of a considerable number of people [165, 160].

Behavior-based learning approaches can be divided into two main categories: 1) object-based and 2) holistic-based approaches [101].

Object-based approaches

Object-based approaches focus on segmentation and detection methods to learn be-

haviors of individuals in a crowded scene. Two of the main applications include:

1) detecting a person in a crowd whose behavior is not similar to learned behav-

ior [21], and 2) detecting abnormal interaction within groups of people [98]. This

approach is not suitable for high density crowds since it is difficult to track individu-

als, since the whole process is affected by occlusions.


Holistic approaches

Holistic approaches treat the crowd as a single entity to learn the global behavior of

a crowded scene. Instead of learning behaviors from individual targets, this approach

focuses on modeling techniques to extract dominant behavior such as motion flow, or

bottlenecks [27, 115]. This approach is more suited for dense crowds.

2.3 Anomaly Detection

Anomaly-based detection depends on defining normal behavior by training a

model for the considered dataset. This defining process has been performed with

various approaches and techniques by researchers according to the specific application

domain and the dataset structure. The structure of the dataset is the key factor in the modeling, as it determines which detection methods are applicable.

In many domains, there are numerous challenges related to working with a dataset

during the anomaly detection process. These issues can be summarized as follows.

First, we want to extract a dataset which covers every possible normal behavior. Even

if this is possible for the current situation, the normal behavior may change over time.

Second, we need to ensure the correctness of data samples. In some domains, such

as video anomaly detection, datasets need to be extracted from raw data via

preprocessing. If the extracted dataset has incorrect information or missing feature

points, then the anomaly detection process would not be successful. Third, we need to

collect a sufficient amount of data. There is a minimum number of data points needed

to learn a behavior. Fourth, we need to properly label the collected dataset. It is

more appropriate to test an anomaly detection method on a real dataset to prove its

success for a real world problem. This adds a time-consuming step in order to provide


labels on a collected dataset. Fifth, we need to select the appropriate features. In some cases the dataset may include information which is not necessary and can impact the detection process negatively. In these cases, there is a need to apply a

dimensionality reduction or a feature selection technique to improve the efficiency of

the model.

2.3.1 Anomaly Detection Algorithms

The techniques used to detect anomalies have been developed by employing various approaches. In the anomaly detection literature, these approaches have been grouped under different categories by researchers.

In 2004, Hodge et al. [65] reviewed outlier detection methods used in three fields of computing: Statistics, Neural Networks and Machine Learning. This prior study

also mentioned that there are hybrid systems which can combine multiple methods

to achieve better results.

A study presented by Ng [104] grouped anomaly detection methods into two

classes: 1) discriminative and 2) generative. Discriminative methods include techniques which fit a decision function from the input data to the output labels, such as Nearest Neighbor (NN), Support Vector Machines (SVM), and Neural Networks [28, 69]. Alternatively, generative methods refer to techniques that model how the observed data is generated in order to estimate the underlying behavior class, such as Parzen windows, Mixtures of Gaussians, and Hidden Markov Models (HMMs) [152, 110, 153].

Chandola et al. [31, 30] presented a detailed review of anomaly and outlier de-

tection methods with various data sources. Their work provides a broad overview of

extensive research on anomaly detection techniques in multiple application domains.


They categorized the techniques under six groups: classification based, clustering

based, NN based, statistical, spectral and information theoretic.

Although there are different approaches taken to classify anomaly detection algo-

rithms, the anomaly detection task fits well in the machine learning field, since the

nature of the anomaly detection process depends on learning a behavior or item-sets

from a given data.

2.3.2 Machine Learning

Machine learning involves the design and development of algorithms that learn from training data or past experiences. Machine learning techniques focus on building models for various purposes, such as prediction, recognition, diagnosis, detection, planning, control, etc. [7]. The field of machine learning overlaps with many other fields,

such as data mining, statistics and information theory, typically sharing some algo-

rithms or approaches to solve a learning problem. Since there are no strict boundaries

between these fields, an anomaly detection algorithm may fall under one or more of these fields, depending on the application and the target output. It follows that most techniques used

for anomaly detection, which include a learning phase, can also be considered machine

learning techniques.

In a generalized approach, there are three steps to be performed for a machine

learning-based anomaly detection: training, testing and analysis. The training step

includes profiling the behavior of the data. The test step refers to comparisons be-

tween the current activities and the learned behavior model. Finally, the analysis step

evaluates the test results to report significant deviations. Based on the availability of

the data labels, there are three different learning approaches for anomaly detection:

Supervised learning: Training with the normal and abnormal data with labels.


Supervised learning aims to determine the normal and abnormal classes from the

labeled data. If a test instance lies in a region of normality, it is classified as normal,

otherwise it is classified as abnormal.

Unsupervised learning: Training with the normal and abnormal data without

labels. Unsupervised learning aims to determine the anomalies without prior knowl-

edge of the data. This learning technique assumes that anomalies are separated from

the normal data and will thus appear as outliers.

Semi-supervised learning: Training with the normal data. This technique

needs pre-classified data but only learns from the data labeled as normal. Semi-

supervised learning aims to define a boundary of normality.

2.3.3 Hidden Markov Models

Figure 2.1: Hidden Markov Model

We chose to use HMMs as our machine learning approach for anomaly detection. A brief description of HMMs is given in the following; the preprocessing method and the structure of our HMM-based modeling are explained in Chapter 5 and Chapter 6.

HMMs are stochastic models for sequential data that have a broad spectrum of

usage categories that include speech recognition, motion/action analysis in videos,


human recognition and protein sequence alignment. A generic HMM structure is

presented in Figure 2.1 which represents the joint probability distribution over states

and observations as given in (2.1).

P(Y, X) = P(X_1 \mid \pi) \prod_{t=2}^{T} P(X_t \mid X_{t-1}) \, P(Y_t \mid X_t)    (2.1)

In an HMM, the states are not visible but the observations and the probability

distributions over the observations are known. Let H = {H_1, H_2, H_3, ..., H_N} be the set of hidden states, where N is the number of states, and V = {v_1, v_2, v_3, ..., v_M} be the set of observation symbols, where M is the number of observation symbols. Then

there are three probability measures in an HMM: 1) A (state transition probability distribution), 2) B (observation symbol probability distribution), and 3) π (initial state distribution). An HMM is represented in compact form as λ = {A, B, π}.

The state transition probability distribution A = {a_ij} represents the probability of being in hidden state j at time t + 1, given a known state i at time t:

a_{ij} = P[x_{t+1} = H_j \mid x_t = H_i]    (2.2)

1 \le i, j \le N,   a_{ij} \ge 0,   \sum_{j=1}^{N} a_{ij} = 1

The observation symbol probability distribution B = {b_j(k)} represents the probability of observing symbol v_k in state j:

b_j(k) = P[v_k \text{ at time } t \mid x_t = H_j]    (2.3)

1 \le j \le N,   1 \le k \le M,   b_j(k) \ge 0,   \sum_{k=1}^{M} b_j(k) = 1


The initial state distribution π = {π_i} represents the initial probability of state i:

\pi_i = P[x_1 = H_i]    (2.4)

1 \le i \le N,   \pi_i \ge 0,   \sum_{i=1}^{N} \pi_i = 1

The observation sequence Y is represented as:

Y = \{Y_1, Y_2, \ldots, Y_T\}    (2.5)

where T is the length of the observation sequence and each observation Y_t is one of the symbols from the set V.

There are three kinds of problems one can solve by using HMMs [114].

1) Evaluation: computing the probability of an observation sequence, P[Y | λ], when the observation sequence Y and a model λ are given. Calculating the likelihood of a model using the forward/backward algorithm has time complexity O(N^2 T), where N is the number of states and T is the length of the sequence.

2) Decoding: estimating the optimal state sequence, when the observation sequence Y and a model λ are given. Decoding the state sequence using the Viterbi algorithm has time complexity O(N^2 T), where N is the number of states and T is the length of the sequence.

3) Training: estimating the model parameters λ = {A, B, π}, when the observation sequence Y and the dimensions N and M are given. Estimating the model parameters using the Baum-Welch (Expectation-Maximization) algorithm has time complexity O(N^2 T I), where N is the number of states, T is the length of the sequence, and I is the number of iterations.

We apply the solution of the third problem to find the best-fit model. The estimated parameters are then used in the first problem to find the likelihood values of the observation sequences.
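As an illustration of the evaluation problem, the following is a minimal sketch of the forward recursion; the two-state model parameters below are hypothetical, chosen only to demonstrate the O(N^2 T) computation of P[Y | λ].

```python
# Sketch of HMM evaluation (problem 1): the forward algorithm computes
# P(Y | lambda) in O(N^2 T) time. The toy parameters A, B, pi below are
# hypothetical, for illustration only.

def forward_likelihood(A, B, pi, Y):
    """Return P(Y | lambda) for an HMM lambda = {A, B, pi}.

    A[i][j] : transition probability from state i to state j
    B[j][k] : probability of emitting symbol k in state j
    pi[i]   : initial probability of state i
    Y       : observation sequence as a list of symbol indices
    """
    N = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(Y_1)
    alpha = [pi[i] * B[i][Y[0]] for i in range(N)]
    # Induction: alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(Y_{t+1})
    for t in range(1, len(Y)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][Y[t]]
                 for j in range(N)]
    # Termination: P(Y | lambda) = sum_i alpha_T(i)
    return sum(alpha)

# Toy two-state, two-symbol model (hypothetical parameters).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
print(forward_likelihood(A, B, pi, [0, 1, 0]))
```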


2.3.4 Evaluation Metrics

There are two possible outputs of an anomaly detection task: 1) a label, or 2) a score.

Label outputs can be represented by string words, binary values or a letter from an

alphabet of length two, such as normal or abnormal, 0 or 1, or a or b. Output scores, on the other hand, are calculated values (also called decision values), and we can apply a threshold to the decision value to decide whether a sample is normal or abnormal. For either output type, we need to evaluate the results of an anomaly detection task. Next,

we present the evaluation methods that we have used to evaluate our implementation

results.

Confusion Matrix

A confusion matrix provides information about the number of normal and abnormal

instances in actuality, and the number of normal and abnormal instances in the

analyzed results. Table 2.3 shows the confusion matrix for a two-class classifier. “True

(T)” indicates that the prediction is correct, and “False (F)” is incorrect. “Positive

(P)” is used to indicate an “abnormal” class and “Negative (N)” a “normal” class.

There are four kinds of data in the confusion matrix to show the correct and incorrect

predictions of the two classes: 1) TP, 2) FP, 3) TN and 4) FN. For example, TP (True

Positive) indicates the number of “abnormal” instances that are predicted correctly.

Table 2.3: Confusion matrix

                           PREDICTED
                      Positive    Negative
  ACTUAL   Positive     TP          FN
           Negative     FP          TN
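As a minimal sketch, the four counts can be tallied directly from paired label lists; encoding 1 for "abnormal" (positive) and 0 for "normal" (negative) is an assumption made only for illustration.

```python
# Sketch: tally a two-class confusion matrix from actual and predicted
# labels, with 1 = "abnormal" (positive) and 0 = "normal" (negative).

def confusion_matrix(actual, predicted):
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1   # abnormal correctly flagged
        elif a == 0 and p == 1:
            fp += 1   # normal wrongly flagged as abnormal
        elif a == 0 and p == 0:
            tn += 1   # normal correctly passed
        else:
            fn += 1   # abnormal missed
    return tp, fp, tn, fn

actual    = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
print(confusion_matrix(actual, predicted))  # (2, 1, 2, 1)
```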


Rate Measures

The basic evaluation measures are derived from the information that the confusion

matrix provides. The definitions and the formulations of these evaluation measures

are explained below.

The precision is the percentage of the correctly predicted anomalies (TP), com-

puted over all predicted anomalies.

\text{Precision} = \frac{TP}{TP + FP}    (2.6)

The true positive rate (TPR) is the percentage of the correctly predicted anoma-

lies (TP) over all the actual anomalies. The TPR is also known as the recall rate

or sensitivity.

TPR = \frac{TP}{TP + FN}    (2.7)

The true negative rate (TNR) is the percentage of the correctly predicted nor-

mal cases (TN) over all the actual normal cases. TNR is also known as specificity.

TNR = \frac{TN}{TN + FP}    (2.8)

The false positive rate (FPR) is the percentage of the normal cases that are

incorrectly predicted as anomalies (FP) over all the actual normal cases.

FPR = 1 - TNR = \frac{FP}{TN + FP}    (2.9)

The false negative rate (FNR) is the percentage of the anomalies that are

incorrectly predicted as normal cases (FN) over all the actual anomalies.

FNR = 1 - TPR = \frac{FN}{FN + TP}    (2.10)


The accuracy is the percentage of the correctly predicted cases (TP+TN) over

all the test samples.

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}    (2.11)
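The six measures in (2.6)-(2.11) can be computed directly from the four confusion-matrix counts. The counts in the usage example are hypothetical, and the sketch assumes nonzero denominators.

```python
# Sketch of the rate measures in (2.6)-(2.11), computed from
# confusion-matrix counts (assumed nonzero denominators).

def rates(tp, fp, tn, fn):
    return {
        "precision": tp / (tp + fp),
        "tpr":       tp / (tp + fn),   # recall / sensitivity
        "tnr":       tn / (tn + fp),   # specificity
        "fpr":       fp / (tn + fp),   # = 1 - TNR
        "fnr":       fn / (fn + tp),   # = 1 - TPR
        "accuracy":  (tp + tn) / (tp + tn + fp + fn),
    }

# Hypothetical counts for illustration.
m = rates(tp=8, fp=2, tn=85, fn=5)
print(m["tpr"], m["fpr"])
```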

Receiver Operating Characteristic Curve

Figure 2.2: ROC curves

A Receiver Operating Characteristic (ROC) curve is a graphical plot which illus-

trates the performance of a classifier by presenting a trade-off between detection rate


(TPR) on the y-axis and the FPR on the x-axis [100]. An ROC curve is plotted by sweeping over different threshold values.

A generic example of an ROC curve is shown in Figure 2.2. Point (0, 1) is the

ideal operating point on an ROC curve, where the TPR is 100% and FPR is 0%. The

line x = y shows the performance of a random classifier which makes random guesses when classifying data points [51].

There are two fundamental features of an ROC curve representation. First, it

provides a method to find the optimum threshold value for a classifier. For example,

an operating point is likely optimal if it is closest to the ideal point. If two points are equidistant from the ideal point on the ROC curve, the one with the lower FPR is preferable. Second, it provides a means of comparing classifiers. For example, a classifier is likely optimal if its curve lies closest to the y-axis, since the optimal detector should have the lowest FPR together with the highest TPR. Another comparison technique is computing the Area Under Curve (AUC) value of the ROC curve for each classifier. The AUC value provides a single-number discrimination measure across the full range of thresholds to assess the performance (accuracy) of a model.

Typically, it is assumed that the optimal detector should have the largest AUC [19].

Although this assumption is correct in general, it is recommended to consider the sen-

sitivity and specificity, instead of using the AUC values alone. Visual inspection of

the ROC curves would also be needed when misclassification (FPs and FNs) costs are

unequal in accordance with the application goal [93].
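A minimal sketch of building ROC points and estimating AUC by trapezoidal integration is given below. It assumes higher decision scores indicate anomalies and that the scores are distinct; both are illustrative conventions, and the toy scores and labels are hypothetical.

```python
# Sketch: build (FPR, TPR) ROC points by sweeping a threshold over
# decision scores (assumed convention: higher score = more anomalous,
# distinct scores), then estimate AUC by trapezoidal integration.

def roc_auc(scores, labels):
    # Sort samples by decreasing score; sweep the threshold through them.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    p = sum(labels)              # actual positives (anomalies)
    n = len(labels) - p          # actual negatives (normal)
    tp = fp = 0
    points = [(0.0, 0.0)]        # (FPR, TPR) pairs
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n, tp / p))
    # Trapezoidal area under the (FPR, TPR) curve.
    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auc

scores = [0.9, 0.8, 0.55, 0.4, 0.3]   # hypothetical decision values
labels = [1, 1, 0, 1, 0]
_, auc = roc_auc(scores, labels)
print(auc)
```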

Cross Validation

Cross validation is a re-sampling technique for the performance evaluation of a train-

ing approach where only a limited sample size is available. In general, cross-validation


techniques divide a dataset into approximately equal-sized subsets and perform an

evaluation on each subset, iteratively [81]. One type of cross validation is k-fold cross

validation, which is a common re-sampling technique to evaluate how accurately a

detection model will perform in practice. The iterative procedure for k-fold cross

validation presented in Figure 2.3 is formally defined as follows:

• Step 1. Divide the samples randomly into k-folds (subsets).

• Step 2. Select one fold as a validation set and use the remaining k − 1 folds for training.

• Step 3. Repeat step 2, until each subset has been tested exactly once.

Figure 2.3: K-fold cross validation

As a result, each fold plays the role of the test set exactly once, and is part of the training set k − 1 times [99]. The overall performance of the model may be estimated

by either averaging or combining the results obtained from the test folds in each run.
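The three steps above can be sketched as an index-splitting routine; the shuffling seed and fold layout are illustrative choices.

```python
import random

# Sketch of k-fold cross validation: shuffle sample indices, split them
# into k roughly equal folds, and yield (train, validation) index sets
# so that each fold serves as the validation set exactly once.

def k_fold_splits(n_samples, k, seed=0):
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)        # Step 1: random division
    folds = [idx[i::k] for i in range(k)]   # k roughly equal subsets
    for val in folds:                       # Steps 2-3: rotate folds
        train = [j for f in folds if f is not val for j in f]
        yield train, val

for train, val in k_fold_splits(10, 5):
    print(len(train), len(val))
```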

A complete cross validation may be obtained by repeating k-fold cross validation

multiple times using different folds, or by setting the k value to the number of data


samples (known as leave-one-out cross validation). Although the variance of the resulting estimate is reduced as k is increased, this approach is not preferred in practice since it is computationally expensive [81].


Chapter 3

Related Work

In this chapter, we provide related work on anomaly detection for the areas of system

call analysis, crowd analytics and context-aware systems.

3.1 Related Work in System Call Analysis

As computer security has become an increasingly important issue, efficient and accurate intrusion detection methods have become mandatory in order to protect computer systems from intrusions. As a result, a number of Intrusion Detection Systems (IDS)

have been developed to defend network and computer systems against possible dam-

age and information loss.

Anomaly detection-based intrusion detection has been an active research area

since it was proposed by Denning [41]. Anomaly detection can be used in Host-based

Intrusion Detection Systems (HIDS) using a range of different techniques and data

sources to improve computer security. There are many forms of data that can be used

for HIDS such as CPU usage, time to login, names of files accessed, user commands,

keystroke records, and system call traces. In recent work, system call traces have


commonly been used to analyze program behavior. System call behavior has been

studied extensively in prior work on HIDSs and there have been many applications

of this approach discussed in the computer security literature [41, 96, 97, 108, 120,

31, 153]. The idea behind using system call traces is based on being able to track

each request that a program makes to the operating system during its execution. A

system call sequence is a discrete sequence such that each system call belongs to a

finite alphabet of system calls executed by a particular operating system [32].

3.1.1 Data Representation in System Call Analysis

There have been a variety of approaches proposed for anomaly-based intrusion detection using system call traces. They differ mainly in the type of data used as input to the detector and in the specific anomaly data representation.

Some of the previous work has focused on using only system call arguments [85,

103], while others have combined the system call sequences with the arguments [94,

132]. But the majority of the previous work in this area has focused on using only

system call sequences to train a behavior model. Working only with system call

sequences, the various implementations differ in how data is represented. In gen-

eral, these representations can be grouped into two categories based on their feature

extraction methods: 1) frequency-based methods and 2) sequence-based methods.

Frequency-based feature extraction methods

The frequency-based feature extraction methods rely on the number of occurrences

of each system call. For example, using a “bag of words” representation (which is commonly used in text classification), we can map the system call anomaly detection problem onto this representation. In [75], each trace is treated as a document and each system


call in a document is treated as a word. Since the “bag of words” representation

can be used with many of the machine learning algorithms, it is widely used in

system call anomaly detection [88, 151, 162]. Instead of only counting the number of occurrences, some approaches improve detection by applying a ranking based on the relative order of frequency values [137].

Another rich representation of a frequency-based vector is the term frequency-

inverse document frequency (tf-idf), where term refers to a system call and document

refers to a system call sequence [34]. In [141], several forms of tf-idf are presented and evaluated with various classification algorithms on system call sequences and an HTTP log dataset.
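A minimal sketch of both representations is given below: raw system call counts per trace, and one common tf-idf variant (count × log(n/df)). The toy traces and this particular tf-idf form are assumptions for illustration, not the exact formulation used in the cited work.

```python
import math
from collections import Counter

# Sketch: frequency-based feature extraction for system call traces.
# Each trace (document) becomes a vector of system call (term) counts,
# optionally reweighted by a common tf-idf variant. Traces are toy data.

def count_vectors(traces, vocab):
    return [[Counter(t)[s] for s in vocab] for t in traces]

def tfidf_vectors(traces, vocab):
    n = len(traces)
    # idf(s) = log(n / df(s)), where df(s) = number of traces containing s
    df = {s: sum(1 for t in traces if s in t) for s in vocab}
    idf = {s: math.log(n / df[s]) if df[s] else 0.0 for s in vocab}
    return [[Counter(t)[s] * idf[s] for s in vocab] for t in traces]

traces = [["open", "read", "read", "close"],
          ["open", "mmap", "close"],
          ["fork", "execve", "wait4"]]
vocab = sorted({s for t in traces for s in t})
print(count_vectors(traces, vocab)[0])
```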

Sequence-based methods

Sequence-based methods use the order of system calls, or the position where a system call occurs within short sequences. This information is extracted from a system call trace. Forrest et al. [54] introduced an approach for intrusion detection based on monitoring the

system calls of a program during execution. The idea behind their work depends on

the immune system’s ability to distinguish “self” from “nonself”. They extracted normal program behaviors from system call traces to define “self” for Unix processes. A normal behavior database is created using a sliding window of size k + 1 over

the system call sequences. Then for each system call, they recorded the following

system calls for each position from 1 to k. During the testing phase, test sequences are scanned and the percentage of mismatches is computed by considering the maximum number of possible pairwise mismatches for a sequence with a lookahead of k.

In [66], they improved their prior work by using a “stide” (sequence time delay em-

bedding) system. They first generate a normal database from a set of length k unique


short sequences and use the “Hamming distance” to compute deviations of the test

sequence from the normal. The success of both approaches depends on the completeness of the normal dataset and the length of the sliding window. If the normal dataset does not include all possible short sequences that a program can produce during its execution, the missing sequences will result in false positives. If the length of the sliding window is too short, anomalies in the test sequences can be missed, since an abnormal short sequence may contain many normal shorter sequences; and if it is too long, it may consume

considerable system resources. One outcome of this prior work was the generation of

a benchmark dataset for further system call trace analysis [136].
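A stide-style detector can be sketched as follows: collect the unique length-k windows of the normal traces into a database, then score a test trace by the fraction of its windows missing from that database. The toy traces and the choice k = 3 are illustrative, not from the cited dataset.

```python
# Sketch of a stide-style detector: build a database of the unique
# length-k windows seen in normal traces, then score a test trace by
# its fraction of windows absent from that database.

def windows(trace, k):
    return {tuple(trace[i:i + k]) for i in range(len(trace) - k + 1)}

def mismatch_rate(normal_db, test_trace, k):
    test_windows = [tuple(test_trace[i:i + k])
                    for i in range(len(test_trace) - k + 1)]
    misses = sum(1 for w in test_windows if w not in normal_db)
    return misses / len(test_windows)

normal = ["open", "read", "write", "close", "open", "read", "close"]
db = windows(normal, 3)
# A trace with an unseen call ordering yields a nonzero mismatch rate.
print(mismatch_rate(db, ["open", "read", "write", "close"], 3))    # 0.0
print(mismatch_rate(db, ["open", "execve", "write", "close"], 3))  # 1.0
```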

3.1.2 HMM in System Call Analysis

A large amount of prior work has been done in the area of anomaly detection-based

HIDS. In this section, we present prior work related to anomaly detection-based

HIDSs using Hidden Markov Models for system call analysis.

Hoang et al. [64] developed a multi-layer model based on HMMs to decrease the

False Positive Rate (FPR) of system call anomaly detection. They created a normal

dataset using a sliding window run over system call traces. In their model, a first-layer

checks the given test subsequence, and if there is a mismatch or if it is a rare sequence

in the normal dataset, then the subsequence is sent to the HMM layer. The HMM

layer computes the likelihood value and compares it with a predefined threshold to

determine whether the sequence is normal or abnormal. Their approach decreases the

FPR, since the likelihood value of a normal subsequence is higher than the abnormal

ones, even if it does not appear in the normal dataset. Their work also enables detection within a short period of time, without needing to wait for the end of the program execution.


Yeung and Ding [153] experimented with Fully-Connected HMMs and Left-to-

Right HMMs as dynamic modeling approaches to evaluating sequences of system calls

and minimum cross entropy as an information-theoretic static modeling approach

using the occurrence frequency distributions of system calls. Their results showed

that the dynamic modeling approaches are more suitable for system call datasets and that the Fully-Connected topology between states was more successful compared to the Left-to-Right topology.

Du et al. [47] implemented HMM-based anomaly detection by defining two hidden

states, one for normal states and one for abnormal states. They compute the relative

probability of system call sequences to determine if they are normal or abnormal ac-

cording to a given HMM model. Experiments on the UNM sendmail and lpr datasets

showed that the relative probability value differences between normal and abnormal

sequences are distinct. The proposed model is simple and effective when applied to

the intrusion detection problem.

Khreich et al. [78] proposed multiple-HMMs (µ-HMMs) for system call anomaly

detection to overcome the issue of selecting the number of hidden states. The µ-

HMMs approach is based on training multiple models with a varying number of

hidden states and combining the results according to the Maximum Realizable ROC

(MRROC) method. The proposed model provides better performance on system call

traces when compared with using a single HMM.

Although HMMs applied to system call sequences show better results as compared

to static approaches, there are still concerns about the required training time. In this

regard, Hu et al. [68] proposed a simple data preprocessing approach to speed up

HMM training. They improved their previous work [63], in which they proposed an incremental HMM based on dividing the observation sequences into a number (R) of subsequences and merging the parameter estimation results from the R trained sub-HMMs, by removing similar subsequences of system calls from the normal dataset. The result is that the training time can be reduced by up to 50% with a reasonable FPR.

Since profiling complex sequential data is still an open problem in anomaly detec-

tion, there is still a need for further enhancements. While a number of prior approaches have used HMMs for anomaly-based intrusion detection, these models have become more complex as each improvement was added to increase the detection rate while reducing the FP rate.

There are several reasons for this growth in complexity. First, a detailed analysis

of program traces is missing in most of the prior work and HMMs are trained for

a range of behaviors for normal execution. This kind of training has a huge impact

on learning a normal behavior. Second, some of the process traces collected during

different executions of a program are identical. This leads to two different problems:

1) For the training step, there would be several identical training traces. Using

unique process traces is necessary because of the need to decrease learning time while

using HMMs. 2) For the test step, using different program traces including identical

process traces does not provide an accurate learning and testing approach. If the

training and test sets are partitioned only by considering program execution, it is

possible to see identical process traces in the training and testing datasets at the same

time.

Our process trace clustering approach for using HMMs on system call anomaly

detection provides better results, more accurate model settings and a less complex

structure to detect anomalies. Details of our training approach are presented in Chapter 5. In this thesis, we show how to use system call traces with an HMM method and

explore which preprocessing technique is most suitable for anomaly detection.


3.2 Related Work in Crowd Analysis

In this thesis we utilize Crowd Anomaly Detection as a motivating application. In

this section, we present prior work on crowd analysis for behavioral learning and

anomaly detection.

Anomaly detection is an active area of research in surveillance applications of

crowd scenes. Typically, the anomaly detection task involves finding data samples

that do not conform to a definition of normal. In crowd analysis, based on the target

video scene and the analysis goal, there are a number of approaches we can use for

identifying abnormal behavior and many different data (feature) types that one can

extract from a crowd video.

3.2.1 Data Representation in Crowd Analysis

In crowd analysis, feature extraction is highly dependent on the goal of the analysis,

since the crowd anomaly detection approaches can be categorized according to type

of scene representation [95]. These features may be grouped roughly into two categories with respect to the problem of interest and the video components and techniques that are used. The first category is object-level features, which are extracted by applying

the methods for detection and/or tracking of individuals. The second category is

frame-level features which are extracted from pixels or patches present in each frame

and then considering the dynamics in consecutive frames.

Object-level Features

In video surveillance applications, one of the main problems of interest is detecting

abnormal behaviors of individuals, or the relationship and connections between these


individuals [58].

Object-level features provide information about individuals in the scene, including,

but not limited to, color, size, shape, speed, direction, and trajectory. Object-level

features are typically used for low to moderate-density crowds for local or global

abnormal event detection, such as people interaction (fighting, following, meeting),

restricted area access (vehicles in a pedestrian path), and similar scenarios. There

are many applications of anomaly detection using object-level features during the

surveillance of critical places where we need to understand behavior of individuals or

groups. An extensive survey of object-level approaches in transit surveillance can be

found in [23].

The trajectory is the feature most commonly used to learn crowd behavior in object-level analysis [36]. This approach may be used in either sparse or dense crowds

by applying appropriate techniques. Researchers have proposed various approaches to

use trajectory information for anomaly detection, such as similarity analysis between

trajectories [70, 145], and motion pattern modeling using machine learning approaches, including HMMs [62, 74, 138] and Bayesian models [142, 143, 144].

Hervieu et al. [62] proposed a statistical trajectory-based approach addressing

two issues related to dynamic video content understanding: recognition of events

and detection of unexpected events. They compute local differential features com-

bining curvature and motion magnitude on the motion trajectories and the temporal

causality of the features is then captured by HMMs. Event recognition is performed

by classifying the trajectories according to learned trajectory classes. Unexpected

events are identified by comparing the test trajectories to representative trajectories

of known classes of events.

While detection-based methods provide more accurate results, they may fail in


high-density crowds because of the high degree of occlusion and their computational complexity [148].

Frame-level Features

Behavior learning for anomaly detection in extremely crowded scenes is another area of

great interest in video surveillance applications. In heavily crowded scenes, perform-

ing crowd behavior analysis by tracking individuals is impractical [127]. Instead, we

can try to understand collective behavior of a crowd to classify the events.

Frame-level features generally provide global information about a frame or patches at a pixel level, including, but not limited to, color, texture, shape, and motion. Frame-level features are mainly used for moderate- to high-density crowds for abnormal event detection, such as collective behavior (bottlenecks, lanes), sudden behavior changes (running, evacuation), and similar analyses.

Anomaly detection via tracking individuals in extremely crowded scenes is challenging due to the high density of features and frequent occlusions [82]. Therefore,

spatio-temporal gradients and optical flow are two popular feature representations

to model motion patterns [95]. Spatio-temporal-based crowd analysis aims to detect

anomalies based on the appearance and motion of objects in a temporal time sequence

without tracking [17, 43, 82]. The optical flow technique models the motion information of flow points, which are generally extracted as descriptors [139].

In [33], the authors estimated optical flow to cluster crowds into groups by using an adjacency-matrix based clustering (AMC) method. Then they characterized group behaviors by a force field model which provides information about the orientation and the force of each crowd. Abnormal crowd behavior events are flagged when the orientation of a crowd changes abruptly or when the interactions in the crowd differ


significantly from the predicted behavior.

In [148], researchers provide an energy-based model to estimate the number of people. To represent the spatial distribution of a crowd, they combined people-counting results with crowd entropy. Their work detects two abnormal activities of people: 1) gathering and 2) running.

In [36], researchers proposed an anomaly detection framework to model the spatio-temporal distribution of crowd motions and detect anomalous events by learning

regions of interest from historical trajectory sets and the statistical template of the

pedestrian distribution. They build a Hierarchical Pedestrian Distribution, a series

of histograms based both on global and local levels, as the templates for the observed

movement distribution. This distribution statistically describes time-correlated crowd

events using overall crowd information and local details in the regions of interest.

Mehran et al. [101] proposed a social force model to capture the dynamics of crowd behavior for localized anomaly detection in a crowd scene. They computed the space-time average of optical flow across a grid of particles and obtained Force Flow values for every pixel in every frame. To model normal behavior, they use

spatio-temporal volumes of Force Flow. Anomalies are detected based on a bag of

words model of the social force fields.

3.2.2 HMM in Crowd Analysis

A large amount of prior work has been done in the area of anomaly detection in

crowd scenes. In this section, we present prior work related to anomaly detection and

behavior learning-based crowd analysis using Hidden Markov Models. HMMs have

been used in different ways to detect abnormal behavior in a video scene depending

on the data that needs to be modeled to learn normal behavior, such as human


interaction modeling, optical flow analysis, and related behaviors.

Oliver et al. [105] used two different architectures, namely an HMM and a Coupled HMM (CHMM), and compared their performance on learning normal behavior and recognizing human behaviors. Their system focused on recognizing the interactions between people, such as following another person or meeting another person. Training is performed using synthetic data, created to develop flexible and interpretable behavior models. A CHMM can provide improved results in terms of training efficiency and classification accuracy.

Vaswani et al. [138] proposed a “shape activity model” to model the “activity” performed by a group of moving and interacting objects (referred to as “landmarks”). A continuous-state hidden Markov model is defined for landmark shape dynamics in an activity, where the object locations at a given time form the observation vector and the corresponding shape and motion parameters constitute the hidden-state vector. An abnormal activity is detected as a change in the shape activity model. The model is tested for abnormal activity detection in an airport scenario involving multiple interacting objects.

Andrea et al. [8, 9, 10] used HMMs with a mixture of Gaussians to characterize the normal behavior of a crowd by learning normal motion patterns from the optical flow of image blocks. They applied Principal Component Analysis (PCA) to build the feature prototypes, used spectral clustering to find the optimum number of models to group video segments with similar motion patterns, and trained an HMM for each model. These HMMs are then used for event recognition and anomaly detection.

Kratz and Nishino [82, 84] extracted spatio-temporal gradients to fit a Gaussian model and used a collection of HMMs, one for each spatial patch in a frame, to model the structured crowd motion. In [83], the authors introduced the use of pedestrian


efficiency to detect unusual events and to track individuals in crowded scenes. They

used the trained HMMs to estimate the intended motion at each space-time location

in a different video of the same scene. For each scene, they train the HMMs on a

usual sequence, and estimate the efficiency of the remaining sequences. A frame is

considered unusual if its average efficiency is below a specific threshold that is selected

empirically. In contrast to their previous work, in [83] the authors train the HMMs on directional statistics distributions of optical flow, resulting in a more compact and

accurate representation. They showed that the pedestrian efficiency can be used to

detect abnormal activities without detecting and tracking individuals.

3.3 Related Work in Context-aware Systems

In this thesis, we propose a framework for context-aware anomaly detection on symbolic sequential datasets. Using a context-aware approach is not a new idea. It has been used in different research areas such as web services, information services, smart phones, smart environments, etc. The core of context-aware systems is the ability to provide specific services according to the context of a user, item, task, or event.

In order to understand context-aware approaches, it is first necessary to discuss the definitions of context and context awareness. Context has been studied across different disciplines, and each discipline tends to define context from its own view for specific applications [1]. Therefore, there are many different definitions of context in the literature, depending on the application area and discipline. An exhaustive review of 150 definitions coming from different domains is presented in [16]. The authors collected definitions of context from the web, surveying cognitive science and related disciplines. Although their analysis


shows that the content of all the definitions can be analyzed through six essential components (constraint, influence, behavior, nature, structure, and system), they conclude: “a definition of context depends on the field of the knowledge that it belongs to”.

Dourish [46] also introduced a broad taxonomy, categorizing contexts into two categories: 1) representational, where the context can be defined with observable and stable attributes, and 2) interactional, where the context is not necessarily observable. The most important distinction between these two categories is the definition of the relation between context and activity. In the representational view, “context and activity are separable”, whereas in the interactional view, “context arises from the activity” [46]. Considering context as an interactional problem is an alternative view of context definition, since it looks for relevance between activities instead of treating a fixed attribute as the context.

A generic definition of context is given by Hong et al. [67] as “any information that can be used to characterize the situation of an entity”. The authors provided a comprehensive overview and presented a classification framework for context-aware systems. Beyond this generic definition, they categorized context as external and internal. External context refers to context data collected through physical sensors, such as location, distance, temperature, time, etc. On the other hand, internal context refers to context data extracted by analyzing the collected data to understand internal elements, such as preferences, tasks, emotional state, etc.

3.3.1 Context-aware Applications

Context information can be used to improve service in real life situations, which

is called context-awareness. The main advantage of context-aware systems is their

ability to adapt their operations to the current context [13]. For example, if a person


is busy (e.g., running, working) or in a place where he/she probably wishes not to be disturbed, then a smartphone can go silent or send a predefined automatic message as a reply. Although context-awareness is widely used as a property of mobile devices that adapt their behavior to the physical environment, there are many other applications in different domains, such as recommender systems, process design, and recognition systems.

Recommender systems aim to generate more relevant recommendations by providing specific services according to the contextual situation of the user [2]. A music recommendation system based on users’ listening histories is presented by Hariri et al. [59]. In their work, the context is not fully observable, but they infer it from users’ interactions with the system. Friend and item recommendation on social networks is proposed by Liu and Aberer [92]. The authors combine context and social network information, inferring a user’s preferences by learning his/her friends’ tastes to improve the quality of recommendation.

Applications of recognition systems are generally presented in the computer vision area. Activity recognition is one of the most important goals in many surveillance systems. Zhu et al. [166, 167] presented a context-aware activity recognition framework to detect anomalies. In their work, they proposed a model to learn context patterns from a set of activities. Another work is presented by Lan et al. [86], which recognizes the actions of individuals by using contextual group activities. The contextual information is captured from the individuals and the relations between them. They show the importance of contextual information in activity recognition problems.

Another common application area of context-aware systems is process design.

Process design can be applied to solve various problems in different domains. An


example of context-aware process design applied to business theory, in relation to business process management issues, is given in [118]. There are also various application examples in health care. In [14], the authors presented application examples and design principles of context-aware computing for medical work in hospitals, such as a context-aware pill container and a context-aware bed which react according to context.

3.3.2 Context Inference

Knowledge in the form of contextual attributes in context-aware systems can be

classified into three categories [2]: 1) Fully observable: the contextual attributes,

their values and the structures are known. 2) Partially observable: some information

is known about contextual attributes. 3) Unobservable: no information is known

about contextual attributes.

When contextual attributes are fully observable, they can be used directly to adapt the system to the context. In the case of partially observable or unobservable contextual attributes, the contexts must be inferred. In general, context inference is based on two main steps: 1) reading sensor values to determine the user's context, such as whether the user is moving, the sound is loud, the temperature is high, etc.; 2) using these readings to identify the context. Santos et al. [122] provided a prototype of a context inference engine for mobile applications. They applied a classification technique since the context classes (walking, running, idle, resting) are known in advance. In order to learn the classification of contexts, they built a decision tree using features obtained from multiple sensor readings such as sound, light, time, etc. A context is then determined by searching through the decision tree based on the test features extracted from sensor readings.

Instead of using sensor values to infer context attributes, there are some other


works that propose approaches to learn contexts for a problem without an observable contextual attribute. In this kind of context inference, the data needs some preprocessing in order to generate the context models. This preprocessing could be a classification or clustering technique.

Tu [135] proposed an auto-context algorithm to learn a context model for computer vision tasks. The algorithm learns a classifier for local image patches from a set of training images and their label maps. The classifications are then used as context information for further tasks.

In [18], the authors adapt their object tracking application depending on changes in the background context, such as illumination, coloring, scaling, etc. In their work, the background contexts are identified by clustering various training backgrounds to deal with changing backgrounds. At test time, the best-fitting background cluster is determined for each frame, and the corresponding object descriptor for the determined context is selected to track objects.

In our case, we do not have any sensor values or a predefined context; instead, we want to learn the contexts of the data from the data itself. The sequential structure and distributions are the only information we observe to extract the contexts in our generic framework. This context learning is a clustering problem, since we do not have any prior information about the contexts, such as the number of contexts or the features that could be used to identify them. At test time, we want to identify contexts from a learned model which is generated by clustering similar data sequences to capture the data context. Details of the proposed context learning approach are explained in Chapter 4.


Chapter 4

Context Learning

In this chapter, we describe an automated context learning approach for symbolic

sequential data. We first provide our definition of context for symbolic sequential

data. Next, we briefly review clustering approaches for sequential data. Then we present our methodology for performing automated context learning.

4.1 Context in Symbolic Sequential Data

Context-aware systems have been studied across various application areas to enhance

the effectiveness of the particular application. In general, a context can be identified

with a context attribute such as time, age, weather, location or a combination of

these attributes. For example, an hourly sales metric for a retail store would display

different characteristics for different “days” of the week and a normal heart rate range

would differ according to “age” or “gender” of the people. While “day” information

could be used to identify unique contexts for sales metrics, “age” or “gender” could

be used as contexts for an expected range of heart rates from a population. Trying

to analyze a dataset without first considering these contexts may lead to confusing


behaviors or statistics. Selecting a context attribute depends on the definition of a context. As presented in Section 3.3, the definition of a context may change according to the application field or analysis goal.

In this dissertation, we provide a generic framework for anomaly detection on symbolic sequential data, and context learning is used for filtering during a preprocessing step in order to enhance the accuracy of our framework. The target data we focus on contains sequences of symbols generated from a finite symbolic alphabet; there is no context attribute to classify data samples in a straightforward way. We define a context as a set of data instances that exhibit similar characteristics, assuming that sequential data samples will be structurally similar if they are contextually related. In our definition, “being in the same context” indicates that the data samples were collected from a similar source, monitored under similar conditions, generated for similar purposes, etc. Since the context attribute is unobservable, it is necessary to categorize the sequential data samples into contexts with an unsupervised learning (clustering) approach. Basically, clustering can be defined as dividing data samples into groups according to their similarities. We let each cluster represent a context. In the subsequent steps of our anomaly detection framework, identifying contexts in a dataset allows us to model normal behavior for each context individually.

4.2 Clustering for Context Learning

Clustering is one of the most commonly used partitioning methods in the field of

data mining. A number of surveys and texts have described the richness of different

clustering methods [73, 76, 129, 163, 164]. Clustering is an unsupervised learning


technique which aims to discover the natural boundaries between data samples by minimizing dissimilarity within clusters and maximizing dissimilarity between clusters.

Clustering algorithms can be broadly divided into two categories [72]: 1) hierarchical and 2) partitioning. Hierarchical clustering algorithms recursively find nested clusters using either agglomerative (bottom-up) or divisive (top-down) methods. Hierarchical algorithms differ in the criteria they use to determine similarity between clusters, such as single linkage, average linkage, and complete linkage [3]. Partitioning-based clustering algorithms find all the clusters simultaneously according to a predefined cluster number k. The most popular partitioning-based clustering algorithm is k-means, because of its ease of implementation, simplicity, and efficiency [3].

There are three main categories of approaches for clustering sequential data [149]:

1) Sequence similarity, 2) Indirect sequence clustering, and 3) Statistical sequence

clustering.

1) Sequence similarity: This approach depends on defining a similarity or

dissimilarity measure to compute the distances between data samples in a dataset.

A partitioning or hierarchical clustering method can be applied on the extracted

distances.

2) Indirect sequence clustering: This approach depends on generating feature-

based representations from data samples to use classical vector space-based clustering

algorithms.

3) Statistical sequence clustering: This approach depends on modeling data samples to capture the dynamics of each data group.

In this thesis we apply partitioning-based clustering for learning the contexts in a symbolic (discrete) sequence dataset. Since the contextual attribute in a symbolic


sequence generally corresponds to the ordering of symbols, we need a way to cluster symbolic sequences that captures sequential dependencies.

Sequential data collection procedures may change based on the monitored data or

the application domain. Data collection involves monitoring a data source for a period

of time. This time period can be fixed to a predefined value or it may be variable

based on the start and end times of an event. For example, when tracing a program's execution, the data collection task may last for hours or days after the execution has started. In such cases, we want to be able to recognize the context of a sequential

stream while it is being collected. Next, we present the details of our approach for

context learning.

4.3 Parameter Selection

One of the strengths of our approach is that it can be easily used in various domains for the anomaly detection task where the collected data is comprised of discrete sequences. As mentioned before, context learning is one of the main steps of our anomaly detection framework, since employing an accurate context learner can improve detection accuracy. There are two important parameters that need to be determined in order to achieve the desired context learning: 1) the number of clusters, and 2) the number of symbols. Our goal is to identify these parameters automatically based on the given sequences.

We determine the clustering parameters in two steps. First, the entire training

data is transformed into a feature space and the number of clusters (k value) is

estimated. Second, the sequential data is used to estimate the minimum required

length (l value) of the sequence partition that we need for clustering. To perform this


task, the k value estimated in the previous step is used.

Next, we explain the estimation of these two parameters.

4.3.1 Number of Clusters

One important input parameter to any clustering algorithm is the number of clusters present in the data. If we do not know the number of clusters, we must estimate it and then determine the membership of the data samples in those clusters for a given dataset. There are various methods to select the best number of clusters. We can evaluate the quality of clusters using techniques such as cross-validation, information-theoretic methods, and silhouettes. One of our goals in context learning is selecting the best number of contexts automatically. To achieve this goal, we considered a number of methods, using efficiency as a key criterion for choosing among them. To perform this task, we selected an indirect sequence clustering technique. The steps of indirect sequence clustering are as follows. First, we extract vector-based features for each sequence, where the dimension of each vector is equal to the alphabet size. Second, we apply the x-means clustering algorithm [109] to estimate the number of clusters.

Feature Extraction

Conventional clustering methods are designed for feature-vector clustering. One way to cluster a sequential dataset is to transform the sequences into vectors of features. Basically, a sequence can be represented in vector form by selecting the unique symbols in the dataset as features. One of the simplest and most commonly used methods to generate a feature vector is counting the occurrences of each symbol in a sequence, which is called tf (term frequency). This class of feature vector is commonly used in


text clustering. The term frequency tf_{t,D} of term t in document D is defined as the number of times that t occurs in D, where “term” refers to “symbol” and “document” refers to “symbolic sequence”. Since we would like to work with a relative tf to address varying-length sequences, a tf vector can be normalized by its length. Dividing a vector by its length makes it a unit-length vector and prevents a bias toward longer

the symbols are considered equally important while computing tf. However, generally

rare symbols in a dataset are more informative than frequent symbols that appear

in every sequence. A more enhanced feature vector would include refined weighting

methods based on the frequency of the symbols in the entire dataset. tf-idf (term

frequency - inverse document frequency) is a common feature extraction method that

weights the symbols by their tf weight and their idf weight [4, 121]. Using tf-idf, idf

normalizes the tf score by reducing the weight of symbols which occur more frequently

in the entire dataset.

We can define the idf of t by

idf_t = log(N / df_t)    (4.1)

where N is the total number of sequences and df_t (document frequency) is the number of sequences that t appears in. df_t is a measure of the informativeness of symbol t: lower values indicate that t is rare in the dataset and yield a higher weight for t. log(N / df_t) is used instead of N / df_t to “dampen” the effect of idf.

The t-th symbol (feature) of a particular sequence D can be set to the tf-idf weight by:


tf-idf_{t,D} = tf_{t,D} × idf_t    (4.2)

In tf-idf, tf increases with the number of occurrences within a sequence, while idf increases with the rarity of the symbol across the entire dataset.
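As an illustration, the tf-idf feature extraction described above can be sketched as follows (a minimal sketch; the example sequences and the function name are ours, not from the dissertation):

```python
import math
from collections import Counter

def tfidf_vectors(sequences):
    """Length-normalized tf-idf feature vectors for symbolic sequences.

    Each sequence is a string over a finite alphabet; the feature space is
    the set of unique symbols observed across the dataset.
    """
    alphabet = sorted({sym for seq in sequences for sym in seq})
    n = len(sequences)
    # df_t: number of sequences that symbol t appears in
    df = {t: sum(1 for seq in sequences if t in seq) for t in alphabet}
    # idf_t = log(N / df_t), as in Equation 4.1
    idf = {t: math.log(n / df[t]) for t in alphabet}
    vectors = []
    for seq in sequences:
        counts = Counter(seq)
        # relative tf: normalize by sequence length to handle varying lengths
        tf = {t: counts[t] / len(seq) for t in alphabet}
        # tf-idf_{t,D} = tf_{t,D} * idf_t, as in Equation 4.2
        vectors.append([tf[t] * idf[t] for t in alphabet])
    return alphabet, vectors

alphabet, vecs = tfidf_vectors(["ABBABC", "ABABAA", "CCCBCC"])
# "B" occurs in every sequence here, so its idf (and tf-idf weight) is 0
```

A symbol that appears in every sequence receives idf = log(N/N) = 0 and so contributes nothing to distances between feature vectors, matching the intuition that ubiquitous symbols are uninformative.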

Clustering by X-means

The k-means clustering algorithm partitions a set of data samples into k clusters based on their features. This requires a predefined k value, which may not be available in many cases. Furthermore, in some cases obtaining the number of clusters k could be the only reason for running a clustering analysis. In data clustering, automatic estimation of k has been one of the most difficult problems in the field [72]. Most of the proposed solutions to this problem are based on running k-means repeatedly with different k settings and then choosing the best one according to a quality criterion, such as MML (Minimum Message Length) as applied in [53] or BIC (Bayesian Information Criterion) as applied in [55].

In our case, we are working with sequential data samples, and we do not know how

many clusters are present in the dataset. Weighted frequency distributions (tf-idf)

are generated from these sequential data samples and are used as feature vectors for

x-means clustering. The reason that we use the x-means algorithm for clustering is

that there is no need to know the number of clusters in advance.

The x-means algorithm, as proposed by Pelleg and Moore [109], is a k-means extension which efficiently finds k by optimizing a quality metric. In x-means, a kd-tree is used to identify the closest cluster centers for all the data points. The x-means algorithm repeatedly runs the k-means algorithm with k=2 to split cluster centers into regions. After each run of 2-means, a decision is made as to whether the current cluster should be split, by comparing the quality of the two structures: if the two children of a center describe the data better than the center itself according to a quality metric, the algorithm replaces the cluster center with its children. Originally, x-means applies BIC as a quality metric, though other scoring criteria could also be applied. Experiments show that the x-means algorithm finds the natural clusters accurately and is faster than repeatedly applying accelerated k-means for different values of k [109].
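The split-or-not decision in x-means hinges on comparing BIC scores of competing k-means models. The following sketch (ours, not the dissertation's implementation; one-dimensional data for brevity) shows how the BIC score from the x-means paper can select k for plain k-means:

```python
import math
import random

def kmeans_1d(points, k, iters=100, seed=0):
    """Plain Lloyd's k-means for 1-D points; returns centers and clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

def bic_score(centers, clusters):
    """BIC of a k-means solution under the identical spherical-Gaussian
    model assumed by x-means (here dimension d = 1)."""
    n = sum(len(c) for c in clusters)
    k = len(centers)
    rss = sum((p - centers[i]) ** 2
              for i, c in enumerate(clusters) for p in c)
    sigma2 = rss / max(n - k, 1)
    if sigma2 <= 0:           # degenerate fit; reject
        return float("-inf")
    loglik = sum(len(c) * math.log(len(c) / n) for c in clusters if c)
    loglik -= 0.5 * n * math.log(2 * math.pi * sigma2) + 0.5 * (n - k)
    p_free = 2 * k            # (k-1) mixing weights + k centers + variance
    return loglik - 0.5 * p_free * math.log(n)

# Two well-separated 1-D clusters: BIC should prefer k = 2.
rng = random.Random(1)
data = ([rng.gauss(0.0, 0.5) for _ in range(50)]
        + [rng.gauss(10.0, 0.5) for _ in range(50)])
best_k = max(range(1, 5), key=lambda k: bic_score(*kmeans_1d(data, k)))
```

x-means applies this comparison locally, to each parent center and its two 2-means children, which is faster than re-running k-means globally for every candidate k.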

4.3.2 Required Length (Number of Symbols)

Our clustering approach performs context learning from the first l symbols of each sequence, which allows us to work on incomplete sequences. We want to determine the best number of symbols (l) subject to two requirements: 1) l has to be long enough to detect the similarity/dissimilarity between sequences, and 2) l has to be short enough to recognize the context in near real time.

To perform this task, we applied sequence similarity-based clustering techniques. First, we used two sequence similarity measures to compute distance matrices between data samples while varying l. Second, we used the k-medoids clustering algorithm to generate the clusterings, using the previously estimated number of clusters. Finally, we evaluated the goodness of the clustering results to select the best l value.

Similarity Measures

In this thesis we have experimented with two different measures to compute distances between sequence partitions: 1) Hamming distance and 2) Longest Common Substring (LCSub).


Hamming distance is a metric based on counting the number of positions at which

the corresponding symbols are different in two sequences of equal length. Basically,

the Hamming distance is a kind of edit distance which measures the minimum number

of substitutions required to change one string into the other. The Hamming distance

between the sequences i and j can be computed by:

d_HM(i, j) = Σ_{x=1}^{l} δ(i_x, j_x)    (4.3)

where l is the length of the sequences, δ(i_x, j_x) = 0 if i_x = j_x, and δ(i_x, j_x) = 1 if i_x ≠ j_x. For example, the Hamming distance between the two sequences “ABBABC” and “ABABAA” is 4, since they differ in 4 positions.
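Equation 4.3 translates directly into code (a small sketch; the function name is ours):

```python
def hamming_distance(s1, s2):
    """Count positions where two equal-length sequences differ (Eq. 4.3)."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(1 for a, b in zip(s1, s2) if a != b)

# The example from the text: "ABBABC" and "ABABAA" differ in 4 positions.
assert hamming_distance("ABBABC", "ABABAA") == 4
```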

Longest Common Substring is a similarity measure that finds the longest string which is a substring of two or more strings. Note that the Longest Common Substring is a different measure than the Longest Common Subsequence: the former requires the common symbols to be consecutive in both sequences, while the latter does not. For example, the Longest Common Subsequence of the sequences “ABBABC” and “ABABAA” is the string “ABAB” of length 4, and their Longest Common Substring is the string “BAB” of length 3. We compute a distance based on the Longest Common Substring between two sequences i and j as follows:

dLCSt(i, j) = l − length(LCSt(i, j))    (4.4)

where l is the length of the sequences. Since the Longest Common Substring is a similarity measure, we derive a dissimilarity by subtracting its length from l.

We apply both the Hamming distance and the Longest Common Substring measures to compute a distance matrix. Since we extract a distance matrix for each l value in a predefined range, we normalize the distance values to eliminate the


impact of the length.

d(i, j) = (dHM(i, j) + dLCSt(i, j)) / (2l)    (4.5)

For example, the distance between the sequences “ABBABC” and “ABABAA” is (4 + 3)/12 ≈ 0.58.
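Assuming the standard dynamic program for the substring length, Eqs. 4.3-4.5 can be sketched together (function names are illustrative, not from the thesis):

```python
def lcsub_length(s, t):
    """Length of the Longest Common Substring (contiguous run) of s and t,
    via the standard O(len(s)*len(t)) dynamic program."""
    best = 0
    dp = [0] * (len(t) + 1)  # dp[j]: length of the common run ending at t[j-1]
    for a in s:
        prev = 0
        for j, b in enumerate(t, start=1):
            cur = dp[j]
            dp[j] = prev + 1 if a == b else 0
            best = max(best, dp[j])
            prev = cur
    return best

def combined_distance(s, t):
    """Normalized combination of Hamming and LCSub distances (Eqs. 4.3-4.5)."""
    l = len(s)
    d_hm = sum(1 for a, b in zip(s, t) if a != b)  # Eq. 4.3
    d_lcst = l - lcsub_length(s, t)                # Eq. 4.4
    return (d_hm + d_lcst) / (2 * l)               # Eq. 4.5

print(round(combined_distance("ABBABC", "ABABAA"), 2))  # 0.58
```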

Clustering by K-medoids

Originally, the k-means algorithm was designed for clustering data vectors that can be represented in a Euclidean space. Each cluster is centered about a center point (centroid), which is the mean of the coordinates of the data points in the cluster. There are two main requirements of k-means which make it inappropriate for clustering sequential data: 1) the k-means algorithm requires that the distances between the data points and the cluster centroids be recalculated at each iteration, and 2) the k-means algorithm requires vector data, since calculating the distances from centroids involves vector operations.

In our case, we have sequential data points, where the distance values between data points are not derived from a Euclidean space. K-medoids is a variant of the k-means algorithm in which the center point (medoid) of each cluster is chosen from the data points themselves, instead of being computed as a mean. The medoids are the data points that minimize the total within-cluster distance (4.6):

Σ_{j ∈ Ci} d(i, j)    (4.6)

where Ci is the cluster containing data point i, and d(i, j) is the distance between

i and j.


There are two main advantages of k-medoids which make it quite appropriate for clustering sequential data. First, it does not require repeated distance calculations at each iteration, since the medoids are actual data points. Second, it does not require vector data to compute dissimilarities, since it operates on a distance matrix precomputed over the data points. The steps of the k-medoids algorithm [116] are as follows:

1. Select k samples at random to be the initial cluster medoids.

2. Assign each sample in the dataset to the cluster of the closest medoid.

3. Update the set of medoids by minimizing Eq. 4.6 within each cluster.

4. Repeat steps 2 and 3 until the medoids become fixed.
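The four steps can be sketched as a minimal implementation over a precomputed n × n distance matrix (illustrative only; ties and degenerate clusters are ignored for brevity):

```python
import random

def k_medoids(dist, k, max_iter=100, seed=0):
    """Minimal k-medoids sketch on a precomputed distance matrix `dist`."""
    n = len(dist)
    rng = random.Random(seed)
    medoids = rng.sample(range(n), k)          # step 1: random initial medoids
    clusters = {}
    for _ in range(max_iter):
        # step 2: assign each sample to the cluster of its closest medoid
        clusters = {m: [] for m in medoids}
        for i in range(n):
            closest = min(medoids, key=lambda m: dist[i][m])
            clusters[closest].append(i)
        # step 3: pick the member minimizing the total within-cluster distance (Eq. 4.6)
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):   # step 4: stop when medoids are fixed
            break
        medoids = new_medoids
    return medoids, clusters

# Toy distance matrix with two obvious groups {0, 1} and {2, 3}
D = [[0, 1, 9, 9],
     [1, 0, 9, 9],
     [9, 9, 0, 1],
     [9, 9, 1, 0]]
medoids, clusters = k_medoids(D, k=2)
```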

We used k-medoids clustering to train a context classifier which is used to classify

new data samples based on the cluster centers.

Cluster Quality

There are two basic measurements that can be used to evaluate the quality of a clustering result: 1) compactness (tightness), which measures the quality of each individual cluster, and 2) separation, which measures the quality of the distances between clusters. The silhouette is a graphical representation proposed for partition-based clustering techniques; it captures both clustering tightness and separation [119].

The silhouette value can be used to determine which data samples lie well within

their cluster and which do not [76]. The average silhouette value provides a metric

for judging clustering validity and quality.

The computation of a silhouette value for a data sample depends on two measures:

1) a(i): the average dissimilarity of i to all other data samples within the same cluster,


and 2) b(i): the smallest average dissimilarity of i to all data samples in any other cluster. While a(i) captures how well sample i fits its assigned cluster (the smaller a(i), the better the assignment), b(i) captures how dissimilar sample i is from the other clusters. The other cluster with the smallest average dissimilarity is called the neighbor cluster, as it is the second-best choice for sample i.

For each data sample i, the silhouette value can be evaluated as follows:

silhouette(i) = 1 − a(i)/b(i)   if a(i) < b(i),
                0               if a(i) = b(i),
                b(i)/a(i) − 1   if a(i) > b(i).    (4.7)

A silhouette(i) value lies in the interval [−1, 1]. A value close to 1 indicates that the data sample is properly clustered. A value close to −1 indicates that the data sample is poorly clustered and perhaps should be moved to the neighbor cluster. A value close to 0 indicates that the data sample lies on the border between the current and the neighbor cluster, so it is unclear which cluster the sample belongs to.

The general formulation of computing a silhouette(i) is given by:

silhouette(i) = (b(i) − a(i)) / max{a(i), b(i)}    (4.8)

In order to compute a silhouette value to evaluate a clustering result, we only

need the distances between the data samples and the clustering results of these data

samples [119].

silhouette = (1/n) Σ_{i=1}^{n} silhouette(i)    (4.9)

The overall average silhouette is the mean of silhouette values for all data samples

in a data set (as shown in 4.9), where n is the number of data samples in the dataset.
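The silhouette computation (Eqs. 4.8 and 4.9) can be sketched directly from a distance matrix and cluster labels (an illustrative implementation):

```python
def average_silhouette(dist, labels):
    """Overall average silhouette (Eqs. 4.8-4.9) from a precomputed distance
    matrix `dist` and one cluster label per sample."""
    n = len(dist)
    total = 0.0
    for i in range(n):
        # a(i): average dissimilarity to the other samples in i's own cluster
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        a = sum(dist[i][j] for j in own) / len(own) if own else 0.0
        # b(i): smallest average dissimilarity to the samples of any other cluster
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c)
            / sum(1 for j in range(n) if labels[j] == c)
            for c in set(labels) if c != labels[i]
        )
        total += (b - a) / max(a, b)  # silhouette(i), Eq. 4.8
    return total / n                 # Eq. 4.9

# Two tight, well-separated clusters give values near 1
D = [[0, 1, 9, 9],
     [1, 0, 9, 9],
     [9, 9, 0, 1],
     [9, 9, 1, 0]]
print(average_silhouette(D, [0, 0, 1, 1]))  # 0.888...
```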


The silhouette measure can be used to determine the number of clusters by selecting

the cluster number which maximizes the overall average silhouette value [76].

We have used silhouettes to estimate the required length l of a sequence partition

for clustering. Here, l represents the length of a partition in a sequence, starting from

the first symbol in the time sequence. We compute the overall average silhouette for each clustering result obtained by varying the value of l.

4.4 Summary of Context Learning

There are two main goals of the context learning process. First, we need to determine the contexts in a training dataset to be able to train a model for each context during the training phase. Second, we need to recognize the contexts in a test dataset to be able to evaluate each sequence with the corresponding model produced during the training phase. We also want to find the context of a test sequence without requiring

the entire sequence, since we need near real-time evaluation capabilities even for very

long sequences. In order to meet this constraint, we estimate the minimum required

length (l), which is basically the number of symbols that we need to recognize the

context of a sequence.

In the context learning process, we assume that clustering feature vectors helps

to find natural classes in the dataset. We refer to these classes as contexts. In order

to extract feature vectors we apply our tf-idf method which weights the symbols

according to their importance in a given sequence and also across the entire sequence

dataset. Then, we run the x-means clustering algorithm to estimate the number of

natural clusters in a dataset.

In the context learning task, the minimum required length (l) is determined only


for test purposes, so we can recognize a context quickly. In order to estimate the

shortest length (l), starting from the first symbol in a sequence, we implement the

following steps. First, we compute the distance matrices between sequences by varying the length l. We use two string similarity/dissimilarity metrics, namely the Hamming distance and the Longest Common Substring, to find the distances. Then we apply the

k-medoids clustering algorithm on these distance matrices. Finally, the estimation of

l is performed by comparing the quality of obtained clusters using silhouettes.

The context learning presented in this chapter enhanced our framework in two

ways: 1) we apply a context-aware behavioral learning approach during training, 2)

we apply a context-aware anomaly detection approach during testing.


Chapter 5

System Call Anomaly Detection

System calls provide an interface between an application and the operating system’s kernel. Since a program frequently requests services via system calls, a trace of these system calls provides a rich profile of a program’s behavior. In this chapter, we present our approach to the system call anomaly detection problem step by step, describing the details of our anomaly-based system call IDS. The IDS consists of two phases, each with a number of steps:

1. Training, whose steps include: a.) Preprocessing to differentiate the various

contexts in the training dataset and to generate features. b.) Model learning to

build a model of normal behavior for each context by training a HMM for each

cluster.

2. Testing, whose steps include: a.) Preprocessing to classify the test sequence into one of the previously learned contexts and to generate features. b.) Anomaly scoring to identify anomalous behavior as deviations from the model of normal behavior. c.) Anomaly detection to raise an alarm, after filtering, for abnormal behavior that deviates from the norm as a possible threat.


Figure 5.1: Design of our system call anomaly detection framework.

An overview of our system call anomaly detection framework design is presented in Figure 5.1. The implementation of this framework is described in the following sections: first, the structure of the experimental dataset is detailed; second, preprocessing and feature extraction from this dataset are discussed; third, behavior learning is implemented by training HMMs; and finally, experimental results are presented.


5.1 System Call Trace Dataset

We evaluate our proposed approach on a well-known system call dataset provided by

the University of New Mexico (UNM) [136]. In the UNM benchmark dataset, each

program dataset includes several system call traces which are generated by tracing

various normal and intruded runs of a program. In this thesis, we use the sendmail

traces from the UNM dataset in our experiments. Each program trace includes system

calls associated with the corresponding process IDs, as shown in Table 5.1.

Table 5.1: A sample of the UNM program trace.

Process ID   System Call
   552            19
   552           105
   551             5
   552           104
   552           104
   551             5
   552           106
   552           105

A trace of a program execution typically includes multiple processes, as seen in Table 5.1. In this thesis, system calls are grouped together according to their PIDs

to create an ordered system call list for each process, as presented in Table 5.2.

Table 5.2: A sample of extracted process traces.

PID:551   PID:552
   5         19
   5        105
            104
            104
            106
            105
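The grouping from Table 5.1 to Table 5.2 amounts to an order-preserving group-by on the PID column; a minimal sketch (the function name is illustrative):

```python
from collections import defaultdict

def split_by_pid(trace_rows):
    """Group (pid, syscall) rows of a program trace into one ordered
    system call list per process, as in Tables 5.1 and 5.2."""
    processes = defaultdict(list)
    for pid, call in trace_rows:
        processes[pid].append(call)
    return dict(processes)

# The sample rows from Table 5.1
trace = [(552, 19), (552, 105), (551, 5), (552, 104),
         (552, 104), (551, 5), (552, 106), (552, 105)]
print(split_by_pid(trace))  # {552: [19, 105, 104, 104, 106, 105], 551: [5, 5]}
```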


We applied this PID partitioning on each trace provided in the UNM sendmail

dataset. A system call sequence for a process in the UNM sendmail dataset is shown in

Table 5.3. In this table, each number represents an index to the system calls matching

in the provided mapping file. For example, the number 1 represents the system call fork, the number 5 represents the system call close, etc. In this work, these process traces are

used for training and testing.

Table 5.3: System call sequence for PID:552.

19 105 104 104 106 105 104 104 106 105 104 104 106 54 4 5 5 40 40 4 50 5 38 1 105 104 104 106 112 19 19 105 104 104 6 6 106 78 112 105 104 104 106 78 93 101 101 100 102 105 104 104 106 93 88 112 19 128 95 1 5 95 6 6 95 5 5 5 5 5

5.2 Preprocessing

To train a model and identify anomalous traces accurately, we have found that effective preprocessing of system call traces is key. Our preprocessing

approach for the system call datasets includes four steps: 1) data partitioning, 2) data

reduction, 3) context learning and 4) feature extraction.

5.2.1 Partitioning

In Section 5.1 we presented the UNM sendmail system call trace dataset and discussed

how we partitioned these program traces into process traces. After this partitioning,

there are 346 processes in the normal execution and 25 processes in the abnormal

execution. The names of the program execution traces and the number of processes


are shown in Table 5.4. In this table, the # of processes column in the intrusion data

shows the number of abnormal processes versus the total processes in the corresponding trace. For example, 1 of 6 means that there are 6 processes in the corresponding execution and only 1 of the 6 process traces includes anomalous execution. Although there are 25 process traces in the intrusion dataset, only 13 of them are abnormal processes. Since an abnormal program execution typically also includes normal processes, in our experiments we define a program as abnormal if it includes at least one intruded process.

Table 5.4: The UNM sendmail trace dataset.

Normal                         Intrusion
File Name   # of processes     File Name     # of processes
bounce            4            sm-280           1 of 6
bounce 1          3            sm-314           1 of 6
bounce 2          7            fwd-loops-1      2
plus             26            fwd-loops-2      1
log             147            fwd-loops-3      2
queue            12            fwd-loops-4      2
daemon          147            fwd-loops-5      1 of 3
                               sm-10763         1
                               sm-10801         1
                               sm-10814         1
7 normal program traces        10 intruded program traces
346 normal process traces      13 intruded process traces

5.2.2 Reduction

In our second step of preprocessing, we remove identical process traces from the

normal dataset to avoid the possibility of using the same sequences in training and

testing. Identical process traces are identified by comparing each process trace with


the other process traces. We store only one copy of a repeated process in our dataset.

After this filtering is completed, we have 68 unique process traces in the normal

dataset and 10 unique process traces in the abnormal dataset. We then use this subset

of the traces in training and testing. Although applying our reduction pass over the

abnormal process traces is not necessary, we want to have a consistent reduction

method for all traces (since we will not know if a trace is anomalous until later in this

process).
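The reduction step can be sketched as follows; hashing each trace as a tuple is an implementation shortcut assumed here in place of the pairwise comparison described above, but it yields the same unique set:

```python
def deduplicate(process_traces):
    """Keep only the first copy of each identical process trace."""
    seen = set()
    unique = []
    for trace in process_traces:
        key = tuple(trace)  # sequences are hashable once frozen as tuples
        if key not in seen:
            seen.add(key)
            unique.append(trace)
    return unique

print(deduplicate([[5, 5], [19, 105], [5, 5]]))  # [[5, 5], [19, 105]]
```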

5.2.3 Clustering for Context Learning

Since a program trace includes executions that cover a number of normal behaviors, we analyzed the unique process traces to detect structural similarities between processes. These similarities help us to define context-aware learning for each cluster. After examining all of the unique processes, we clustered the UNM sendmail process traces into 3 sets based on their similarities. This analysis led us to train 3 HMMs, one for each cluster. The rationale behind using multiple HMMs is that we expect better learning from an HMM when its training set contains similar sequences. The clustering results for normal and abnormal processes on a PID basis are shown in Table 5.5. In the rest of the chapter, training, validation, and test data are produced by considering this clustering.

5.2.4 Feature Extraction

One of the most important tasks to perform during behavior learning is extracting

useful and efficient features to profile normal behavior. This step plays an important

role that impacts the success of the training phase. Before we get into the details of

feature extraction, we define the structure of the system call dataset:

Definition 5.1: A dataset (D) is a set of system call sequences (S) and it is


Table 5.5: Clustering results of UNM sendmail processes.

         Normal                        Abnormal
Set 1    Set 2    Set 3      Set 1    Set 2    Set 3
PIDs     PIDs     PIDs       PIDs     PIDs     PIDs
551      552      553        170      283      163
1407     1402     554        162      317      183
12376    1551     1403       182      119      207
12387    8844     1409       206
12398    1411                10765
12409    1414                10803
12420    1574                10816
12431    1577
12272    1578
12827    1582
12838    1583
12849    12378
12883    12400
12900    12411
12908    12422
8840     12433
         12807
         12829
         12840
         12851
         12902
         12910

represented as:

D = {S1, S2, S3, ..., Sm}    (5.1)

where m is the number of sequences (processes) and each Si, i = 1, 2, ..., m, is a sequence of system calls. In this example, D represents all unique process traces extracted from the sendmail dataset.

Definition 5.2: A system call sequence (S) for a process shown in Table 5.3 can

be represented as:


S = {s1, s2, s3..., sn} (5.2)

where n is the number of system calls in S and each si ∈ S, i = 1, 2, ..., n, is a system call.

In this thesis, we generate fixed-length subsequences (features) by sliding a window

across the process traces and recording each unique subsequence. If we slide a window

of size k over the sequence S given in Equation 5.2, then we generate the following

subsequences:

{s1, s2, s3, ..., sk}
{s2, s3, s4, ..., sk+1}
{s3, s4, s5, ..., sk+2}
...
{sn−(k−1), sn−(k−2), ..., sn}
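This extraction can be sketched as follows (an illustrative implementation; the thesis uses k = 6 with a step increment of 1):

```python
def sliding_subsequences(seq, k=6):
    """Unique subsequences from a sliding window of size k, step 1.
    At most n - (k - 1) windows exist; duplicates are dropped while
    preserving first-occurrence order."""
    seen = set()
    out = []
    for i in range(len(seq) - k + 1):
        window = tuple(seq[i:i + k])
        if window not in seen:
            seen.add(window)
            out.append(window)
    return out

# First few calls of the PID:552 trace from Table 5.3
subs = sliding_subsequences([19, 105, 104, 104, 106, 105, 104, 104, 106], k=6)
```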

If all of the generated subsequences are unique, there are at most n − (k − 1) subsequences as a result. One of the most important parameters of using fixed-length

pattern extraction is deciding the appropriate window size k. Researchers have used

different window sizes in previous work to find the optimal k. Many studies found that the minimum required window size is 6 for system call anomaly detection on the

UNM dataset [146, 50, 87, 130]. Eskin et al. [50] proposed that the optimal window

size is different for each program in the UNM dataset. They compute the conditional

entropy of each program trace for different window sizes and claimed that measuring

the regularity of data can be used to pick the window size. Lee and Xiang [87] also

proposed a similar approach that used conditional relative entropy to estimate the

window size. An alternative approach suggests that conditional entropy may not be

a universal selection metric and supported this position by measuring the regularity


of completely random data [130].

As discussed earlier, an intrusion trace may include normal process traces in addition to the abnormal trace(s). In the UNM dataset, one of the abnormal traces

of the sendmail program (sm-280.int), which performs a decode attack, includes six

different processes. Only one of them (PID:283) is considered as abnormal, since the

other process traces also appear in the normal process trace set. A detailed analysis

of the sendmail program traces shows that the suspect process trace PID:283 in the sm-280.int file is not detectable if a window shorter than 6 is used [131]. Therefore,

using a k value smaller than 6 on process PID:283 produces subsequences present in

the normal process traces. In other words, a narrow window produces information

loss during the subsequence production process.

In the system call analysis described in this thesis, features (subsequences) are

extracted by using a sliding window length 6 using a step increment of 1. To generate

the normal and the test data, we extract unique subsequences for each process trace

in the program traces. In our experiments, normal data refers to the subsequences generated from all normal process traces, and abnormal data refers to the subsequences

generated from abnormal process traces. For each cluster, normal data is randomly

partitioned into two subsequence sets: 50% for the training set and 50% for the validation/test set. Abnormal data is also added to the test sets for each cluster by

considering the process trace where it is generated. For the evaluation of the trained

model on each cluster, extracted subsequences are considered individually. To perform anomaly detection on a process trace, the subsequences that are extracted from

the corresponding process trace are used during the evaluation.


5.3 Behavior Learning - Training

In this experiment, half of the subsequences extracted from the normal traces are

used in training. We applied 10-fold cross-validation to train and evaluate the model; therefore, only 45% of the normal dataset is used in the training phase for each fold.

In the training phase, we defined the dimensions and the observation sequences of the HMM to estimate the model parameters λ = {A, B, π}. The number of observation symbols M is equal to the number of unique system calls in the dataset.

In the sendmail dataset, there are 53 unique system calls across the normal and abnormal traces. Subsequences generated via a sliding window are used as observation sequences in the HMM model. We set the length T of the observation sequence equal to the sliding window length k. While choosing the observation sequence

length, we considered the benchmark dataset which needs at least k ≥ 6, as discussed

in Section 5.2.4. We limit the length to 6 since learning with fewer parameters results

in faster training.

When training a model with a HMM, it is possible to increase the likelihood value

by increasing the number of parameters in the model. But increasing the number

of parameters can lead to overfitting. To address this issue, we applied Bayesian

Information Criterion (BIC) [124] to select the number of hidden states in our model.

BIC introduces a penalty term for the number of parameters, while computing a

criterion score based on the maximized likelihood. Equation 5.3 provides the details

on how we compute BIC:

BIC = −2 ln(L) + p ln(n) (5.3)

where L is the maximum likelihood, p is the number of free parameters and n is

the number of data points. We experimented by varying the number of hidden states


N from 10 to 100. Normalized BIC results for this range of hidden states are shown

in Figure 5.2. Since we prefer to use the simplest model that best fits the data, lower

BIC scores identify the candidate numbers of the hidden states to be chosen for the

model. The lowest BIC values for each fold are found to vary between 40 and 60

hidden states. By considering this range of BIC values, we selected N=53, which is

also equal to the number of unique system calls in UNM sendmail traces.
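The BIC score of Eq. 5.3 translates directly into code. In the sketch below, the free-parameter count for a discrete HMM ((N − 1) initial probabilities, N(N − 1) transition probabilities, and N(M − 1) emission probabilities, since each row sums to 1) is a standard convention assumed here, not stated in the text:

```python
import math

def bic(log_likelihood, p, n):
    """BIC = -2 ln(L) + p ln(n) (Eq. 5.3); lower scores are preferred."""
    return -2.0 * log_likelihood + p * math.log(n)

def hmm_free_params(N, M):
    """Free parameters of a discrete HMM with N hidden states and M symbols."""
    return (N - 1) + N * (N - 1) + N * (M - 1)

# Scoring one candidate model size (M = 53 symbols, as in the sendmail data)
score = bic(log_likelihood=-100.0, p=hmm_free_params(10, 53), n=1000)
```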

Figure 5.2: BIC values for various numbers of hidden states.

5.4 Test and Evaluation

In the testing phase, we followed our clustering approach and subsequence extraction method on the process traces. Each set is tested with the corresponding normal test data extracted from normal process traces and the abnormal data extracted from abnormal process traces. Results of the testing process are shown using ROC analysis: Figure 5.3 is for the subsequences of Set 1, Figure 5.4 is for the subsequences of Set 2, and


Figure 5.5 is for the subsequences of Set 3.

Figure 5.3: ROC curve for Set 1

We present the validation test results reporting Area Under the Curve (AUC)

values for each fold of the trained model in our ROC analysis. Our validation set

includes 50% of all of the normal subsequences and all abnormal subsequences for the

corresponding set of process traces. Results show high detection rates with very low

FPRs for the short sequences.

In Figure 5.6, we have presented the ROC results when using only one HMM for

training, and without considering the clusters. Although the test results are able to

detect most of the abnormal subsequences, we achieved better detection rates via

clustering and training a HMM for each cluster.


Figure 5.4: ROC curve for Set 2

5.5 Anomaly Detection

Test results on the subsequences of the process traces show that our approach of clustering and training multiple HMMs is successful at detecting abnormal subsequences, with high AUC values for each set. This also shows that the LL values can be used to predict whether a subsequence is abnormal or normal. But instead of evaluating subsequences individually, an IDS needs to analyze the process trace to decide whether

it is normal or abnormal. To assign an anomaly score for a process trace, we applied

Exponentially-Weighted Moving Average (EWMA) as a filter on the log-likelihood

values of the subsequences within each process. Since EWMA was first introduced by Roberts [117], it has been widely used for detecting shifts in the mean of a sequence of discrete values. EWMA applies weights to discrete decision values in exponentially decreasing order to smooth out fluctuations; the most recent values are weighted


Figure 5.5: ROC curve for Set 3

most highly. In our case, decision values are the log-likelihood values of subsequences

in process traces. If EWMA values reach a threshold at some point of evaluation,

our system generates an alarm for the considered process trace. To compute EWMA

values for each subsequence in a process trace, we used (5.4):

EWMAt = αYt + (1− α) EWMAt−1 (5.4)

where Yt is the decision value (LL value) at time t, and α is the weight which determines the depth of memory. EWMAt is the value of the EWMA at time period t. α is a constant smoothing factor between 0 and 1, computed according to a desired filter width W as α = 2/(W + 1). If α is close to 1, the filter gives more importance to recent decision values and discounts older values faster.
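The filter of Eq. 5.4 can be sketched as follows; seeding the series with the first decision value is an assumed convention, since the text does not specify the initialization:

```python
def ewma(values, W):
    """EWMA over decision (log-likelihood) values (Eq. 5.4),
    with alpha = 2 / (W + 1) for a desired filter width W."""
    alpha = 2.0 / (W + 1)
    out = [values[0]]  # assumed seed: the first decision value
    for y in values[1:]:
        out.append(alpha * y + (1 - alpha) * out[-1])
    return out

# LL values that drop when an anomaly begins; the filtered score follows
scores = ewma([-2.0, -2.1, -9.5, -10.2], W=3)
```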

Since the EWMA filter generates an anomaly score at each point of the process


Figure 5.6: ROC curve for one HMM

trace, there is no need to wait until the entire process sequence has been evaluated to detect abnormal behavior. Figure 5.7 presents EWMA values on a normal and an abnormal

process trace. While there is some noise in the normal process, the EWMA filter

smooths the noise. Although the abnormal process behaves normally at the beginning

of the trace, it exhibits abnormal behavior after some point. In this instance, EWMA

values are high enough to detect abnormal behavior when an anomaly starts.

In the UNM sendmail dataset, there are 78 distinct process traces, and 10 of those

are considered as abnormal according to the ground truth generated by applying the

stide mechanism described earlier [66]. Figure 5.8 presents the ROC curve which is

found by varying the threshold value on the maximum EWMA filter output of each

distinct process trace.


Figure 5.7: Time-series plot for process traces.

5.6 Summary of System Call Trace Analysis

Sequential behavior modeling and prediction is only possible if the data includes some probabilistic structure. HMMs are one of the most effective dynamic behavior modeling approaches for learning the temporal relationships between system calls for anomaly-detection-based intrusion detection.

Although there has been significant prior work using HMMs to learn program behavior for anomaly detection, in this application we present a new process for preprocessing traces, leading to better anomaly detection results. To differentiate between various behaviors (contexts), we applied similarity-based clustering on the system call sequences in the benchmark dataset. This approach captures similar behavior across processes, starting from similar sequential structures. Behavioral clustering


Figure 5.8: Process-based evaluation results.

approaches can be improved by applying a more sophisticated clustering algorithm

using a time window.

In this illustrative example, we provide an on-line anomaly detection framework using a dynamic anomaly score computed at each time step with an EWMA filter. EWMA smooths the decision values, rather than using the computed log-likelihood values of short sequences directly. Applying the EWMA filter to log-likelihood

values has two advantages in our anomaly detection framework. First, it provides

lower FPRs by smoothing the decision values, and second, our detection model can

run online detection by detecting abnormal behavior at the point of penetration.

Although HMMs are fast enough for on-line anomaly detection during testing,

one of the main drawbacks of using HMMs is their significant computation time

while training. The training dataset size and the observation sequence length both


have a huge impact on training time. To address this issue, we used only 45% of

the normal data for training, and clustered those data into three sets to train an HMM

for each set of process traces. We also selected the shortest possible window length

that catches all abnormal subsequences in the UNM sendmail traces while extracting

observation sequences.

In this implementation, we have presented an illustrative example to describe

how to apply our proposed approach to system call traces for cyber security. We

considered the details of the UNM system call dataset, taking into account the various

normal behaviors in these programs. Test and detection results show that the proposed

approach provides fast and accurate anomaly detection through context-aware behavior learning.


Chapter 6

Crowd Anomaly Detection

Prior work in crowd analytics has generated a large number of research studies that

consider a wide range of analysis approaches. In this chapter, we describe our ap-

proach for anomaly detection in a crowd scene to demonstrate the effectiveness and

adaptability of our framework. In our implementation, we show how to generalize

a crowd analytics problem into a behavioral learning problem, while working with

sequential data.

Anomaly detection is one of the main problems in the crowd analysis domain

where a great deal of prior work has been presented. There are two main categories

of behavior learning: 1) object-based (detection-based), and 2) holistic-based. In this

thesis, we explore detection-based methods to extract features, since they provide

more accurate information about the targets in a scene, especially when the crowd

density is low. We have applied our sequence anomaly detection approach to identify

collective behaviors of pedestrians in a crowd.

An overview of our crowd anomaly detection framework design is presented in

Figure 6.1. Implementation of this framework is described next. First, the structure


Figure 6.1: Design of our crowd anomaly detection framework.

of the experimental video dataset is detailed. Second, the preprocessing and feature

extraction process is discussed. Third, a behavioral learning method is implemented

to train the HMMs. Finally, experimental results are presented.

6.1 Event Recognition Video Dataset

In video analysis, event recognition is defined as the process of monitoring and ana-

lyzing the events that occur in a video surveillance system in order to detect signs


of security problems [71]. It is important to mention that our research is not trying

to detect abnormal objects (e.g., abnormal object size, speed or direction) [95, 15].

Instead, we aim to define behavioral properties of people and recognize the events in

the video to detect abnormal events, such as a sudden dispersion, running or merging

of a large number of people [165, 160].

We presented a background study on crowd analysis in Section 2.2 and discussed

related work on behavioral learning for anomaly detection in a crowd scene in Sec-

tion 3.2. There is a large amount of research on feature extraction in the area of

computer vision. Since our goal is not to develop new feature extraction methods,

we have selected well-known techniques to extract features from a video dataset.

To drive evaluation of our implementation, we have identified benchmark datasets

available for this purpose. The characteristics of our targeted video are presented

in Table 2.2 (indicated in italics). The benchmark dataset used in our experi-

ments is the PETS09 (Performance Evaluation of Tracking and Surveillance 2009) crowd

dataset [52, 111], which was recorded for the workshop at Whiteknights Campus,

University of Reading, UK. PETS09 is a publicly available benchmark video dataset

which is generated to address three different crowd analysis problems: 1) Dataset S1:

person count and density estimation, 2) Dataset S2: people tracking, 3) Dataset S3:

flow analysis and event recognition [49, 150]. The event recognition dataset (Dataset-

S3: High Level) contains four video records with timestamps 14-16, 14-27, 14-31 and

14-33. These video records contain one or more of the following set of events [52]:

• Walking: Moving at a “typical” walking pace.

• Running: Moving at a “typical” running pace.

• Evacuation: Rapid dispersion, multiple divergent flows.


• Local Dispersion: Multiple, localized, divergent flows.

• Crowd Formation-Gathering/Merging: Convergence of multiple flows.

• Crowd Dispersal-Splitting: Multiple divergent flows.

Two example frames from the event recognition videos are shown in Figure 6.2. There

is a running event in both samples. The video specifications are as follows: number of

frames (Nf): 1076, resolution (R): 768x576, frames per second (FPS): 7.

Figure 6.2: Dataset-S3 Events, frames 50 and 150 (left-to-right) [52].

In this illustrative example, we use view-1 video records as our single camera view.

If there are additional video sequences in a given video record, we append A, B

or C to the corresponding video record name. The sequence names, time

intervals and their lengths are provided in Table 6.1.

Since the ground truth of events is not provided for the event recognition dataset,

we composed it manually. In order to ensure accuracy of our ground truth, we also

compared it with the results reported in prior papers that utilized the dataset [57, 26,

133]. In Table 6.2, we present the ground truth for the events in these video sequences,

identifying time intervals using frame numbers. In this table, columns represent the

video sequences and rows represent the events. We also defined an additional event,


which we label as loitering, since there are sequences of frames which include

neither running nor walking events.

Table 6.1: Video sequences in the PETS09 Dataset-S3:High Level.

Sequences   Time Intervals    Length
14-16 A     14-16 [0 107]     108
14-16 B     14-16 [108 198]   91
14-16 C     14-16 [199 222]   24
14-27 A     14-27 [0 184]     185
14-27 B     14-27 [185 333]   149
14-31       14-31 [0 130]     131
14-33 A     14-33 [0 310]     311
14-33 B     14-33 [311 377]   68

Table 6.2: Time intervals of the events in the PETS09 Dataset-S3:High Level.

EVENTS      14-16 A    14-16 B    14-16 C    14-27 A            14-27 B              14-31     14-33 A    14-33 B
Loitering   [-]        [-]        [-]        [0-94] [131-184]   [185-271] [301-333]  [-]       [181-310]  [311-337]
Walking     [0-39]     [108-174]  [-]        [95-130]           [271-300]            [0-130]   [0-180]    [-]
Running     [40-107]   [175-198]  [199-222]  [-]                [-]                  [-]       [-]        [338-377]
Evacuation  [-]        [-]        [-]        [-]                [-]                  [-]       [-]        [338-377]
Dispersion  [-]        [-]        [-]        [95-130]           [271-300]            [-]       [-]        [-]
Merging     [-]        [-]        [-]        [-]                [-]                  [-]       [0-180]    [-]
Splitting   [-]        [-]        [-]        [-]                [-]                  [60-130]  [-]        [-]

We detect and track people in the PETS09 Dataset-S3:High Level benchmark

videos in order to extract multi-object features for each frame. Next, we explain the

feature extraction processes in more detail.

6.2 Preprocessing

We performed a set of data preprocessing steps on each video dataset to prepare it for

training and testing tasks in our crowd anomaly detection framework. These steps


include: a.) feature extraction from video, b.) context learning and c.) windowing.

Next, we discuss each of these preprocessing steps in detail.

6.2.1 Feature Extraction from a Video

Feature extraction refers to a set of techniques that are applied to extract symbolic

discrete sequences from a video (frame sequence). These techniques include both

computer vision algorithms used to extract features, and machine learning algorithms

to classify the data in the form of a symbolic sequence.

In Figure 6.3, we present a generic view of a video sequence. Video, which is

a sequence of image frames, is a rich source of information. In order to extract

information from a frame and use it to drive our experiments, we need to apply

computer vision techniques. In our framework, we extract multi-object based features

for each frame in a video stream by tracking individuals in a crowd.

Object tracking is one of the most heavily researched areas in computer vision.

Although tracking is a research subject by itself, it is also used in a number of other

areas of research, including object recognition, motion recognition, traffic monitoring,

and others. In our work we adopt a tracking by detection methodology [154] to track

multiple people in a video sequence. Then, we extract histogram features for each

frame and prepare these features to use in training. In our framework, we break fea-

ture extraction into four steps: 1) detection, 2) tracking, 3) projective transformation

and 4) feature histogram generation.

Object Detection

In order to detect an object, the most desirable property of a visual feature is its

uniqueness, especially if we want to distinguish objects using a feature space. For


Figure 6.3: Feature extraction from a video sequence.

this purpose we compute motion, size and direction features of the objects in every

frame.

We utilize blob detection by first computing a background subtraction. In our

framework, the first step is to discriminate between moving foreground objects and

the background. The background can be determined by computing the distribution

of the typical values each pixel takes, and the foreground can be detected by iden-

tifying pixel values that are sufficiently different from the background distribution.

Large connected foreground regions form blobs and can be considered as detected

objects [154].
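The per-pixel background model described above can be sketched as follows. The deviation threshold `k` is an assumption, since the text only requires pixels to be "sufficiently different" from the background distribution.

```python
import numpy as np

def foreground_mask(frames, current, k=3.0):
    """Per-pixel background model: mean and standard deviation over a
    stack of reference frames. A pixel in the current frame is marked
    as foreground when it deviates from the background distribution by
    more than k standard deviations (k is an assumed threshold).
    Large connected foreground regions would then form blobs."""
    stack = np.stack(frames).astype(float)
    mu = stack.mean(axis=0)
    sigma = stack.std(axis=0) + 1e-6  # avoid division/comparison against zero
    return np.abs(current.astype(float) - mu) > k * sigma
```

Grouping the resulting mask into connected components (e.g., with a flood fill) yields the detected blobs.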


Object Tracking

Tracking can be defined as an assignment process to connect and label a sequence

of objects across multiple image frames in a video. We use position, velocity and

size features of the objects in the video frame, and record object state information.

To perform tracking, we predict the next state of an object by using its current and

previous states via Kalman filtering [128, 37].
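A minimal sketch of the predict/update cycle used for tracking, assuming a constant-velocity motion model with position-only measurements. The noise covariances `Q` and `R` are illustrative values, not taken from the dissertation.

```python
import numpy as np

# Constant-velocity model for one tracked object; Q and R are assumed values.
F = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], float)   # state transition: position += velocity
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], float)   # we observe position only
Q = np.eye(4) * 0.01                  # process noise (assumed)
R = np.eye(2) * 1.0                   # measurement noise (assumed)

def kalman_step(x, P, z):
    """One predict/update cycle: x = [px, py, vx, vy], z = measured position."""
    # predict the next state from the current one
    x = F @ x
    P = F @ P @ F.T + Q
    # correct the prediction with the measurement
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)    # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P
```

The predicted state is what associates a detection in the next frame with an existing track.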

Projective transformation - Homography

The tracking information we extract is not a top-down view, since we are using a

single fixed camera view and the camera views the surveillance area from an angle.

Using this data directly could be problematic when computing distance and speed

features. In order to accurately identify an object's position, we applied a projective

transformation technique to the tracked points [106]. A projective transformation,

which is also called homography, is a geometric transformation between two image

planes. Figure 6.4 presents the top-down view of the area and the camera position.

Figure 6.4: Top-down view of the surveillance area and camera position.


For each tracked point in each frame, we applied our projective transfor-

mation and used the resulting transformed tracking information to extract feature

histograms. A reference image frame from the PETS09 event recognition dataset is

shown in Figure 6.5 (a) and its transformation is shown in Figure 6.5 (b).

(a) Original image frame. (b) Transformed image frame.

Figure 6.5: Projective transformation (homography) result for an original image frame from the PETS09 event recognition dataset.
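Applying a homography to tracked points can be sketched as follows. Estimating the 3x3 matrix `H` itself (e.g., from ground-plane correspondences) is outside this snippet; here `H` is an input.

```python
import numpy as np

def apply_homography(H, points):
    """Map image-plane points to the top-down plane with a 3x3 homography H.

    points: (N, 2) array of pixel coordinates. Each point is lifted to
    homogeneous coordinates [x, y, 1], multiplied by H, and divided by
    the resulting w component."""
    pts = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    mapped = pts @ H.T
    return mapped[:, :2] / mapped[:, 2:3]                 # perspective divide
```

The transformed coordinates are then the ones used for distance and speed computation.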

Histogram Extraction

Although we can detect and track individuals, our aim is to analyze the collective be-

haviors of people from a sequence of frames. In this application, we extracted three

types of features, which are distance, velocity and direction, for each frame, based on

the extracted multi-object features. We generate three feature histograms from the

extracted features:

• Distance-based features: The position features generated by object tracking are

considered as nodes in a graph for each frame, and the distance between each


pair is computed to extract distance-based histograms for each frame. The

histogram is normalized by the number of distance values.

• Velocity-based features: For each object, the positional distance change is com-

puted using the current and the previous position in order to extract velocity-

based histograms, computed in terms of pixels. The histogram is divided into

5 equal bins (1) very slow, 2) slow, 3) normal, 4) fast, and 5) very fast), cap-

turing the minimum and the maximum speed of people in the video dataset.

Then, the histogram is normalized by the number of people.

• Direction-based features: While computing the velocity of each object, the di-

rection of the position change is also extracted to generate direction-based his-

tograms for each frame. Direction of an object is determined to be in one of 8

bins in the range [0, 2π). A histogram is computed for each frame that includes

the relative number of people in each of the 8 bins.
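As a sketch of the direction-based histogram described above: 8 bins over [0, 2π), normalized by the number of people, with each person's direction taken from their per-frame displacement.

```python
import numpy as np

def direction_histogram(displacements, n_bins=8):
    """Normalized direction histogram for one frame.

    displacements: (N, 2) array of per-person position changes
    (current position minus previous position)."""
    # angle of each displacement, wrapped into [0, 2*pi)
    angles = np.arctan2(displacements[:, 1], displacements[:, 0]) % (2 * np.pi)
    # assign each angle to one of n_bins equal-width bins
    bins = np.floor(angles / (2 * np.pi / n_bins)).astype(int)
    hist = np.bincount(bins, minlength=n_bins).astype(float)
    return hist / len(displacements)  # relative number of people per bin
```

The distance and velocity histograms are built the same way, only with pairwise distances and per-object speeds as the binned quantity.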

The reason for extracting multiple types of features is to have comparable event

types for a range of applications. For example, while running or walking events can

be learned via velocity-based features, splitting, local dispersion and merging events

can only be learned if we include distance and/or direction-based features. Since the

defined events are not correlated to the number of people in the scene, we generated

normalized feature histograms to avoid any bias depending on the number of people

in a scene.

Symbolic Representation of Histograms

Our work aims to analyze the behavior of people using a sequence of frames. These

behaviors need to be distilled into a set of symbols. By generating feature histograms we


can convert the video data to a sequence of feature histograms. Instead of analyzing

feature histograms directly, we need to cluster these histograms to create a common

representation (symbol) for each similar frame. By representing each frame with a

symbol from a finite set of symbols, video analysis for the event recognition problem

becomes a problem of event recognition in a symbolic/discrete sequence. In order

to create such symbols, we used the x-means clustering algorithm with Euclidean

distances.
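The symbolization step can be sketched with plain k-means as a simplified stand-in: x-means additionally selects the number of clusters automatically, whereas here `k` is fixed as an assumption.

```python
import numpy as np

def kmeans_symbols(histograms, k, iters=20, seed=0):
    """Assign each frame histogram a symbol in 0..k-1 using k-means with
    Euclidean distance. This is a stand-in for x-means, which would also
    choose k; here k is fixed as a simplifying assumption."""
    X = np.asarray(histograms, float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(iters):
        # distance of every histogram to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned histograms
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```

The returned labels are the per-frame symbols, so each video becomes a discrete sequence over a finite alphabet.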

6.2.2 Clustering for Context Learning

The PETS09 dataset was not originally designed to be used for testing anomaly

detection algorithms, but instead designed to test recognition of events. In order to

detect abnormal activities, we first need to define the anomalies that we are looking

for. We defined an abnormal event as a behavioral change relative to a learned

behavior, such as sudden changes in the velocity or direction of people. The problem we

address is detecting these kinds of anomalies from a sequence of symbols.

As presented in Table 6.1, there are only 8 sequences (video records) in the entire

dataset. We cannot use these sequences directly for training or testing, for two reasons:

1) the number of sequences (8) is not enough to work with; and 2) the training dataset needs

to include only normal sequences, while these 8 sequences include abnormal time

intervals (event changes). Therefore, before starting context learning, we first need

to identify training and test sequences.

In previous work, researchers looked for various kinds of anomalies and applied

different approaches while selecting training and test sequences. For example, anoma-

lies can be detected in each sequence by training with a normal partition of the same


sequence [140]. The authors define normal partitions by manually selecting time in-

tervals that contain only normal events using their definition of an anomaly. We

follow a similar methodology, but instead of training and testing each sequence indi-

vidually, we partition the sequences to generate our training dataset. Our partitions

include multiple normal and abnormal behaviors. Partitioning is performed by di-

viding the sequences into non-overlapping smaller video sequences. In Table 6.3, we

present these smaller video partitions by assigning a sequence ID, and any partition

that includes an event change is labeled as abnormal. Only the partitions void of any

behavioral changes are considered as normal. Following this partitioning approach,

we obtain 33 sequences which include 24 normal and 9 abnormal sequences.
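The labeling rule above can be sketched as follows, assuming the event-change boundaries are taken from the ground truth intervals:

```python
def label_partition(start, end, event_boundaries):
    """A partition [start, end] is abnormal when an event-change boundary
    falls inside it; otherwise it is normal.

    event_boundaries: frame numbers at which one event ends and the next
    begins, derived from the ground-truth intervals."""
    crosses = any(start < b <= end for b in event_boundaries)
    return "abnormal" if crosses else "normal"
```

For instance, with the 14-16 A sequence, where running starts around frame 40, the partition [31-60] spans the change and is labeled abnormal, while [0-30] and [61-90] are normal.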

In our video analytics application, although the contexts can be identified manu-

ally by considering predefined events, we followed our context learning approach by

assuming we are given a set of symbolic sequences which we know are normal. As

discussed earlier, predefined events in the PETS09 dataset can be detected through

the type of features we have extracted. Since we are looking for changes in behavior to

detect them as anomalous, we want to first learn normal behaviors. Next we describe

our context learning process for each feature type.

• Velocity-based context learning: The symbolic sequences generated through the

velocity-based feature histograms can be used to learn normal behaviors for

loitering, walking and running events. While partitioning the velocity-based

symbolic sequences to generate a training dataset, we restrict our attention to

normal sequences for each of these behaviors. It is important to note that all of

these behaviors by themselves are normal. For example, while individuals that

are running do not constitute an abnormal event, this sequence is abnormal if

people suddenly start running after loitering or walking events.


Table 6.3: Non-overlapping sequence partitions generated from the PETS09 Dataset-S3: High Level.

Seq. ID   Sequence Partitions   Label
S1        14-16 A [0-30]        normal
S2        14-16 A [31-60]       abnormal
S3        14-16 A [61-90]       normal
S4        14-16 A [91-107]      normal
S5        14-16 B [108-140]     normal
S6        14-16 B [141-180]     abnormal
S7        14-16 B [181-198]     normal
S8        14-16 C [199-222]     normal
S9        14-27 A [0-40]        normal
S10       14-27 A [41-70]       normal
S11       14-27 A [71-110]      abnormal
S12       14-27 A [111-140]     abnormal
S13       14-27 A [141-184]     normal
S14       14-27 B [185-210]     normal
S15       14-27 B [211-250]     normal
S16       14-27 B [251-280]     abnormal
S17       14-27 B [281-310]     abnormal
S18       14-27 B [311-333]     normal
S19       14-31 [0-25]          normal
S20       14-31 [26-50]         normal
S21       14-31 [51-90]         abnormal
S22       14-31 [91-110]        normal
S23       14-31 [111-130]       normal
S24       14-33 A [0-50]        normal
S25       14-33 A [51-90]       normal
S26       14-33 A [91-130]      normal
S27       14-33 A [131-170]     normal
S28       14-33 A [171-200]     abnormal
S29       14-33 A [201-250]     normal
S30       14-33 A [251-310]     normal
S31       14-33 B [311-330]     normal
S32       14-33 B [331-350]     abnormal
S33       14-33 B [351-377]     normal


• Distance-based context learning: The symbolic sequences generated through the

distance-based feature histograms can be used to learn normal behavior for

evacuation, dispersion, merging and splitting events. For example, during an

evacuation event, the objects start close to each other, and then the distances

between them suddenly become larger. While large distances between the

objects are not abnormal by themselves, the sequence becomes abnormal when

the distances increase suddenly.

• Direction-based context learning: The symbolic sequences generated through

the direction-based feature histograms can be used to learn normal behavior for

evacuation, dispersion, merging and splitting events. For example, in the case

of a splitting event, while the objects are moving in the same direction, they

start to diverge in different directions. Although moving in different

directions is not deemed abnormal by itself, a sequence becomes abnormal after we

observe another behavior.

When we analyze the datasets presented in Tables 6.2 and 6.3, we see that they

contain loitering, walking and running events. We also see that there are transitions

in the sequence of events. In other words, people need to be in one of these three

states, which we refer to as velocity-based events. Therefore, we can first apply the

context learning process to the velocity-based symbolic sequences that we have generated.

Details of our context learning approach are provided in Chapter 4. In order to use

these sequences during the training task, we randomly select half of the normal

sequences from Table 6.4. In this table, we only include normal sequences where each

partition contains a single type of event (i.e., walking, running or loitering) over all

frames.

In this example, since the event types and the ground truth are already defined,


Table 6.4: Normal sequence partitions for velocity-based events.

EVENTS      Sequence Partitions
Loitering   S9, S10, S13, S14, S15, S18, S29, S30, S31
Walking     S1, S5, S19, S20, S22, S23, S24, S25, S26, S27
Running     S3, S4, S7, S8, S33

it is possible to evaluate the results of context learning as velocity-based event

recognition. The applied context learning process generated three context clusters,

correctly grouping the sequences from the same event together.

We follow similar context learning steps for the distance- and direction-based symbolic

sequences. In distance-based context learning, the results are separated into three

clusters. In direction-based context learning, the results are separated into four clus-

ters. We trained an HMM for each context in a given symbolic sequence type. These

results lead us to generate three HMMs for velocity-based sequences, three HMMs for

distance-based sequences and four HMMs for direction-based sequences.

6.2.3 Windowing

Since the video sequences are not of equal length, and the number of sequences is

small and insufficient to properly train our model, we applied a windowing technique

to extract fixed length subsequences from the generated symbolic sequential data. In

order to extract subsequences, we slide a window of length k across each sequence.

While selecting a good k value, we considered the required number of frames needed to

capture event changes in the PETS09 event recognition videos. We manually selected

k = 10, since this is the minimum number of frames we needed in order to visually

differentiate between two different events in a sequence of frames.
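A sketch of the windowing step, assuming the window slides one symbol at a time:

```python
def sliding_windows(sequence, k=10):
    """Extract every length-k subsequence by sliding a window one symbol
    at a time (k = 10 as chosen for the PETS09 videos). A step size of
    one symbol per shift is an assumption; the text does not state the stride."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
```

This turns each variable-length symbolic sequence into a set of fixed-length observation sequences, which also multiplies the amount of training data.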


6.3 Behavior Learning and Anomaly Detection

In order to learn behaviors, we need to train the HMMs. Basically, HMM training is

defined as finding the best-fit model parameters. These parameters include the state

transition and observation symbol probability distributions, which are specific for a

given set of observation data.

The same training sequences used for context learning are used during HMM

training. A windowing technique is applied to these training sequences in order to

generate observation data for HMM training. For each feature type, we trained an

HMM for every context that was identified during the context learning process. For

example, we trained three HMMs for velocity-based symbolic sequences by seeding

each of them with one of the observation data sets extracted from the sequence

clusters.

In the evaluation of our crowd analysis framework, we applied 10-fold cross val-

idation during training and testing of the model. In the training phase, we defined

the dimensions and the observation sequences of the HMMs to estimate the model

parameters λ = {A,B, π}.

The subsequences generated using the sliding window are used as the obser-

vation sequences for the HMMs. We set the window length T of the observation

sequence equal to k, the width of the sliding window. The number of observation sym-

bols M and the number of hidden states N are set equal to the number of unique

symbols in each sequence type. While clustering the histograms to capture their

symbolic representations, we use a unique number of symbols for each type of feature

histogram. Therefore, we need to assign M and N values for each sequence type,

according to the size of the symbolic alphabet, as listed below:


• Velocity-based symbolic sequences: Σ = 8 symbols

• Distance-based symbolic sequences: Σ = 15 symbols

• Direction-based symbolic sequences: Σ = 12 symbols

Next, we test and evaluate our ability to detect abnormal events in the subse-

quences.

6.4 Testing and Evaluation

In the testing phase, we followed all of the preprocessing steps we applied during

training. We can summarize these steps as follows:

• First, we extract three different features for each object in a frame: i) position,

ii) direction and iii) speed. After this point, every step is performed for each

feature type separately.

• Second, features from multiple objects are used to construct normalized feature

histograms for each frame, one for each feature type. This provides us with

three histograms for each frame, and three histogram sequences for each video.

• Third, we classify the histograms to aggregate the similar feature histograms

into a single symbolic value within the symbolic alphabet which was identified

in the training phase. Then, we generate three symbolic sequences for each

sequence of frames by representing feature histograms with their corresponding

symbols.

• Fourth, we attempt to identify contexts in the test sequences by classifying them

according to the first l symbols in the sequence. We previously discussed how we


determine the required length for this classification in Chapter 4. The values

for our three sequence types are lvelocity = 6 symbols, ldistance = 12 symbols,

ldirection = 11 symbols.

• Fifth, a windowing technique is applied to generate fixed-length subsequences

from the videos, which are then used during testing. It is important to note that

we are applying this subsequence extraction process within each context, only

after the context recognition step. So, we are careful not to mix subsequences across

contexts. Instead, the subsequences generated from a sequence are tested within the same

HMM.

• Sixth, trained HMMs are used for the evaluation of subsequences (observa-

tion sequences). The evaluation of an observation sequence can be defined as

computing the probability of observing that sequence from a given HMM. This

probability is based on computing the log-likelihood of the observation sequence

by applying the forward algorithm [114]. For each subsequence in the test dataset,

these log-likelihood values are used as decision values to detect if there is an

anomaly.
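The forward-algorithm evaluation can be sketched as follows (a scaled version, so long sequences do not underflow):

```python
import numpy as np

def forward_log_likelihood(obs, A, B, pi):
    """log P(obs | lambda) for a discrete HMM via the scaled forward algorithm.

    A:  NxN state transition matrix
    B:  NxM observation symbol probability matrix
    pi: length-N initial state distribution
    obs: list of observation symbol indices."""
    alpha = pi * B[:, obs[0]]          # initialize forward variables
    log_like = np.log(alpha.sum())
    alpha /= alpha.sum()               # rescale to avoid underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate and weight by emission
        c = alpha.sum()
        log_like += np.log(c)          # accumulate the scaling factors
        alpha /= c
    return log_like
```

The resulting log-likelihood is the per-subsequence decision value that the EWMA filter then smooths.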

Next, we present results for abnormal event detection in three types of sequences.

Velocity-based Context-aware Subsequence Test

In Figure 6.6, we show the ROC curve for 10-fold cross validation, first without

considering any context. In these results, the video sequences are represented using

only the symbols generated through the velocity-based features, and only one HMM

is trained for all the sequences.


Figure 6.6: ROC curve for the velocity-based symbolic sequences, training and testing with only a single model.

In Figure 6.7 (a), (b) and (c), we show the results from the same symbolic se-

quences, but this time contexts are identified using our context learning approach.

Each subsequence is tested within the model generated from samples from the corre-

sponding context.

Direction-based Context-aware Subsequence Test

We present results for the direction-based symbolic sequences without contexts in Fig-

ure 6.8, and with contexts in Figure 6.9 (a), (b), (c) and (d).

Distance-based Subsequence Test

Anomaly detection ROC curves obtained for the distance-based symbolic sequences

are shown in Figure 6.10 when using a single model, and in Figure 6.11 (a), (b) and (c)

when using multiple normal models, one for each context.


(a) Context-1 test set. (b) Context-2 test set.

(c) Context-3 test set.

Figure 6.7: ROC curves for velocity-based symbolic sequences, training and testing on a per-context basis.


Figure 6.8: The ROC curve for direction-based symbolic sequences, training and testing using only a single model.

6.5 Anomaly Detection

Our proposed method for crowd anomaly detection depends on the evaluation of

symbolic sequences generated from three different features of the people in the crowd.

In the previous section, we presented the results generated by applying our methodology to

the generated subsequences. In this section, we present the detection results for the

video sequences given in Table 6.3. In order to evaluate a video sequence, we use the

anomaly scores of its subsequences, and apply EWMA as a filter.

A high-level design of entire crowd anomaly detection system is presented in Fig-

ure 6.12. Since there are three symbolic sequences for each short video, we evaluated

results for each of them individually, and then combined the results to a final anomaly

detection rate. We consider a sequence abnormal if it is detected as abnormal by any

of the three individual evaluations.
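The scoring-and-fusion step described above can be sketched as follows. This is a hedged illustration, not the thesis implementation: the smoothing factor `alpha`, the alarm threshold, and the per-stream thresholding rule are assumptions chosen for the example.

```python
# Illustrative sketch: smooth per-subsequence anomaly scores with an EWMA
# filter, threshold each feature stream, and flag the video sequence abnormal
# if ANY of the three streams raises an alarm (logical-OR fusion, as in the
# text). Parameter values are assumptions, not values from the thesis.

def ewma(scores, alpha=0.3):
    """Exponentially weighted moving average of a score sequence."""
    smoothed = []
    current = scores[0]
    for s in scores:
        current = alpha * s + (1 - alpha) * current
        smoothed.append(current)
    return smoothed

def stream_is_abnormal(scores, threshold, alpha=0.3):
    """A stream alarms if any smoothed score exceeds the threshold."""
    return any(s > threshold for s in ewma(scores, alpha))

def sequence_is_abnormal(velocity, distance, direction, threshold=0.7):
    """Fuse the three feature streams with a logical OR."""
    return any(stream_is_abnormal(s, threshold)
               for s in (velocity, distance, direction))
```

The EWMA step suppresses isolated score spikes, so a single noisy subsequence does not trigger an alarm on its own while a sustained run of high scores does.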


Figure 6.9: ROC curves for direction-based symbolic sequences, training and testing on a per-context basis. (a) Context-1 test set. (b) Context-2 test set. (c) Context-3 test set. (d) Context-4 test set.


Figure 6.10: ROC curve for distance-based symbolic sequences, training and testing with only a single model.

6.6 Summary of Crowd Analysis

In this thesis, we proposed a new method to detect anomalies in a crowded scene. We characterized crowd behavior by extracting three different symbolic sequences based on three feature types, targeting the predefined events in the PETS09 event recognition dataset. The features extracted were based on the velocity, distance and direction of individuals present in each frame. Since anomalies have a wide range of different characteristics, these features may be insufficient to detect all possible abnormal activities. For example, although the velocity-based symbolic sequences are very helpful for learning different crowd behaviors, our results show that in the PETS09 dataset, distance-based and direction-based features are better predictors of abnormal events. Therefore, we analyzed the video sequences by considering each of these sequences separately to achieve a final anomaly detection result. In the end, we combined the test results from all three symbolic sequences: if any of these tests detects an anomaly, we label the subsequence as abnormal.


Figure 6.11: ROC curves for distance-based symbolic sequences, training and testing on a per-context basis. (a) Context-1 test set. (b) Context-2 test set. (c) Context-3 test set.


Figure 6.12: High-level design of crowd anomaly detection framework.

A major challenge in crowd analysis is the generation of ground truth for the video dataset. Since the PETS09 event recognition dataset only includes video frames, and ground truth is provided for neither tracking nor event recognition, we had to generate the ground truth manually. The ground-truth extraction for detection and tracking of multiple objects was performed using the annotation tool presented by Dollar et al. [44, 45, 42]. This tool makes video annotation efficient by interpolating the detection points in the intermediate frames between two selected frames. Using this tool, we extracted the ground truth for tracking each object semi-manually.

In this experiment, we showed that it is possible to detect crowd anomalies from symbolic sequential representations of frames. With the illustrative example provided in this thesis, we also showed the generality of our anomaly detection framework on symbolic sequences. Our experimental results show that context learning achieves significantly better detection performance than a no-context approach for the anomaly detection task on symbolic sequences.


Chapter 7

Summary and Conclusion

This thesis has presented a new approach for improving anomaly detection on sequential data. The thesis focuses on a specific set of techniques targeting the detection of anomalous behavior in a discrete, symbolic, and sequential dataset.

In Chapter 1, we motivated and outlined the goals of this thesis. We also highlighted some of the key challenges of performing anomaly detection on symbolic sequences. In Chapter 2, we presented background information on anomaly detection algorithms and discussed the two application domains we focused on in this thesis: system call intrusion detection and crowd analytics. In Chapter 3, we surveyed related work in both application domains in terms of sequence anomaly detection, and motivated our work on context-based learning as it applies in different application domains. In Chapter 4, we described our context learning approach as applied to symbolic sequential data. In Chapter 5, we presented our system call anomaly detection framework for Host-Based Intrusion Detection. In Chapter 6, we presented our crowd anomaly detection framework for an automatic crowd surveillance system.


In this chapter, we summarize the contributions of this thesis and provide directions for future work.

7.1 Contributions of the Thesis

In our anomaly detection design, we focus on the basic research issues arising in sequential data analysis, and claim that the anomaly detection task on symbolic sequential data can be improved by first identifying multiple normal behaviors from a data source. The results of our illustrative examples indicate that sufficient information exists in symbolic sequences to model the multi-class normal behavior of a data source. The key to extracting this valuable information is to apply the proper partitioning method.

In summary, this thesis enhances the current state-of-the-art and makes key contributions in the following areas:

• Machine learning / data mining: We have devised a novel, effective, and generic context-aware anomaly detection framework for discrete sequential data by adapting sophisticated machine learning algorithms.

• Application domains: We applied our novel anomaly detection approach to two different application domains, each critical to security, one cyber and one physical in nature.

The framework presented in this thesis analyzes symbolic sequential data sets by learning normal behavior models during the training phase, and detecting abnormal events in the testing phase. We divide the training phase into two stages: context learning, and normal behavior learning for each context. In the first stage, the symbolic sequences in a dataset are clustered. In the second stage, an HMM is trained to model each cluster. As in training, the testing phase is also divided into two stages. In the first stage, sequences are classified into one of the previously identified contexts. In the second stage, the corresponding trained HMM is used to identify how likely the test sequence is to be normal.
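The two-stage train/test flow can be illustrated with a minimal, self-contained sketch. As simplifying assumptions, a first-order Markov chain with add-one smoothing stands in for the per-context HMM, and the stage-1 classifier picks the context whose model assigns the test sequence the highest likelihood; none of this is the thesis's implementation.

```python
import math
from collections import defaultdict

# Minimal sketch of the two-stage flow. A first-order Markov chain stands in
# for the HMM, and stage 1 classifies by maximum likelihood over contexts.

def train_markov(sequences):
    """Per-context model: add-one-smoothed transition log-probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    symbols = set()
    for seq in sequences:
        symbols.update(seq)
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    model = {}
    for a in symbols:
        total = sum(counts[a].values()) + len(symbols)
        model[a] = {b: math.log((counts[a][b] + 1) / total) for b in symbols}
    return model

def log_likelihood(model, seq, floor=math.log(1e-6)):
    """Unseen symbols/transitions fall back to a small floor probability."""
    return sum(model.get(a, {}).get(b, floor) for a, b in zip(seq, seq[1:]))

def score(models, seq):
    """Stage 1: pick the context; stage 2: score within it (higher = more normal)."""
    context = max(models, key=lambda c: log_likelihood(models[c], seq))
    return context, log_likelihood(models[context], seq) / max(len(seq) - 1, 1)
```

A test sequence drawn from a learned context receives a higher per-transition log-likelihood than an anomalous one, and that score is the quantity a detection threshold would be applied to.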

Our results show that our context-aware anomaly detection approach, when run on symbolic sequences, achieves significantly better detection performance for the datasets tested in each domain. Furthermore, we addressed several issues in the design of a practical anomaly detection system when working with symbolic sequences. We considered the impact of context clustering, window size selection, and filtering on our detection results.

This thesis makes contributions to the field of machine learning by proposing a novel approach to parameter estimation for sequential data clustering during context learning. This approach includes two steps: 1) estimating the number of natural clusters, and 2) estimating the minimum subsequence length required to classify a sequence accurately. In the first step, the idea is to identify similar sequences using their frequency distributions to estimate the number of clusters in a dataset. In the second step, the idea is to perform this natural clustering by computing the distances between sequences with a combination of string similarity measures; this step estimates the number of symbols needed to perform accurate analysis on a stream of sequential symbols. The advantage of performing context learning with only a portion of a sequence can be seen in the testing phase: it is critical to detect anomalies while the event is occurring, in both cyber and physical security.

Next, we present future research directions that this work can take.


7.2 Future Work

In this dissertation, we have designed a context-aware anomaly detection framework for symbolic sequences. We anticipate this thesis will spawn a number of future research threads. First of all, our current work explores anomaly detection in two application domains, but there are many other sequential data applications that could benefit from our approach. We expect DNA sequencing, spam filtering, sales analytics, and social network analytics to be future avenues of research on sequential data.

Our work in unsupervised context learning depends on estimating two parameters: the number of clusters k and the required subsequence length l. The proposed methodology is effective for grouping similar sequences. However, our approach comes with a few limitations which can be addressed in future work. First, the current method estimates the number of clusters k in a way that can potentially hide the sequential nature of the data. Future work could focus on estimating both k and l simultaneously. An improved method may estimate these parameters incrementally by applying a model selection criterion such as BIC [124] or AIC (Akaike information criterion) [112].
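As a sketch of how such a criterion could drive the choice of k, the following assumes each candidate clustering exposes a fitted log-likelihood and a parameter count; the candidate values below are hypothetical, and this is an illustration of the future-work idea rather than an implemented method.

```python
import math

# Hedged sketch: choose the number of clusters k by minimizing the Bayesian
# information criterion, BIC = p * ln(n) - 2 * ln(L), where p is the number of
# free parameters, n the sample count, and L the fitted likelihood.

def bic(log_likelihood, num_params, num_samples):
    return num_params * math.log(num_samples) - 2.0 * log_likelihood

def select_k(candidates, num_samples):
    """candidates: {k: (log_likelihood, num_params)}; pick the lowest BIC."""
    return min(candidates,
               key=lambda k: bic(*candidates[k], num_samples))
```

The penalty term grows with the parameter count, so a richer clustering must improve the likelihood enough to justify its extra parameters.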

Second, the current method used to measure distances between sequences may be prone to noise, since it is based on string-based similarity metrics: the approach records exact position matches and the longest common consecutive subsequence. Future work could improve the distance computation by considering alignment-based similarity measures.
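A distance of the kind described, combining an exact position-match term with a longest-common-consecutive-subsequence (longest common substring) term, can be sketched as below. The equal 50/50 weighting of the two terms is an assumption for illustration, not the thesis's parameterization.

```python
# Illustrative combined string distance: a position-match term plus a
# longest-common-substring term, each normalized by the longer length.

def position_match_ratio(a, b):
    n = max(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 1.0

def longest_common_substring(a, b):
    best = 0
    prev = [0] * (len(b) + 1)   # prev[j]: common suffix length at a[i-1], b[j-1]
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def combined_distance(a, b, w=0.5):
    """0.0 for identical strings, 1.0 for strings sharing nothing."""
    n = max(len(a), len(b)) or 1
    sim = w * position_match_ratio(a, b) + (1 - w) * longest_common_substring(a, b) / n
    return 1.0 - sim
```

Both terms reward exact, contiguous agreement, which is why this family of measures is sensitive to noise such as a single inserted symbol, the limitation the paragraph above notes.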

In future work, we would also like to investigate:

System call anomaly detection:


• Additional information: In our current work, each unique system call in a program execution trace is represented with a symbol from a finite set of symbols. Future work may include extending our method by incorporating system call attributes. Considering these attributes would help to guide a more robust context learning approach, and could lead to improved detection rates.
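The symbolic representation this item refers to can be sketched as follows; the first-appearance ordering of the alphabet is an assumption for illustration, not the thesis's mapping.

```python
# Minimal sketch: map each unique system call name in a trace to a symbol
# from a finite alphabet, assigning symbols in order of first appearance.

def trace_to_symbols(trace):
    alphabet = {}
    symbols = []
    for call in trace:
        if call not in alphabet:
            alphabet[call] = chr(ord("A") + len(alphabet))
        symbols.append(alphabet[call])
    return "".join(symbols), alphabet
```

The resulting string is the discrete sequence the anomaly detection framework operates on; the attributes mentioned above (arguments, return values) are discarded by this mapping.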

Crowd anomaly detection:

• Feature extraction: Our current feature extraction approach is based on detecting multiple objects and tracking those objects in every frame. The current tracking methodology may be improved by investigating more sophisticated detection and tracking algorithms. In order to detect abnormal events in highly crowded scenes, we may be able to extract holistic-based features without applying a detection methodology [101].

• Symbol representation: An issue we would face when applying our work in a real-world setting is the generation of symbolic representations of feature histograms while testing with new data. In the current work, since we were provided with training and testing data, we could apply clustering to assign a symbol to each feature histogram in a data preparation step; therefore, we could generate symbolic sequences from the same symbolic alphabet for both the training and test data. In a live system scenario, there would be some feature histograms that are not very similar to any feature histogram cluster. Although this would be a sign of a new event or an anomaly, we still want to evaluate this sequence of symbols to detect abnormal events. There are two ways to address this problem: 1) We can classify these new histograms by assigning a symbol from the current alphabet, selecting the most similar cluster. If the symbol assigned is wrong, this might degrade the entire anomaly detection system, since our approach depends on learning behaviors from a sequence of symbols. 2) We could define a threshold value for symbol assignment: if none of the histogram clusters is close to the incoming sample, we create a new symbol. Since the test sequence then contains new symbols, the anomaly detection system would directly produce an alarm. In this case, a system administrator needs to decide whether this event is an anomaly or a new normal event that was missed during training. If the sample is deemed normal, the system needs to retrain to adapt to this new behavior. As future work, we want to improve our framework by considering this problem in a live streaming scenario.

• Additional events: In our current work, the number of events is somewhat limited, depending on the benchmark dataset we used. Future work can explore how to apply anomaly detection to a wider variety of events occurring in sequential data.
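The second symbol-assignment strategy discussed under symbol representation, assigning the nearest cluster's symbol when the incoming histogram is close enough and minting a new symbol otherwise, can be sketched as follows. The Euclidean distance and the threshold value are illustrative assumptions.

```python
import math

# Sketch of threshold-based symbol assignment for a live setting: return the
# nearest cluster's symbol if it is within the threshold; otherwise create a
# new symbol, which downstream detection treats as an alarm condition.

def assign_symbol(histogram, clusters, threshold):
    """clusters: {symbol: centroid histogram}. Returns (symbol, is_new)."""
    def dist(u, v):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
    symbol, d = min(((s, dist(histogram, c)) for s, c in clusters.items()),
                    key=lambda t: t[1])
    if d <= threshold:
        return symbol, False
    new_symbol = "new_%d" % len(clusters)   # unseen symbol -> alarm downstream
    clusters[new_symbol] = histogram
    return new_symbol, True
```

Whether an alarm raised this way marks a true anomaly or a missed normal behavior is the human-in-the-loop decision described above, after which retraining would absorb the new symbol.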


Bibliography

[1] Adomavicius, G., and Tuzhilin, A. Context-aware recommender systems. In Recommender systems handbook. Springer, 2011, pp. 217–253.

[2] Adomavicius, G., and Tuzhilin, A. Context-aware recommender systems. In Recommender systems handbook. Springer, 2011, pp. 217–253.

[3] Aggarwal, C. C., and Zhai, C. A survey of text clustering algorithms. In Mining Text Data. Springer, 2012, pp. 77–128.

[4] Aizawa, A. An information-theoretic perspective of tf–idf measures. Information Processing & Management 39, 1 (2003), 45–65.

[5] Ali, S., and Shah, M. Floor fields for tracking in high density crowd scenes. In Computer Vision–ECCV 2008. Springer, 2008, pp. 1–14.

[6] Aloysius, G., and Binu, D. An approach to products placement in supermarkets using prefixspan algorithm. Journal of King Saud University-Computer and Information Sciences 25, 1 (2013), 77–87.

[7] Alpaydin, E. Introduction to machine learning. MIT Press, 2004.


[8] Andrade, E., Blunsden, O., and Fisher, R. Performance analysis of event detection models in crowded scenes. In Visual Information Engineering, 2006. VIE 2006. IET International Conference on (2006), IET, pp. 427–432.

[9] Andrade, E. L., Blunsden, S., and Fisher, R. B. Hidden Markov models for optical flow analysis in crowds. In Pattern Recognition, 2006. ICPR 2006. 18th International Conference on (2006), vol. 1, IEEE, pp. 460–463.

[10] Andrade, E. L., Blunsden, S., and Fisher, R. B. Modelling crowd scenes for event detection. In Pattern Recognition, 2006. ICPR 2006. 18th International Conference on (2006), vol. 1, IEEE, pp. 175–178.

[11] Axelsson, S. Intrusion detection systems: A survey and taxonomy. Tech. rep., 2000.

[12] Bace, R., and Mell, P. NIST special publication on intrusion detection systems. Tech. rep., DTIC Document, 2001.

[13] Baldauf, M., Dustdar, S., and Rosenberg, F. A survey on context-aware systems. International Journal of Ad Hoc and Ubiquitous Computing 2, 4 (2007), 263–277.

[14] Bardram, J. E. Applications of context-aware computing in hospital work: examples and design principles. In Proceedings of the 2004 ACM symposium on Applied computing (2004), ACM, pp. 1574–1579.

[15] Basharat, A., Gritai, A., and Shah, M. Learning object motion patterns for anomaly detection and improved object detection. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on (2008), IEEE, pp. 1–8.


[16] Bazire, M., and Brezillon, P. Understanding context before using it. In Modeling and using context. Springer, 2005, pp. 29–40.

[17] Boiman, O., and Irani, M. Detecting irregularities in images and in video. International Journal of Computer Vision 74, 1 (2007), 17–31.

[18] Borji, A., Frintrop, S., Sihite, D. N., and Itti, L. Adaptive object tracking by learning background context. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on (2012), IEEE, pp. 23–30.

[19] Bradley, A. P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30, 7 (1997), 1145–1159.

[20] Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. Handbook of Markov Chain Monte Carlo. Taylor & Francis US, 2011.

[21] Brostow, G. J., and Cipolla, R. Unsupervised Bayesian detection of independent motion in crowds. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on (2006), vol. 1, IEEE, pp. 594–601.

[22] Campbell, W., Campbell, J., Reynolds, D., Jones, D., and Leek, T. Phonetic speaker recognition with support vector machines. In Advances in Neural Information Processing Systems (2004).

[23] Candamo, J., Shreve, M., Goldgof, D. B., Sapper, D. B., and Kasturi, R. Understanding transit scenes: A survey on human behavior-recognition algorithms. Intelligent Transportation Systems, IEEE Transactions on 11, 1 (2010), 206–224.


[24] Caswell, B., Beale, J., and Baker, A. Snort Intrusion Detection and Prevention Toolkit. Syngress, 2007.

[25] Challenger, R., Clegg, C., and Robinson, M. Understanding crowd behaviours, vol. 1: Practical guidance and lessons identified. London: The Stationery Office (TSO) (2010).

[26] Chan, A. B., Morrow, M., and Vasconcelos, N. Analysis of crowded scenes using holistic properties. In Performance Evaluation of Tracking and Surveillance workshop at CVPR (2009), pp. 101–108.

[27] Chan, A. B., and Vasconcelos, N. Mixtures of dynamic textures. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on (2005), vol. 1, IEEE, pp. 641–647.

[28] Chan, K., Lee, T.-W., Sample, P. A., Goldbaum, M. H., Weinreb, R. N., and Sejnowski, T. J. Comparison of machine learning and traditional classifiers in glaucoma diagnosis. Biomedical Engineering, IEEE Transactions on 49, 9 (2002), 963–974.

[29] Chan, K.-P., and Fu, A.-C. Efficient time series matching by wavelets. In Data Engineering, 1999. Proceedings., 15th International Conference on (1999), IEEE, pp. 126–133.

[30] Chandola, V., Banerjee, A., and Kumar, V. Outlier detection: A survey. Tech. Rep. 07-017, Department of Computer Science, University of Minnesota, 2007.

[31] Chandola, V., Banerjee, A., and Kumar, V. Anomaly detection: A survey. ACM Computing Surveys (CSUR) 41, 3 (2009), 1–58.


[32] Chandola, V., Banerjee, A., and Kumar, V. Anomaly detection for discrete sequences: A survey. Knowledge and Data Engineering, IEEE Transactions on 24, 5 (2012), 823–839.

[33] Chen, D.-Y., and Huang, P.-C. Motion-based unusual event detection in human crowds. Journal of Visual Communication and Image Representation 22, 2 (2011), 178–186.

[34] Chen, W.-H., Hsu, S.-H., and Shen, H.-P. Application of SVM and ANN for intrusion detection. Computers & Operations Research 32, 10 (2005), 2617–2634.

[35] Cheung, S.-C. S., and Kamath, C. Robust techniques for background subtraction in urban traffic video. In Proceedings of SPIE (2004), vol. 5308, pp. 881–892.

[36] Chong, X., Liu, W., Huang, P., and Badler, N. I. Hierarchical crowd analysis and anomaly detection. Journal of Visual Languages & Computing (2013).

[37] Comaniciu, D., Ramesh, V., and Meer, P. Kernel-based object tracking. Pattern Analysis and Machine Intelligence, IEEE Transactions on 25, 5 (2003), 564–577.

[38] Debar, H., Dacier, M., and Wespi, A. Towards a taxonomy of intrusion-detection systems. Computer Networks 31, 8 (1999), 805–822.

[39] Debar, H., Dacier, M., and Wespi, A. A revised taxonomy for intrusion-detection systems. In Annales des telecommunications (2000), vol. 55, Springer, pp. 361–378.


[40] Dehghan, A., Idrees, H., Zamir, A. R., and Shah, M. Keynote: automatic detection and tracking of pedestrians in videos with various crowd densities, 2011.

[41] Denning, D. E. An intrusion-detection model. Software Engineering, IEEE Transactions on, 2 (1987), 222–232.

[42] Dollar, P. Piotr's Image and Video Matlab Toolbox (PMT). http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html.

[43] Dollar, P., Rabaud, V., Cottrell, G., and Belongie, S. Behavior recognition via sparse spatio-temporal features. In 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (2005), IEEE, pp. 65–72.

[44] Dollar, P., Wojek, C., Schiele, B., and Perona, P. Pedestrian detection: A benchmark. In CVPR (2009).

[45] Dollar, P., Wojek, C., Schiele, B., and Perona, P. Pedestrian detection: An evaluation of the state of the art. PAMI 34 (2012).

[46] Dourish, P. What we talk about when we talk about context. Personal and Ubiquitous Computing 8, 1 (2004), 19–30.

[47] Du, Y., Wang, H., and Pang, Y. A hidden Markov models-based anomaly intrusion detection method. In Intelligent Control and Automation, 2004. WCICA 2004. Fifth World Congress on (2004), vol. 5, IEEE, pp. 4348–4351.


[48] Ektefa, M., Memar, S., Sidi, F., and Affendey, L. S. Intrusion detection using data mining techniques. In Information Retrieval & Knowledge Management (CAMP), 2010 International Conference on (2010), IEEE, pp. 200–203.

[49] Ellis, A., Shahrokni, A., and Ferryman, J. M. PETS2009 and Winter-PETS 2009 results: A combined evaluation. In Performance Evaluation of Tracking and Surveillance (PETS-Winter), 2009 Twelfth IEEE International Workshop on (2009), IEEE, pp. 1–8.

[50] Eskin, E., Lee, W., and Stolfo, S. J. Modeling system calls for intrusion detection with dynamic window sizes. In DARPA Information Survivability Conference & Exposition II, 2001. DISCEX'01. Proceedings (2001), vol. 1, IEEE, pp. 165–175.

[51] Fawcett, T. An introduction to ROC analysis. Pattern Recognition Letters 27, 8 (2006), 861–874.

[52] Ferryman, J., and Ellis, A. PETS2010: Dataset and challenge. In Advanced Video and Signal Based Surveillance (AVSS), 2010 Seventh IEEE International Conference on (2010), IEEE, pp. 143–150.

[53] Figueiredo, M. A., and Jain, A. K. Unsupervised learning of finite mixture models. Pattern Analysis and Machine Intelligence, IEEE Transactions on 24, 3 (2002), 381–396.

[54] Forrest, S., Hofmeyr, S. A., Somayaji, A., and Longstaff, T. A. A sense of self for Unix processes. In Security and Privacy, 1996. Proceedings., 1996 IEEE Symposium on (1996), IEEE, pp. 120–128.


[55] Fraley, C., and Raftery, A. E. How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal 41, 8 (1998), 578–588.

[56] Fu, T.-C. A review on time series data mining. Engineering Applications of Artificial Intelligence 24, 1 (2011), 164–181.

[57] Garate, C., Bilinsky, P., and Bremond, F. Crowd event recognition using HOG tracker. In Performance Evaluation of Tracking and Surveillance (PETS-Winter), 2009 Twelfth IEEE International Workshop on (2009), IEEE, pp. 1–6.

[58] Ge, W., Collins, R. T., and Ruback, R. B. Vision-based analysis of small groups in pedestrian crowds. Pattern Analysis and Machine Intelligence, IEEE Transactions on 34, 5 (2012), 1003–1016.

[59] Hariri, N., Mobasher, B., and Burke, R. Context-aware music recommendation based on latent topic sequential patterns. In Proceedings of the sixth ACM conference on Recommender systems (2012), ACM, pp. 131–138.

[60] Heady, R., Luger, G., Maccabe, A., and Servilla, M. The architecture of a network-level intrusion detection system. Department of Computer Science, College of Engineering, University of New Mexico, 1990.

[61] Herranz, J., Nin, J., and Sole, M. Optimal symbol alignment distance: a new distance for sequences of symbols. Knowledge and Data Engineering, IEEE Transactions on 23, 10 (2011), 1541–1554.


[62] Hervieu, A., Bouthemy, P., and Le Cadre, J.-P. A statistical video content recognition method using invariant features on object trajectories. Circuits and Systems for Video Technology, IEEE Transactions on 18, 11 (2008), 1533–1543.

[63] Hoang, X., and Hu, J. An efficient hidden Markov model training scheme for anomaly intrusion detection of server applications based on system calls. In Networks, 2004. (ICON 2004). Proceedings. 12th IEEE International Conference on (2004), vol. 2, IEEE, pp. 470–474.

[64] Hoang, X. D., Hu, J., and Bertok, P. A multi-layer model for anomaly intrusion detection using program sequences of system calls. In Networks, ICON2003. The 11th IEEE International Conference on (2003), IEEE, pp. 531–536.

[65] Hodge, V. J., and Austin, J. A survey of outlier detection methodologies. Artificial Intelligence Review 22, 2 (2004), 85–126.

[66] Hofmeyr, S. A., Forrest, S., and Somayaji, A. Intrusion detection using sequences of system calls. Journal of Computer Security 6, 3 (1998), 151–180.

[67] Hong, J.-Y., Suh, E.-H., and Kim, S.-J. Context-aware systems: A literature review and classification. Expert Systems with Applications 36, 4 (2009), 8509–8522.

[68] Hu, J., Yu, X., Qiu, D., and Chen, H.-H. A simple and efficient hidden Markov model scheme for host-based anomaly intrusion detection. Network, IEEE 23, 1 (2009), 42–47.


[69] Hu, W., Liao, Y., and Vemuri, V. R. Robust anomaly detection using support vector machines. In Proceedings of the international conference on machine learning (2003), pp. 282–289.

[70] Hu, W., Xiao, X., Fu, Z., Xie, D., Tan, T., and Maybank, S. A system for learning statistical motion patterns. Pattern Analysis and Machine Intelligence, IEEE Transactions on 28, 9 (2006), 1450–1464.

[71] Jacques Junior, J. C. S., Musse, S. R., and Jung, C. R. Crowd analysis using computer vision techniques. Signal Processing Magazine, IEEE 27, 5 (2010), 66–77.

[72] Jain, A. K. Data clustering: 50 years beyond k-means. Pattern Recognition Letters 31, 8 (2010), 651–666.

[73] Jain, A. K., and Dubes, R. C. Algorithms for clustering data. Prentice-Hall, Inc., 1988.

[74] Jiang, F., Wu, Y., and Katsaggelos, A. K. Abnormal event detection based on trajectory clustering by 2-depth greedy search. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on (2008), IEEE, pp. 2129–2132.

[75] Kang, D.-K., Fuller, D., and Honavar, V. Learning classifiers for misuse and anomaly detection using a bag of system calls representation. In Information Assurance Workshop, 2005. IAW'05. Proceedings from the Sixth Annual IEEE SMC (2005), IEEE, pp. 118–125.


[76] Kaufman, L., and Rousseeuw, P. J. Finding groups in data: an introduction to cluster analysis. Wiley Series in Probability and Mathematical Statistics (2005).

[77] Keogh, E., Lonardi, S., and Chiu, B.-C. Finding surprising patterns in a time series database in linear time and space. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (2002), ACM, pp. 550–556.

[78] Khreich, W., Granger, E., Sabourin, R., and Miri, A. Combining hidden Markov models for improved anomaly detection. In Communications, 2009. ICC'09. IEEE International Conference on (2009), IEEE, pp. 1–6.

[79] Kinde, I., Wu, J., Papadopoulos, N., Kinzler, K. W., and Vogelstein, B. Detection and quantification of rare mutations with massively parallel sequencing. Proceedings of the National Academy of Sciences 108, 23 (2011), 9530–9535.

[80] Klaser, A., Marszałek, M., Schmid, C., and Zisserman, A. Human focused action localization in video. In Trends and Topics in Computer Vision. Springer, 2012, pp. 219–233.

[81] Kohavi, R., et al. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI (1995), vol. 14, pp. 1137–1145.

[82] Kratz, L., and Nishino, K. Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (2009), IEEE, pp. 1446–1453.


[83] Kratz, L., and Nishino, K. Going with the flow: pedestrian efficiency in crowded scenes. In Computer Vision–ECCV 2012. Springer, 2012, pp. 558–572.

[84] Kratz, L., and Nishino, K. Tracking pedestrians using local spatio-temporal motion patterns in extremely crowded scenes. Pattern Analysis and Machine Intelligence, IEEE Transactions on 34, 5 (2012), 987–1002.

[85] Kruegel, C., Mutz, D., Valeur, F., and Vigna, G. On the detection of anomalous system call arguments. In Computer Security–ESORICS 2003. Springer, 2003, pp. 326–343.

[86] Lan, T., Wang, Y., Yang, W., Robinovitch, S. N., and Mori, G. Discriminative latent models for recognizing contextual group activities. Pattern Analysis and Machine Intelligence, IEEE Transactions on 34, 8 (2012), 1549–1562.

[87] Lee, W., and Xiang, D. Information-theoretic measures for anomaly detection. In Security and Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE Symposium on (2001), IEEE, pp. 130–143.

[88] Liao, Y., and Vemuri, V. R. Use of k-nearest neighbor classifier for intrusion detection. Computers & Security 21, 5 (2002), 439–448.

[89] Liao, Y., and Vemuri, V. R. Using text categorization techniques for intrusion detection. In USENIX Security Symposium (2002), vol. 12, pp. 51–59.

[90] Lin, J., Keogh, E., Lonardi, S., and Patel, P. Finding motifs in time series. In Proc. of the 2nd Workshop on Temporal Data Mining (2002), pp. 53–68.


[91] Liu, G., McDaniel, T. K., Falkow, S., and Karlin, S. Sequence anomalies in the cag7 gene of the Helicobacter pylori pathogenicity island. Proceedings of the National Academy of Sciences 96, 12 (1999), 7011–7016.

[92] Liu, X., and Aberer, K. SoCo: a social network aided context-aware recommender system. In Proceedings of the 22nd international conference on World Wide Web (2013), International World Wide Web Conferences Steering Committee, pp. 781–802.

[93] Lobo, J. M., Jimenez-Valverde, A., and Real, R. AUC: a misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography 17, 2 (2008), 145–151.

[94] Maggi, F., Matteucci, M., and Zanero, S. Detecting intrusions through system call sequence and argument analysis. Dependable and Secure Computing, IEEE Transactions on 7, 4 (2010), 381–395.

[95] Mahadevan, V., Li, W., Bhalodia, V., and Vasconcelos, N. Anomaly detection in crowded scenes. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (2010), IEEE, pp. 1975–1981.

[96] Markou, M., and Singh, S. Novelty detection: a review - part 1: statistical approaches. Signal Processing 83, 12 (2003), 2481–2497.

[97] Markou, M., and Singh, S. Novelty detection: a review - part 2: neural network based approaches. Signal Processing 83, 12 (2003), 2499–2521.

[98] Marques, J. S., Jorge, P. M., Abrantes, A. J., and Lemos, J. Tracking groups of pedestrians in video sequences. In Computer Vision and Pattern Recognition Workshop, 2003. CVPRW'03. Conference on (2003), vol. 9, IEEE, pp. 101–101.

[99] Maruster, L., Weijters, A. T., van der Aalst, W. W., and van den Bosch, A. Process mining: Discovering direct successors in process logs. In Discovery Science (2002), Springer, pp. 364–373.

[100] Maxion, R. A., and Roberts, R. R. Proper use of ROC curves in Intrusion/Anomaly Detection. University of Newcastle upon Tyne, Computing Science, 2004.

[101] Mehran, R., Oyama, A., and Shah, M. Abnormal crowd behavior detection using social force model. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (2009), IEEE, pp. 935–942.

[102] Montgomery, D. C., Jennings, C. L., and Kulahci, M. Introduction to time series analysis and forecasting, vol. 526. John Wiley & Sons, 2011.

[103] Mutz, D., Valeur, F., Vigna, G., and Kruegel, C. Anomalous system call detection. ACM Transactions on Information and System Security (TISSEC) 9, 1 (2006), 61–93.

[104] Ng, B. Survey of anomaly detection methods. United States Department of Energy, 2006.

[105] Oliver, N. M., Rosario, B., and Pentland, A. P. A bayesian computer vision system for modeling human interactions. Pattern Analysis and Machine Intelligence, IEEE Transactions on 22, 8 (2000), 831–843.


[106] Park, S., and Trivedi, M. M. Homography-based analysis of people and vehicle activities in crowded scenes. In Applications of Computer Vision, 2007. WACV'07. IEEE Workshop on (2007), IEEE, pp. 51–51.

[107] Parvathy, R., Thilakan, S., Joy, M., and Sameera, K. Anomaly detection using motion patterns computed from optical flow. In Advances in Computing and Communications (ICACC), 2013 Third International Conference on (2013), IEEE, pp. 58–61.

[108] Patcha, A., and Park, J.-M. An overview of anomaly detection techniques: Existing solutions and latest technological trends. Computer Networks 51, 12 (2007), 3448–3470.

[109] Pelleg, D., and Moore, A. W. X-means: Extending k-means with efficient estimation of the number of clusters. In ICML (2000), pp. 727–734.

[110] Pelleg, D., and Moore, A. W. Active learning for anomaly and rare-category detection. In Advances in Neural Information Processing Systems (2005), pp. 1073–1080.

[111] PETS. Performance evaluation of tracking and surveillance 2009 benchmark data, 2009. [Online; accessed 08-March-2014].

[112] Posada, D., and Buckley, T. R. Model selection and model averaging in phylogenetics: advantages of Akaike information criterion and bayesian approaches over likelihood ratio tests. Systematic Biology 53, 5 (2004), 793–808.

[113] Qayyum, A., Islam, M., and Jamil, M. Taxonomy of statistical based anomaly detection techniques for intrusion detection. In Emerging Technologies, 2005. Proceedings of the IEEE Symposium on (2005), IEEE, pp. 270–276.


[114] Rabiner, L. R. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE 77, 2 (1989), 257–286.

[115] Reisman, P., Mano, O., Avidan, S., and Shashua, A. Crowd detection in video sequences. In Intelligent Vehicles Symposium, 2004 IEEE (2004), IEEE, pp. 66–71.

[116] Reynolds, A. P., Richards, G., and Rayward-Smith, V. J. The application of k-medoids and PAM to the clustering of rules. 173–178.

[117] Roberts, S. W. Control chart tests based on geometric moving averages. Technometrics 1, 3 (1959), 239–250.

[118] Rosemann, M., Recker, J., and Flender, C. Contextualisation of business processes. International Journal of Business Process Integration and Management 3, 1 (2008), 47–60.

[119] Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20 (1987), 53–65.

[120] Salem, M. B., Hershkop, S., and Stolfo, S. J. A survey of insider attack detection research. In Insider Attack and Cyber Security. Springer, 2008, pp. 69–90.

[121] Salton, G., and Buckley, C. Term-weighting approaches in automatic text retrieval. Information Processing & Management 24, 5 (1988), 513–523.


[122] Santos, A. C., Cardoso, J. M., Ferreira, D. R., Diniz, P. C., and Chaínho, P. Providing user context for mobile and social networking applications. Pervasive and Mobile Computing 6, 3 (2010), 324–341.

[123] Scarfone, K., and Mell, P. Guide to intrusion detection and prevention systems (IDPS). NIST Special Publication 800-94 (2007).

[124] Schwarz, G. Estimating the dimension of a model. The Annals of Statistics 6, 2 (1978), 461–464.

[125] Shameli-Sendi, A., Ezzati-Jivan, N., Jabbarifar, M., and Dagenais, M. Intrusion response systems: survey and taxonomy. SIGMOD Rec 12 (2012), 1–14.

[126] Sharma, N., Sharma, P., Irwin, D., and Shenoy, P. Predicting solar generation from weather forecasts using machine learning. In Smart Grid Communications (SmartGridComm), 2011 IEEE International Conference on (2011), IEEE, pp. 528–533.

[127] Solmaz, B., Moore, B. E., and Shah, M. Identifying behaviors in crowd scenes using stability analysis for dynamical systems. Pattern Analysis and Machine Intelligence, IEEE Transactions on 34, 10 (2012), 2064–2070.

[128] Stauffer, C., and Grimson, W. E. L. Learning patterns of activity using real-time tracking. Pattern Analysis and Machine Intelligence, IEEE Transactions on 22, 8 (2000), 747–757.

[129] Steinbach, M., Karypis, G., Kumar, V., et al. A comparison of document clustering techniques. In KDD workshop on text mining (2000), vol. 400, pp. 525–526.


[130] Tan, K. M., and Maxion, R. A. "Why 6?" Defining the operational limits of stide, an anomaly-based intrusion detector. In Security and Privacy, 2002. Proceedings. 2002 IEEE Symposium on (2002), IEEE, pp. 188–201.

[131] Tan, K. M., and Maxion, R. A. Determining the operational limits of an anomaly-based intrusion detector. Selected Areas in Communications, IEEE Journal on 21, 1 (2003), 96–110.

[132] Tandon, G., and Chan, P. Learning rules from system call arguments and sequences for anomaly detection. In ICDM Workshop on Data Mining for Computer Security (DMSEC) (2003), pp. 20–29.

[133] Thida, M., Eng, H.-L., Dorothy, M., and Remagnino, P. Learning video manifold for segmenting crowd events and abnormality detection. In Computer Vision–ACCV 2010. Springer, 2011, pp. 439–449.

[134] Tsai, C.-F., Hsu, Y.-F., Lin, C.-Y., and Lin, W.-Y. Intrusion detection by machine learning: A review. Expert Systems with Applications 36, 10 (2009), 11994–12000.

[135] Tu, Z. Auto-context and its application to high-level vision tasks. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on (2008), IEEE, pp. 1–8.

[136] UNM. UNM system call dataset, 2013. [Online; accessed 28-November-2013].

[137] Varghese, S. M., and Jacob, K. P. Process profiling using frequencies of system calls. In Availability, Reliability and Security, 2007. ARES 2007. The Second International Conference on (2007), IEEE, pp. 473–479.


[138] Vaswani, N., Roy-Chowdhury, A. K., and Chellappa, R. "Shape activity": a continuous-state HMM for moving/deforming shapes with application to abnormal activity detection. Image Processing, IEEE Transactions on 14, 10 (2005), 1603–1616.

[139] Wang, S., and Miao, Z. Anomaly detection in crowd scene. In Signal Processing (ICSP), 2010 IEEE 10th International Conference on (2010), IEEE, pp. 1220–1223.

[140] Wang, T., and Snoussi, H. Histograms of optical flow orientation for abnormal events detection. In Performance Evaluation of Tracking and Surveillance (PETS), 2013 IEEE International Workshop on (2013), IEEE, pp. 45–52.

[141] Wang, W., Zhang, X., and Gombault, S. Constructing attribute weights from computer audit data for effective intrusion detection. Journal of Systems and Software 82, 12 (2009), 1974–1981.

[142] Wang, X., et al. Learning motion patterns using hierarchical bayesian models. PhD thesis, Massachusetts Institute of Technology, 2009.

[143] Wang, X., Ma, K. T., Ng, G.-W., and Grimson, W. E. L. Trajectory analysis and semantic region modeling using nonparametric hierarchical bayesian models. International Journal of Computer Vision 95, 3 (2011), 287–312.

[144] Wang, X., Ma, X., and Grimson, E. Unsupervised activity perception by hierarchical bayesian models. In Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on (2007), IEEE, pp. 1–8.


[145] Wang, X., Tieu, K., and Grimson, E. Learning semantic scene models by trajectory analysis. In Computer Vision–ECCV 2006. Springer, 2006, pp. 110–123.

[146] Warrender, C., Forrest, S., and Pearlmutter, B. Detecting intrusions using system calls: Alternative data models. In Security and Privacy, 1999. Proceedings of the 1999 IEEE Symposium on (1999), IEEE, pp. 133–145.

[147] Xing, Z., Pei, J., and Keogh, E. A brief survey on sequence classification. ACM SIGKDD Explorations Newsletter 12, 1 (2010), 40–48.

[148] Xiong, G., Cheng, J., Wu, X., Chen, Y.-L., Ou, Y., and Xu, Y. An energy model approach to people counting for abnormal crowd behavior detection. Neurocomputing 83 (2012), 121–135.

[149] Xu, R., and Wunsch, D. Survey of clustering algorithms. Neural Networks, IEEE Transactions on 16, 3 (2005), 645–678.

[150] Yang, J., Vela, P., Shi, Z., and Teizer, J. Probabilistic multiple people tracking through complex situations. In 11th IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (2009).

[151] Ye, N., and Chen, Q. An anomaly detection technique based on a chi-square statistic for detecting intrusions into information systems. Quality and Reliability Engineering International 17, 2 (2001), 105–112.

[152] Yeung, D.-Y., and Chow, C. Parzen-window network intrusion detectors. In Pattern Recognition, 2002. Proceedings. 16th International Conference on (2002), vol. 4, IEEE, pp. 385–388.


[153] Yeung, D.-Y., and Ding, Y. Host-based intrusion detection using dynamic and static behavioral models. Pattern Recognition 36, 1 (2003), 229–243.

[154] Yilmaz, A., Javed, O., and Shah, M. Object tracking: A survey. ACM Computing Surveys (CSUR) 38, 4 (2006), 13.

[155] Yolacan, E. N., Dy, J. G., and Kaeli, D. R. System call anomaly detection using multi-HMMs. In Software Security and Reliability-Companion (SERE-C), 2014 IEEE Eighth International Conference on (2014), IEEE, pp. 25–30.

[156] Yue, W. T., and Cakanyıldırım, M. A cost-based analysis of intrusion detection system configuration under active or passive response. Decision Support Systems 50, 1 (2010), 21–31.

[157] Zamboni, D., et al. Using internal sensors for computer intrusion detection. Center for Education and Research in Information Assurance and Security, Purdue University (2001).

[158] Zaraska, K. IDS active response mechanisms: Countermeasure subsystem for Prelude IDS. Tech. rep., Citeseer, 2002.

[159] Zhan, B., Monekosso, D. N., Remagnino, P., Velastin, S. A., and Xu, L.-Q. Crowd analysis: a survey. Machine Vision and Applications 19, 5-6 (2008), 345–357.

[160] Zhang, Y., Ge, W., Chang, M.-C., and Liu, X. Group context learning for event recognition. In Applications of Computer Vision (WACV), 2012 IEEE Workshop on (2012), IEEE, pp. 249–255.


[161] Zhang, Z., and Li, M. Crowd density estimation based on statistical analysis of local intra-crowd motions for public area surveillance. Optical Engineering 51, 4 (2012), 047204-1.

[162] Zhang, Z., and Shen, H. Application of online-training SVMs for real-time intrusion detection with different considerations. Computer Communications 28, 12 (2005), 1428–1442.

[163] Zhao, Y., and Karypis, G. Evaluation of hierarchical clustering algorithms for document datasets. In Proceedings of the eleventh international conference on Information and knowledge management (2002), ACM, pp. 515–524.

[164] Zhao, Y., and Karypis, G. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning 55, 3 (2004), 311–331.

[165] Zhou, B., Wang, X., and Tang, X. Understanding collective crowd behaviors: Learning a mixture model of dynamic pedestrian-agents. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (2012), IEEE, pp. 2871–2878.

[166] Zhu, Y., Nayak, N. M., and Roy-Chowdhury, A. K. Context-aware activity recognition and anomaly detection in video. Selected Topics in Signal Processing, IEEE Journal of 7, 1 (2013), 91–101.

[167] Zhu, Y., Nayak, N. M., and Roy-Chowdhury, A. K. Context-aware modeling and recognition of activities in video. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on (2013), IEEE, pp. 2491–2498.
