TRANSCRIPT
1
Anomaly Detection in Categorical Datasets
Kaustav Das, Jeff Schneider
Machine Learning Department
Carnegie Mellon University
2
Outline
  Problem Motivation/Overview
  Related Work
  Conditional Anomaly
  Marginal Anomaly
  Datasets and Results
3
Problem Motivation
  Import of Containers
  Astronomical Data
  Emergency Department
  Network Intrusion Detection
Detect anomalous records in large amounts of record-based data.
4
Problem Overview
Training Data:
  Categorical dataset – real values are categorized.
  Large number of records – 100,000 to 1 million.
  Unlabelled: a small fraction (<1-2%) can be anomalous.
  Attributes can have high arity, up to 5,000-10,000.
Test Data:
  Same properties as above; can have any fraction of anomalous records.
Goal:
  Detect records in the test set that are 'anomalous'. More generally, score each test record by its degree of anomalousness. Flag records based on the desired false-positive rate.
5
Problem Overview
FPORT USPORT COUNTRY SLINE VESSEL SHIPPER_NAME F_NAME COMMODITY SIZE MTONS VALUE
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE AMERICAN_TRI_NET_EXPRESS TRI_NET EMPTY_RACK 0 5.6 27579
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE ORDER ORDER_OF_SHIPPER USED_TIRE 2 13.43 9497
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE ORDER ORDER_OF_SHIPPER USED_TIRE 2 13.43 9497
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE AMERICAN_TRI_NET_EXPRESS TRI_NET CRUDE_IODINE_PURITY 1 17.68 251151
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE NEW_WAVE_TRANSPORT JIT PANELS_F_MODEL_98 3 39.57 65169
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE NEW_WAVE_TRANSPORT JIT PANELS_F_MODEL_98 3 39.57 65169
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE NEW_WAVE_TRANSPORT JIT PANELS_F_MODEL_98 3 39.57 65169
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE ORDER ORDER_OF_SHIPPER USED_TIRES 2 13.43 9497
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE CHINA_OCEAN_SHPG CHINA_OCEAN_SHPG_AGENCY EMPTY_CONTAINERS 0 0 0
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE CHINA_OCEAN_SHPG CHINA_OCEAN_SHPG_AGENCY EMPTY_CONTAINERS 0 0 0
Example Dataset – PIERS Data
6
Related Work
Likelihood-Based Methods
  Learn a probability distribution model from training data. Anomalies: test-set records having unusually low likelihood under the learnt model.
  Bayes Network:
    Network Intrusion Detection [Ye and Xu '00; Bronstein et al. '01]
    Malicious Email Detection [Shih et al. '04]
    Disease Outbreak Detection [Wong et al. '03]
  Dependency Trees [Pelleg '04]
7
Related Work
Likelihood-Based Methods
  Learn a probability distribution model from training data. Anomalies: test-set records having unusually low likelihood under the learnt model.
  Bayes Network:
    Network Intrusion Detection [Ye and Xu '00; Bronstein et al. '01]
    Malicious Email Detection [Shih et al. '04]
    Disease Outbreak Detection [Wong et al. '03]
  Dependency Trees [Pelleg '04]
Association Rule Learners
  LERAD [Chan et al. '06]: learn rules of the form X → Y; the anomaly score depends on P(¬Y|X)
  Hidden Association Rules [Banderas et al. '05]
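As a minimal illustration of the likelihood-based approach, the sketch below scores records by their log-likelihood under a fully factored (independence) model; the cited work uses Bayes networks and dependency trees instead, and the function names here are ours:

```python
from collections import Counter
from math import log

def train_independence_model(records):
    # Fully factored stand-in for the learned model: P(record) = prod_i P(x_i),
    # with one empirical marginal per attribute.
    N = len(records)
    return [{v: c / N for v, c in Counter(col).items()} for col in zip(*records)]

def log_likelihood(model, record, eps=1e-9):
    # Anomalies = records with unusually low likelihood; unseen values get a
    # small floor probability instead of zero.
    return sum(log(p.get(v, eps)) for p, v in zip(model, record))
```

Records with the lowest log-likelihood would be flagged, down to the desired false-positive rate.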
8
Outline
  Problem Motivation/Overview
  Related Work
  Conditional Anomaly
    Motivation/Definition
    Algorithm for Testing Records
    Estimating Probability Values
    Speedup Tricks
  Marginal Anomaly
  Datasets and Results
9
Conditional Anomaly
Suppose P(Commodity|Country) is a factor in the Bayes network. In test record t: Commodity = Gold, Country = China.
P(Gold|China) = 0.001; P(Gold) = 0.001
Is this an anomaly?
10
Conditional Anomaly
Suppose P(Commodity|Country) is a factor in the Bayes network. In test record t: Commodity = Gold, Country = China.
P(Gold|China) = 0.001; P(Gold) = 0.001
Normalize:
  r(Commodity_t, Country_t) = P(Gold | China) / P(Gold) = 0.001 / 0.001 = 1
11
Conditional Anomaly
Suppose P(Commodity|Country) is a factor in the Bayes network.
In test record t: Commodity = Gold, Country = China:
  P(Gold|China) = 0.001; P(Gold) = 0.001
  r(Commodity_t, Country_t) = P(Gold | China) / P(Gold) = 0.001 / 0.001 = 1
Commodity = Copper, Country = China:
  P(Copper|China) = 0.001; P(Copper) = 0.1
  r(Commodity_t, Country_t) = P(Copper | China) / P(Copper) = 0.001 / 0.1 = 0.01
12
Conditional Anomaly
The r-value is defined over two attribute values a_t and b_t of attributes A and B in test record t:
  r(a_t, b_t) = P(a_t | b_t) / P(a_t) = P(b_t | a_t) / P(b_t) = P(a_t, b_t) / (P(a_t) P(b_t))
A small r-value denotes a strong negative dependence between the occurrence of values a_t and b_t.
If a_t and b_t co-occur in t and r(a_t, b_t) << 1, then t is anomalous!
13
Conditional Anomaly
The r-value is defined over two attribute values a_t and b_t of attributes A and B in test record t:
  r(a_t, b_t) = P(a_t | b_t) / P(a_t) = P(b_t | a_t) / P(b_t) = P(a_t, b_t) / (P(a_t) P(b_t))
A small r-value denotes a strong negative dependence between the occurrence of values a_t and b_t. If a_t and b_t co-occur in t and r(a_t, b_t) << 1, then t is anomalous!
In general, we can consider sets of attributes A and B to calculate r. For example,
  A = {Commodity}, B = {US Port, Country}
  r(a_t, b_t) = P(Country_t, US Port_t, Commodity_t) / (P(Commodity_t) P(US Port_t, Country_t))
We limit the number of attributes in each set to k. Hence, we consider up to 2k attributes at a time.
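The r-value above can be computed directly from empirical counts. A minimal sketch (maximum-likelihood estimates with no smoothing; `records` is a list of tuples, A and B are tuples of attribute indices, and the function name is ours):

```python
def r_value(records, test_record, A, B):
    # r(a_t, b_t) = P(a_t, b_t) / (P(a_t) P(b_t)), using empirical probabilities.
    N = len(records)
    a_t = tuple(test_record[i] for i in A)
    b_t = tuple(test_record[i] for i in B)
    c_a = sum(1 for rec in records if tuple(rec[i] for i in A) == a_t)
    c_b = sum(1 for rec in records if tuple(rec[i] for i in B) == b_t)
    c_ab = sum(1 for rec in records
               if tuple(rec[i] for i in A) == a_t and tuple(rec[i] for i in B) == b_t)
    return (c_ab / N) / ((c_a / N) * (c_b / N))
```

Note that r is symmetric in A and B, matching the three equivalent forms of the definition.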
14
Conditional Anomaly – Algorithm
Algorithm for testing record t:
  For each mutually exclusive pair of attribute sets {A, B}, compute:
    r(a_t, b_t) = P(a_t, b_t) / (P(a_t) P(b_t))
  Score the record t based on all the r-values.
    Heuristic 1: assign the minimum r-value as the score.
    Heuristic 2 (combining evidence): combine evidence from other subsets of attributes by taking the product of r-values.
Exponential number of r-values: O(m^2k).
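A sketch of this testing loop (the exact enumeration of disjoint set pairs, and restricting Heuristic 2's product to r-values below 1, are our reading of the slide; it also assumes the test record's values appear in training, since unsmoothed zero probabilities would divide by zero):

```python
from itertools import combinations

def score_record(records, test_record, k=1, combine="min"):
    # Enumerate disjoint attribute-set pairs {A, B} with |A|, |B| <= k,
    # compute r(a_t, b_t) for each, and combine into a score
    # (smaller score = more anomalous).
    N = len(records)
    attrs = range(len(test_record))

    def prob(idx, vals):
        return sum(1 for rec in records
                   if all(rec[i] == v for i, v in zip(idx, vals))) / N

    r_values = []
    for ka in range(1, k + 1):
        for A in combinations(attrs, ka):
            rest = [i for i in attrs if i not in A]
            for kb in range(1, k + 1):
                for B in combinations(rest, kb):
                    a_t = tuple(test_record[i] for i in A)
                    b_t = tuple(test_record[i] for i in B)
                    r = prob(A + B, a_t + b_t) / (prob(A, a_t) * prob(B, b_t))
                    r_values.append(r)
    if combine == "min":
        return min(r_values)          # Heuristic 1
    score = 1.0
    for r in r_values:                # Heuristic 2: multiply the r-values below 1
        if r < 1:
            score *= r
    return score
```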
15
Estimating Probability Values
Maximum Likelihood Estimation:
  r(a_t, b_t) = P(a_t, b_t) / (P(a_t) P(b_t)) = (C(a_t, b_t) / N) / ((C(a_t) / N) (C(b_t) / N))
  C(a_t): number of training instances having A = a_t
  N: total number of training cases
Laplace Smoothing: let p = P(a_t)
  E[p] = (C(a_t) + 1) / (N + 2)
  r(a_t, b_t) = ((C(a_t, b_t) + 1) / (N + 2)) / (((C(a_t) + 1) / (N + 2)) ((C(b_t) + 1) / (N + 2)))
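The smoothed r-value as a function of raw counts is a direct transcription of the formula above (the function name is ours):

```python
def r_laplace(c_ab, c_a, c_b, N):
    # Laplace smoothing: every probability p is estimated as (count + 1) / (N + 2).
    p_ab = (c_ab + 1) / (N + 2)
    p_a = (c_a + 1) / (N + 2)
    p_b = (c_b + 1) / (N + 2)
    return p_ab / (p_a * p_b)
```

Unlike the maximum-likelihood estimate, this never divides by zero, even for value pairs unseen in training.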
16
Estimating Probability Values
Speedup Trick 1: rare values of attributes can be ignored.
  r(a_t, b_t) ≥ α whenever C(a_t) + 1 ≤ 1/α or C(b_t) + 1 ≤ 1/α
Decrease the arity of attributes: replace all rare values with a generic rare value.
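A sketch of the arity-reduction step. The pruning threshold C(a) + 1 ≤ 1/α follows the bound above (values below it can never yield r < α); the RARE token name is ours:

```python
from collections import Counter

def reduce_arity(column, alpha):
    # Replace every value with C(a) + 1 <= 1/alpha by one generic RARE token:
    # such values cannot produce an anomalous (small) r-value, so merging them
    # shrinks the attribute's arity without changing which records are flagged.
    counts = Counter(column)
    threshold = 1 / alpha - 1
    return [v if counts[v] > threshold else "RARE" for v in column]
```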
17
Estimating Probability Values
Speedup Trick 2: to estimate counts from the training data, use a very efficient caching data structure: the AD-Tree [Moore and Lee '98].
  The naïve method is O(N). The AD-Tree pre-computes the values of most queries and requires only a small computation for the rest; it is independent of N.
  Construct an AD-Tree on the reduced-arity attributes.
18
Outline
  Problem Motivation/Overview
  Related Work
  Conditional Anomaly
  Marginal Anomaly
  Datasets and Results
19
Marginal Anomaly
What about rare values?
  Import of Plutonium
  Import of $1 million worth
What is the probability of seeing something this rare or rarer? Consider attribute set A of up to k attributes:
  qval(a_t) = Σ_{a_i ∈ X} P(a_i), where X = {x : P(A = x) ≤ P(A = a_t)}
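Computed over a single attribute for concreteness, qval can be sketched as below (a minimal sketch; generalizing `column` to tuples of values over an attribute set A is straightforward, and the function name is ours):

```python
from collections import Counter

def qval(column, a_t):
    # qval(a_t): sum of P(a_i) over all values a_i occurring with probability
    # <= P(a_t), i.e. the probability of seeing something this rare or rarer.
    counts = Counter(column)
    N = len(column)
    p_t = counts[a_t] / N
    return sum(c / N for c in counts.values() if c / N <= p_t)
```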
20
Marginal Anomaly
What about rare values?
  Import of Plutonium
  Import of $1 million worth
What is the probability of seeing something this rare or rarer? Consider attribute set A of up to k attributes:
  qval(a_t) = Σ_{a_i ∈ X} P(a_i), where X = {x : P(A = x) ≤ P(A = a_t)}
[Figure: probability-of-occurrence histograms over the values of attributes A and B, illustrating qval = 0.47 and qval = 0.01]
21
Outline
  Problem Motivation/Overview
  Related Work
  Conditional Anomaly
  Marginal Anomaly
  Datasets and Results
22
Datasets PIERS Dataset
Attribute Arity
1 Country 22
2 Foreign Port 42
3 US Port 16
4 Shipping Line 4
5 Shipper Name 4218
6 Importer Name 6412
7 Commodity Description 1649
8 Size 5
9 Weight 5
10 Value 5
24
Datasets PIERS Dataset
No labeled anomalies.
Anomaly Generation – Method 1:
  Select a test record to be modified. Randomly choose an attribute. Flip the value of the chosen attribute, drawing the new value from the attribute's marginal distribution.
Anomaly Generation – Method 2:
  Insert records from a different time period.
100,000 training records and 10,000 test records; 10% of test records are generated to be anomalous.
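Method 1 can be sketched as below (the function name is ours; `columns[i]` is the list of training values of attribute i, so a uniform choice over it is a draw from the empirical marginal):

```python
import random

def inject_flip_anomaly(record, columns, rng=random):
    # Method 1: pick one attribute at random and redraw its value from that
    # attribute's empirical marginal over the training data.
    record = list(record)
    i = rng.randrange(len(record))
    record[i] = rng.choice(columns[i])
    return record
```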
25
Results: PIERS Dataset
Performance of methods for random attribute flips
The proportion of true anomalies detected: #True Positives / #Positives.
26
Results: PIERS Dataset
Performance of methods for records inserted from a different month
27
Datasets KDD Cup 99 Dataset
Records correspond to individual network sessions. Features:
  Basic features of an individual TCP connection: duration, protocol type, number of bytes transferred, etc.
  Features obtained using some domain knowledge: number of file creation operations, number of failed login attempts, etc.
  Features computed using a two-second time window: number of connections to the same service, etc.
In total there are 41 features, most of them taking continuous values. Discretized to 5 levels.
We selected six different attack types: apache2, guess password, mailbomb, neptune, snmpguess and snmpgetattack.
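The discretization step might look like the equal-frequency binning below; the slides do not specify the binning scheme, so this is an assumption, and the function name is ours:

```python
def discretize(values, levels=5):
    # Equal-frequency binning: sort indices by value, then map rank -> bin index,
    # so each of the `levels` bins receives roughly the same number of records.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * levels // len(values)
    return bins
```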
28
Results: KDD Cup 99
Comparison of performance for Conditional and Bayes Net methods on KDD Cup dataset
29
Summary
Detecting anomalies by learning a single probability distribution model and computing whole-record likelihoods is problematic:
  High arity leads to detecting rare attribute values.
  The signal in some features gets washed out in the noise of the rest of the features.
  Anomalies highlight mistakes in model learning.
We propose new approaches to solve this:
  Consider all subsets of features up to some size.
  Define r-values, which can indicate anomalies arising from the co-occurrence of highly negatively correlated values.
Empirical results on real data sets demonstrate improved anomaly detection. The time and memory requirements of our algorithms are comparable to those of the baseline methods.
30
Thank You!
Please visit poster board #39
31
Time and Memory Requirements

Table 1: Time and space requirements for the Bayes Network method
Dataset     Training Size  Test Size  #Attributes  Training Time (secs)  Testing Time (secs)  Memory (MB)
CBP         100,000        10,000     10           6.9                   4.7                  4.5
KDD Cup 99  100,000        10,000     41           297                   1.6                  152

Table 2: Time and space requirements for the Conditional and Marginal methods
Dataset     #Attributes  k  Training Time (secs)  Testing Time (secs)  Memory (MB)  Marginal Memory (MB)
CBP         10           1  7.6                   16.8                 337          334
                         2  7.8                   133                  338          340
                         3  9.3                   790                  341          489
KDD Cup 99  41           1  10.2                  15                   323          222
                         2  44                    7145                 332          2618
32
Related Work
Supervised classification-based approaches:
  Decision Trees [Lee et al. '98]
  Neural Network [Ghosh et al. '99]
  SVMs [Li et al. '03; Shon et al. '05]
  Sequence Analysis [Hofmeyr et al. '98; Helman '97]
Unsupervised approaches applied to real-valued data:
  k-NN [Yang and Liu '99]
  Clustering [Eskin '02]
  GMM [Roberts and Tarassenko '94]
33
Estimating Probability Values
Speedup Trick 1: rare values of attributes can be ignored.
  r(a_t, b_t) = ((C(a_t, b_t) + 1) / (N + 2)) / (((C(a_t) + 1) / (N + 2)) ((C(b_t) + 1) / (N + 2)))
             ≥ (C(a_t, b_t) + 1) / (C(a_t) + 1)
             ≥ 1 / (C(a_t) + 1)
             ≥ α when C(a_t) + 1 ≤ 1/α
So r(a_t, b_t) ≥ α whenever C(a_t) + 1 ≤ 1/α.
34
Marginal Anomaly Detection
In test record t, the attribute combination A has value a_t.
  Compute C(a_t), and C(a_i) for all values a_i of A that are rarer than a_t.
For each attribute set A, precompute:
  Histogram function h_A(i): number of values of A that occur i times in the training data.
  Cumulative histogram ch_A(i) = Σ_{j=1}^{i} j · h_A(j): number of records having values of A that occur i times or less in the training data.
Construct a non-reduced AD-Tree on the training data.
  pval(a_t) = ch_A(C(a_t)) / N
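The histogram precomputation can be sketched as below (function names ours; a plain Counter stands in for the AD-Tree count queries):

```python
from collections import Counter

def histograms(column):
    # h(i): number of distinct values occurring exactly i times in training;
    # ch(i) = sum_{j<=i} j*h(j): number of records whose value occurs <= i times.
    counts = Counter(column)
    h = Counter(counts.values())
    ch, running = {}, 0
    for i in range(1, max(h) + 1):
        running += i * h.get(i, 0)
        ch[i] = running
    return counts, ch

def marginal_pval(column, a_t):
    # pval(a_t) = ch(C(a_t)) / N: fraction of records whose value is as rare
    # as or rarer than a_t.
    counts, ch = histograms(column)
    return ch[counts[a_t]] / len(column)
```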
35
Marginal Anomaly Detection
Testing algorithm for record t:
  For each composite attribute A, compute:
    pval(a_t) = ch_A(C(a_t)) / N
  Assign the minimum p-value as the score of record t.
36
Grouping
Group (cluster) the records according to some similarity measure; find clusters in data with an unusual concentration of anomalies.
  k-NN
  k-means (modes)
  GDA (link detection)
Define a collection of groups independent of the data; search over all possible groups to find an unusual concentration of anomalies.
  WSARE
  Scan Stats
37
Grouping
Group (cluster) the records according to some similarity measure; find clusters in data with an unusual concentration of anomalies.
(Dis)similarity measure: Hamming Distance
Generative model: P0(d); PA(d)
[Group chart G: membership of records T1–T5 in groups G1–G6]
N × N Hamming distances (distribution of distances):
        T1  T2  T3  T4  T5
  T1     0   1   3   3   2
  T2     1   0   7   6   4
  T3     3   7   0   6   7
  T4     3   6   6   0   3
  T5     2   4   7   3   0
38
Grouping
Group (cluster) the records according to some similarity measure; find clusters in data with an unusual concentration of anomalies.
(Dis)similarity measure: Hamming Distance
Generative model: P0(d); PA(d); group membership G(i, j) is scored by maximizing P(D(T_i, T_j)) over the indicator I(T_i, T_j ∈ G).
[Group chart and N × N Hamming distance matrix repeated from the previous slide]
39
Grouping
Group (cluster) the records according to some similarity measure; find clusters in data with an unusual concentration of anomalies.
(Dis)similarity measure: Hamming Distance
Likelihood ratio –
  Null hypothesis: attribute values belong to the training distribution.
  Alternate hypothesis: attribute values come from a different distribution.
  LR_G = ∏_{T_i ∈ G} L_A(T_i) / ∏_{T_i ∈ G} L(T_i)
Include records T_i in G to maximize LR_G.
Constraints: include all records within a distance R of the cluster center (leader).
40
Grouping
Define a collection of groups independent of the data; search over all possible groups to find an unusual concentration of anomalies.
  All records with matching value(s) of particular attribute(s).
Detection method:
  Fisher's exact test on the 2×2 table (M: anomalous records, N: total records):
    M_train(A=a)  N_train(A=a)
    M_test(A=a)   N_test(A=a)
  Low values of the ratio: R = (M_train / N_train) / (M_test / N_test)
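A sketch of this detection score, with a pure-stdlib hypergeometric tail standing in for a library routine (in practice `scipy.stats.fisher_exact` performs the same test; the function names here are ours):

```python
from math import comb

def anomaly_ratio(m_train, n_train, m_test, n_test):
    # R = (M_train / N_train) / (M_test / N_test); low R flags groups whose
    # anomaly rate is unusually high in the test period relative to training.
    return (m_train / n_train) / (m_test / n_test)

def fisher_tail(m_train, n_train, m_test, n_test):
    # One-sided Fisher's exact test: probability that m_test or more of the
    # M = m_train + m_test anomalies land among the n_test test records by
    # chance (hypergeometric upper tail; math.comb(n, k) is 0 when k > n).
    M, N = m_train + m_test, n_train + n_test
    return sum(comb(n_test, k) * comb(n_train, M - k)
               for k in range(m_test, min(M, n_test) + 1)) / comb(N, M)
```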
41
Grouping
Detection method:
  Fisher's exact test on the 2×2 table (M: anomalous records, N: total records):
    M_train(A=a)  N_train(A=a)
    M_test(A=a)   N_test(A=a)
  Low values of the ratio: R = (M_train / N_train) / (M_test / N_test)
Examples:
  group no: 1; Score: 3.508139e-01; Train Total: 1065; Train Anom: 19; Test Total: 132; Test Anom: 8; 1:LOS_ANGELES, 7:v4:15+, 9:v4:500000+
  group no: 2; Score: 4.409127e-01; Train Total: 985; Train Anom: 32; Test Total: 105; Test Anom: 15; 0:YANTIAN, 3:MLSL, 4:CHASTINE_MAERSK
  group no: 3; Score: 4.409127e-01; Train Total: 985; Train Anom: 32; Test Total: 105; Test Anom: 15; 0:YANTIAN, 4:CHASTINE_MAERSK
42
Comparisons
Comparison of performance by grouping anomalies
43
Comparisons
WSARE – Fisher's exact test on the 2×2 table of counts only:
    N_train(A=a)  N_train(A≠a)
    N_test(A=a)   N_test(A≠a)
  versus our table of anomaly counts against totals:
    M_train(A=a)  N_train(A=a)
    M_test(A=a)   N_test(A=a)
Scan Statistics – real-valued geographic coordinates.