1 Anomaly detection in Categorical Datasets Kaustav Das, Jeff Schneider Machine Learning Department Carnegie Mellon University

Upload: beverley-mckenzie

Post on 23-Dec-2015


Page 1:

Anomaly Detection in Categorical Datasets

Kaustav Das, Jeff Schneider

Machine Learning Department, Carnegie Mellon University

Page 2:

Outline
- Problem Motivation/Overview
- Related Work
- Conditional Anomaly
- Marginal Anomaly
- Datasets and Results

Page 3:

Problem Motivation

Detect anomalous records in large amounts of record-based data. Example domains:
- Import of Containers
- Astronomical Data
- Emergency Department
- Network Intrusion Detection

Page 4:

Problem Overview

Training Data:
- Categorical dataset; real values are categorized.
- Large number of records: 100,000 to 1 million.
- Unlabelled: a small fraction of the records (<1-2%) can be anomalous.
- Attributes can have high arity, up to 5,000-10,000.

Test Data:
- Same properties as above.
- Can have any fraction of anomalous records.

Goal:
- Detect records in the test set that are 'anomalous'.
- More generally, score each test record with its degree of anomalousness.
- Flag records based on the desired false positive rate.
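The flagging step can be sketched as follows. This is only an illustration, not the authors' procedure: it assumes that lower scores mean "more anomalous" (as with the r values introduced later) and that scores from records believed to be normal are available for calibration. The names `threshold_for_fpr` and `flag_records` are hypothetical.

```python
import math


def threshold_for_fpr(reference_scores, fpr):
    """Pick a threshold so that about `fpr` of the reference scores fall at or below it.

    Assumption: reference_scores come from records believed to be normal,
    and a lower score means a more anomalous record.
    """
    ordered = sorted(reference_scores)
    k = int(fpr * len(ordered))  # number of reference records we allow to be flagged
    if k == 0:
        return -math.inf  # flag nothing
    return ordered[k - 1]


def flag_records(test_scores, threshold):
    """Flag every test record whose score is at or below the threshold."""
    return [score <= threshold for score in test_scores]


reference = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
threshold = threshold_for_fpr(reference, fpr=0.2)
flags = flag_records([1.5, 5, 2], threshold)
```

Ties in the reference scores can push the realized false positive rate slightly above the requested one, so the threshold is best read as approximate.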

Page 5:

Problem Overview

FPORT | USPORT | COUNTRY | SLINE | VESSEL | SHIPPER NAME | F NAME | COMMODITY | SIZE | MTONS | VALUE
YOKOHAMA | SEATTLE | JAPAN | CSCO | LING_YUN_HE | AMERICAN_TRI_NET_EXPRESS | TRI_NET | EMPTY_RACK | 0 | 5.6 | 27579
YOKOHAMA | SEATTLE | JAPAN | CSCO | LING_YUN_HE | ORDER | ORDER_OF_SHIPPER | USED_TIRE | 2 | 13.43 | 9497
YOKOHAMA | SEATTLE | JAPAN | CSCO | LING_YUN_HE | ORDER | ORDER_OF_SHIPPER | USED_TIRE | 2 | 13.43 | 9497
YOKOHAMA | SEATTLE | JAPAN | CSCO | LING_YUN_HE | AMERICAN_TRI_NET_EXPRESS | TRI_NET | CRUDE_IODINE_PURITY | 1 | 17.68 | 251151
YOKOHAMA | SEATTLE | JAPAN | CSCO | LING_YUN_HE | NEW_WAVE_TRANSPORT | JIT | PANELS_F_MODEL_98 | 3 | 39.57 | 65169
YOKOHAMA | SEATTLE | JAPAN | CSCO | LING_YUN_HE | NEW_WAVE_TRANSPORT | JIT | PANELS_F_MODEL_98 | 3 | 39.57 | 65169
YOKOHAMA | SEATTLE | JAPAN | CSCO | LING_YUN_HE | NEW_WAVE_TRANSPORT | JIT | PANELS_F_MODEL_98 | 3 | 39.57 | 65169
YOKOHAMA | SEATTLE | JAPAN | CSCO | LING_YUN_HE | ORDER | ORDER_OF_SHIPPER | USED_TIRES | 2 | 13.43 | 9497
YOKOHAMA | SEATTLE | JAPAN | CSCO | LING_YUN_HE | CHINA_OCEAN_SHPG | CHINA_OCEAN_SHPG_AGENCY | EMPTY_CONTAINERS | 0 | 0 | 0
YOKOHAMA | SEATTLE | JAPAN | CSCO | LING_YUN_HE | CHINA_OCEAN_SHPG | CHINA_OCEAN_SHPG_AGENCY | EMPTY_CONTAINERS | 0 | 0 | 0

Example Dataset – PIERS Data


Page 7:

Related Work

Likelihood Based Methods
- Dependency Trees [Pelleg ’04]
- Bayes Network
  - Network Intrusion Detection [Ye and Xu ’00; Bronstein et al. ’01]
  - Malicious Email Detection [Shih et al. ’04]
  - Disease Outbreak Detection [Wong et al. ’03]

Learn a probability distribution model from training data. Anomalies: test set records having unusually low likelihood in the learnt model.

Association Rule Learners
- LERAD [Chan et al. ’06]
  - Learn rules of the form X → Y.
  - Anomaly score depends on P(¬Y|X).
- Hidden Association Rules [Banderas et al. ’05]

Page 8:

Outline
- Problem Motivation/Overview
- Related Work
- Conditional Anomaly
  - Motivation/Definition
  - Algorithm for Testing Records
  - Estimating Probability Values
  - Speedup Tricks
- Marginal Anomaly
- Datasets and Results

Page 9:

Conditional Anomaly

Suppose P(Commodity | Country) is a factor in the Bayes network.

In test record t: Commodity = Gold, Country = China

P(Gold | China) = 0.001, P(Gold) = 0.001

Is this an anomaly?

Page 10:

Conditional Anomaly

Normalize the conditional probability by the marginal:

$$r(\text{Commodity}_t, \text{Country}_t) = \frac{P(\text{Gold} \mid \text{China})}{P(\text{Gold})} = \frac{0.001}{0.001} = 1$$

Page 11:

Conditional Anomaly

Compare with a second test record: Commodity = Copper, Country = China, where P(Copper | China) = 0.001 and P(Copper) = 0.1.

$$r(\text{Commodity}_t, \text{Country}_t) = \frac{P(\text{Copper} \mid \text{China})}{P(\text{Copper})} = \frac{0.001}{0.1} = 0.01$$

$$r(\text{Commodity}_t, \text{Country}_t) = \frac{P(\text{Gold} \mid \text{China})}{P(\text{Gold})} = \frac{0.001}{0.001} = 1$$

Gold is equally rare with or without China (r = 1), whereas Copper is 100 times rarer given China than it is overall (r = 0.01).


Page 13:

Conditional Anomaly

$$r(a_t, b_t) = \frac{P(a_t \mid b_t)}{P(a_t)} = \frac{P(b_t \mid a_t)}{P(b_t)} = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}$$

- The r value is defined over two attribute values a_t and b_t of attributes A and B in test record t.
- A small r value denotes a strong negative dependence between the occurrence of values a_t and b_t.
- a_t and b_t co-occurring in t and r(a_t, b_t) << 1 ⇒ t is anomalous.
- In general, we can consider sets of attributes A and B to calculate r. For example, with A = {Commodity} and B = {US Port, Country}:

$$r(a_t, b_t) = \frac{P(\text{Country}_t, \text{US Port}_t, \text{Commodity}_t)}{P(\text{Commodity}_t)\,P(\text{US Port}_t, \text{Country}_t)}$$

- We limit the number of attributes in each set to k. Hence, we consider up to 2k attributes at a time.
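As an illustration of the definition, the sketch below estimates r for two attribute sets by direct counting over a toy training set. The function name `r_value`, the record layout (one dict per record), and the toy data are assumptions made for this example; the probabilities are plain relative frequencies.

```python
def r_value(records, attrs_a, attrs_b, test_record):
    """Estimate r(a_t, b_t) = P(a_t, b_t) / (P(a_t) * P(b_t)) by counting."""

    def project(record, attrs):
        return tuple(record[name] for name in attrs)

    a_t = project(test_record, attrs_a)
    b_t = project(test_record, attrs_b)
    n = len(records)
    c_a = sum(project(rec, attrs_a) == a_t for rec in records)
    c_b = sum(project(rec, attrs_b) == b_t for rec in records)
    c_ab = sum(project(rec, attrs_a) == a_t and project(rec, attrs_b) == b_t
               for rec in records)
    if c_a == 0 or c_b == 0:
        return None  # undefined without smoothing when a value was never seen
    return (c_ab / n) / ((c_a / n) * (c_b / n))


# Toy training data (invented for illustration only).
training = (
    [{"Commodity": "Copper", "US Port": "Seattle", "Country": "Japan"}] * 4
    + [{"Commodity": "Gold", "US Port": "Los Angeles", "Country": "China"}] * 4
    + [{"Commodity": "Copper", "US Port": "Los Angeles", "Country": "China"}]
    + [{"Commodity": "Gold", "US Port": "Seattle", "Country": "Japan"}]
)
test_record = {"Commodity": "Copper", "US Port": "Los Angeles", "Country": "China"}
r = r_value(training, ["Commodity"], ["US Port", "Country"], test_record)
```

Here Copper and (Los Angeles, China) each occur in 5 of 10 records but co-occur only once, so r is below 1, indicating negative dependence.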

Page 14:

Conditional Anomaly: Algorithm

Algorithm for testing record t:
- For each mutually exclusive pair of attribute sets {A, B}, compute:

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}$$

  This gives an exponential number of r values: $O(m^{2k})$.
- Score the record t based on all the r values.
  - Heuristic 1: Assign the minimum r value as the score.
  - Heuristic 2 (combining evidence): Combine evidence from other subsets of attributes by taking the product of r values.
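The testing loop with Heuristic 1 might look like the sketch below. It is an assumption-laden illustration rather than the authors' implementation: records are dicts, counts are obtained by a linear scan, and each probability is smoothed by adding 1 to the count and 2 to the total so that unseen combinations do not cause division by zero. The function names are hypothetical.

```python
from itertools import combinations


def attribute_set_pairs(attributes, k):
    """Yield each unordered pair of disjoint attribute sets, each of size 1..k."""
    subsets = [s for size in range(1, k + 1)
               for s in combinations(attributes, size)]
    for i, set_a in enumerate(subsets):
        for set_b in subsets[i + 1:]:
            if not set(set_a) & set(set_b):
                yield set_a, set_b


def count(records, attrs, values):
    """Number of training records whose projection onto attrs equals values."""
    return sum(tuple(rec[x] for x in attrs) == values for rec in records)


def score_record(records, test_record, k=1):
    """Heuristic 1: the minimum smoothed r value over all attribute-set pairs.

    Requires at least two attributes in the test record.
    """
    n = len(records)
    r_values = []
    for set_a, set_b in attribute_set_pairs(sorted(test_record), k):
        a_t = tuple(test_record[x] for x in set_a)
        b_t = tuple(test_record[x] for x in set_b)
        c_a = count(records, set_a, a_t)
        c_b = count(records, set_b, b_t)
        c_ab = count(records, set_a + set_b, a_t + b_t)
        r = ((c_ab + 1) / (n + 2)) * ((n + 2) / (c_a + 1)) * ((n + 2) / (c_b + 1))
        r_values.append(r)
    return min(r_values)


# Toy data: x and y always agree in training, so a record where they
# disagree should receive a low score.
training = [{"x": 0, "y": 0}] * 9 + [{"x": 1, "y": 1}] * 9
anomalous_score = score_record(training, {"x": 0, "y": 1})
normal_score = score_record(training, {"x": 0, "y": 0})
```

The linear-scan `count` is the naive O(N) lookup; it is used here only to keep the example self-contained.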

Page 15:

Estimating Probability Values

Maximum Likelihood Estimation:

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$$

- C(a_t): number of training instances having A = a_t.
- N: total number of training cases.

Laplace Smoothing: let p = P(a_t). Then

$$E[p] = \frac{C(a_t) + 1}{N + 2}$$

$$r(a_t, b_t) = \frac{C(a_t, b_t) + 1}{N + 2} \times \frac{N + 2}{C(a_t) + 1} \times \frac{N + 2}{C(b_t) + 1}$$
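The two estimators can be compared directly on counts. The sketch below is a small illustration; the function names `r_mle` and `r_laplace` and the example counts are assumptions introduced here.

```python
def r_mle(c_ab, c_a, c_b, n):
    """r from maximum likelihood estimates: (C(a,b)/N) * (N/C(a)) * (N/C(b))."""
    return (c_ab / n) * (n / c_a) * (n / c_b)


def r_laplace(c_ab, c_a, c_b, n):
    """r from Laplace-smoothed estimates, E[p] = (C + 1) / (N + 2)."""
    return ((c_ab + 1) / (n + 2)) * ((n + 2) / (c_a + 1)) * ((n + 2) / (c_b + 1))


# A combination never seen in training: the MLE collapses to exactly 0,
# while the smoothed estimate stays small but positive.
unseen_mle = r_mle(0, 9, 9, 18)
unseen_smoothed = r_laplace(0, 9, 9, 18)

# With large counts the two estimates nearly coincide.
large_mle = r_mle(500, 1000, 1000, 2000)
large_smoothed = r_laplace(500, 1000, 1000, 2000)
```

Smoothing matters most for exactly the cases of interest here: combinations with very low or zero counts.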

Page 16:

Estimating Probability Values

Speedup Trick 1: Rare values of attributes can be ignored.

$$r(a_t, b_t) < \alpha \;\Rightarrow\; C(a_t) + 1 > \frac{1}{\alpha} \;\;\&\;\; C(b_t) + 1 > \frac{1}{\alpha}$$

Decrease the arity of attributes: replace all rare values with a generic rare value.
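A minimal sketch of the arity reduction, under stated assumptions: a value is treated as rare when its smoothed count cannot produce an r value below the threshold α, that is, when C(v) + 1 does not exceed 1/α. The placeholder token `RARE` and the function name `reduce_arity` are hypothetical.

```python
from collections import Counter

RARE = "__RARE__"  # hypothetical placeholder for the generic rare value


def reduce_arity(column, alpha):
    """Replace every value v with C(v) + 1 <= 1/alpha by a generic rare value."""
    counts = Counter(column)
    return [v if counts[v] + 1 > 1 / alpha else RARE for v in column]


column = ["a"] * 20 + ["b"] * 3 + ["c"]
reduced = reduce_arity(column, alpha=0.1)
```

With α = 0.1, only values seen at least 10 times keep their identity, so the three distinct values above shrink to two.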

Page 17:

Estimating Probability Values

Speedup Trick 2: To estimate counts from the training data, use a very efficient caching data structure: AD-Tree [Moore and Lee ’98].
- The naïve method is O(N).
- An AD-Tree pre-computes the values of most queries and requires a small computation for some queries. Its query cost is independent of N.
- Construct an AD-Tree on the reduced-arity attributes.



Page 20:

Marginal Anomaly

What about rare values?
- Import of Plutonium
- Import of $1 million worth

What is the probability of seeing something this rare or rarer? Consider attribute set A of up to k attributes.

$$qval(a_t) = \sum_{a_i \in X} P(a_i), \quad \text{where } X = \{x : P(A = x) \le P(A = a_t)\}$$

[Figure: two bar charts of probability of occurrence, one over the values of attribute A (a1 to a50) and one over the values of attribute B (b1 to b9), annotated with qval = 0.47 and qval = 0.01.]
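The qval definition can be sketched for a single attribute as follows. This is an illustrative assumption-based example: probabilities are relative frequencies in a toy column, and the function name `qval` simply mirrors the slide's notation.

```python
from collections import Counter


def qval(column, a_t):
    """Total probability mass of values that are as rare as a_t or rarer."""
    counts = Counter(column)
    n = len(column)
    c_t = counts[a_t]  # 0 for a value never seen in training
    return sum(c for c in counts.values() if c <= c_t) / n


column = ["a"] * 6 + ["b"] * 3 + ["c"]
q_common = qval(column, "a")
q_medium = qval(column, "b")
q_rare = qval(column, "c")
q_unseen = qval(column, "z")
```

The same rare value can yield very different qvals depending on how many other rare values the attribute has, which is the point of the two charts: a low probability alone does not make a value surprising.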


Page 22:

Datasets: PIERS Dataset

# | Attribute | Arity
1 | Country | 22
2 | Foreign Port | 42
3 | US Port | 16
4 | Shipping Line | 4
5 | Shipper Name | 4218
6 | Importer Name | 6412
7 | Commodity Description | 1649
8 | Size | 5
9 | Weight | 5
10 | Value | 5


Page 24:

Datasets: PIERS Dataset

- No labeled anomalies.
- Anomaly Generation, Method 1:
  - Select a test record to be modified.
  - Randomly choose an attribute.
  - Flip the value of the chosen attribute, drawing from the attribute marginal.
- Anomaly Generation, Method 2: Insert records from a different time period.
- 100,000 training records and 10,000 test records.
- 10% of the test records are generated to be anomalous.
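Method 1 could be sketched like this. The details are assumptions: the marginal is sampled by copying the attribute's value from a uniformly chosen training record, and the function name `inject_flip` is hypothetical. Whether a draw equal to the original value should be rejected and redrawn is not specified, so this sketch leaves such draws as they are.

```python
import random


def inject_flip(record, training_records, rng):
    """Return a copy of record with one randomly chosen attribute resampled.

    Copying the value from a uniformly chosen training record draws it
    from the empirical marginal of that attribute.
    """
    attr = rng.choice(sorted(record))
    modified = dict(record)
    modified[attr] = rng.choice(training_records)[attr]
    return modified, attr


rng = random.Random(0)
training = [
    {"Country": "JAPAN", "Size": "1"},
    {"Country": "CHINA", "Size": "2"},
    {"Country": "CHINA", "Size": "3"},
]
original = {"Country": "JAPAN", "Size": "1"}
modified, flipped_attr = inject_flip(original, training, rng)
```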

Page 25:

Results: PIERS Dataset

Performance of methods for random attribute flips

The proportion of true anomalies detected: $\frac{\#\,\text{True Positives}}{\#\,\text{Positives}}$

Page 26:

Results: PIERS Dataset

Performance of methods for records inserted from a different month

Page 27:

Datasets: KDD Cup 99 Dataset

Records correspond to individual network sessions. Features:
- Basic features of an individual TCP connection: duration, protocol type, number of bytes transferred, etc.
- Features obtained using some domain knowledge: number of file creation operations, number of failed login attempts, etc.
- Features computed using a two-second time window: number of connections to the same service, etc.

In total there are 41 features, most of them taking continuous values. These are discretized to 5 levels.

We selected six different attack types: apache2, guess password, mailbomb, neptune, snmpguess and snmpgetattack.

Page 28:

Results: KDD Cup 99

Comparison of performance for Conditional and Bayes Net methods on KDD Cup dataset

Page 29:

Summary

Detecting anomalies by learning a single probability distribution model and computing whole-record likelihoods is problematic:
- High arity leads to detecting rare attribute values.
- The signal in some features gets washed out in the noise of the rest of the features.
- Anomalies highlight mistakes in model learning.

We propose new approaches to solve this:
- Consider all subsets of features up to some size.
- Define r-values, which can indicate anomalies arising out of the co-occurrence of highly negatively correlated values.
- Empirical results on real data sets demonstrate improved anomaly detection.
- The time and memory requirements of our algorithms are comparable to those of the baseline methods.

Page 30:

Thank You!

Please visit poster board #39

Page 31:

Time and Memory Requirements

Table 1: Time and space requirements for the Bayes Network method

Dataset | Training Size | Test Size | Number of Attributes | Training Time (secs) | Testing Time (secs) | Memory (MB)
CBP | 100,000 | 10,000 | 10 | 6.9 | 4.7 | 4.5
KDD Cup 99 | 100,000 | 10,000 | 41 | 297 | 1.6 | 152

Table 2: Time and space requirements for the Conditional and Marginal methods

Dataset | Number of Attributes | k | Training Time (secs) | Testing Time (secs) | Memory (MB) | Marginal Memory (MB)
CBP | 10 | 1 | 7.6 | 16.8 | 337 | 334
CBP | 10 | 2 | 7.8 | 133 | 338 | 340
CBP | 10 | 3 | 9.3 | 790 | 341 | 489
KDD Cup 99 | 41 | 1 | 10.2 | 15 | 323 | 222
KDD Cup 99 | 41 | 2 | 44 | 7145 | 332 | 2618

Page 32:

Related Work

Supervised classification based approaches
- Decision Trees [Lee et al. ’98]
- Neural Network [Ghosh et al. ’99]
- SVMs [Li et al. ’03; Shon et al. ’05]
- Sequence Analysis [Hofmeyr et al. ’98; Helman ’97]

Unsupervised approaches applied to real valued data
- k-NN [Yang and Liu ’99]
- Clustering [Eskin ’02]
- GMM [Roberts and Tarassenko ’94]

Page 33:

Estimating Probability Values

Speedup Trick 1: Rare values of attributes can be ignored.

$$r(a_t, b_t) < \alpha$$

$$\Rightarrow\; \frac{C(a_t, b_t) + 1}{N + 2} \times \frac{N + 2}{C(a_t) + 1} \times \frac{N + 2}{C(b_t) + 1} < \alpha$$

$$\Rightarrow\; \frac{C(a_t, b_t) + 1}{C(a_t) + 1} < \alpha$$

$$\Rightarrow\; \frac{1}{C(a_t) + 1} < \alpha$$

$$\Rightarrow\; C(a_t) + 1 > \frac{1}{\alpha}$$

Page 34:

Marginal Anomaly Detection

In test record t, the attribute combination A has value a_t.
- Compute C(a_t), and C(a_i) for all values a_i of A which are rarer than a_t.
- For each attribute set A, precompute:
  - Histogram function, h_A(i): number of values of A that occur i times in the training data.
  - Cumulative histogram, ch_A(i): number of records having values of A that occur i times or less in the training data.
- Construct a non-reduced AD-Tree on the training data.

$$ch_A(i) = \sum_{j=1}^{i} j \cdot h_A(j)$$

$$pval(a_t) = \frac{ch_A(C(a_t))}{N}$$
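The histogram-based computation can be sketched as below for a single attribute. This is an illustration under assumptions: counts come from a plain `Counter` rather than an AD-Tree, the cumulative histogram is stored only at counts that actually occur, and the function names are hypothetical.

```python
from collections import Counter


def build_cumulative_histogram(column):
    """Return (C, ch): value counts C(a) and the cumulative histogram ch_A."""
    value_counts = Counter(column)          # C(a) for every value a
    h = Counter(value_counts.values())      # h_A(i): number of values seen i times
    ch = {}
    running = 0
    for i in sorted(h):
        running += i * h[i]                 # i * h_A(i) records have count exactly i
        ch[i] = running                     # ch_A(i) = sum over j <= i of j * h_A(j)
    return value_counts, ch


def pval(value_counts, ch, n, a_t):
    """pval(a_t) = ch_A(C(a_t)) / N; a value never seen in training gets 0."""
    c_t = value_counts.get(a_t, 0)
    if c_t == 0:
        return 0.0
    return ch[c_t] / n


column = ["a", "a", "b", "b", "c"]
counts, ch = build_cumulative_histogram(column)
p_a = pval(counts, ch, len(column), "a")
p_c = pval(counts, ch, len(column), "c")
p_unseen = pval(counts, ch, len(column), "z")
```

Precomputing `ch` once per attribute set means each test-time lookup needs only the count of the observed value.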

Page 35:

Marginal Anomaly Detection

Testing algorithm for record t:
- For each composite attribute A, compute:

$$pval(a_t) = \frac{ch_A(C(a_t))}{N}$$

- Assign the minimum p-value as the score of record t.

Page 36:

Grouping

Group (cluster) the records according to some similarity measure. Find clusters in data with unusual concentration of anomalies.
- k-NN
- k-means (modes)
- GDA (link detection)

Define a collection of groups independent of data. Search over all possible groups to find unusual concentration of anomalies.
- WSARE
- Scan Stats


Page 38:

Grouping

Group (cluster) the records according to some similarity measure. Find clusters in data with unusual concentration of anomalies.
- (Dis)similarity measure: Hamming Distance
- Generative model: distribution of distances, P0(d); PA(d)

[Equation not recoverable from the transcript: a maximization involving P(D(T_i, T_j)) over pairs of records T_i, T_j in a group G.]

N x N Hamming distances:

   | T1 | T2 | T3 | T4 | T5
T1 | 0 | 1 | 3 | 3 | 2
T2 | 1 | 0 | 7 | 6 | 4
T3 | 3 | 7 | 0 | 6 | 7
T4 | 3 | 6 | 6 | 0 | 3
T5 | 2 | 4 | 7 | 3 | 0

[Figure: Group Chart (G), marking which of the records T1 to T5 belong to each of the groups G1 to G6; the marked positions are not recoverable from the transcript.]
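For categorical records, the Hamming distance is the number of attributes on which two records disagree. The sketch below builds the N x N matrix for a few toy records; the record values and function names are invented for illustration and are not the records behind the matrix above.

```python
def hamming(record_a, record_b):
    """Number of positions at which two equal-length records differ."""
    return sum(1 for x, y in zip(record_a, record_b) if x != y)


def distance_matrix(records):
    """Symmetric N x N matrix of pairwise Hamming distances."""
    return [[hamming(a, b) for b in records] for a in records]


records = [
    ("YOKOHAMA", "SEATTLE", "JAPAN"),
    ("YOKOHAMA", "SEATTLE", "CHINA"),
    ("YANTIAN", "LOS_ANGELES", "CHINA"),
]
distances = distance_matrix(records)
```

Computing all pairs costs O(N^2) comparisons, which is worth keeping in mind at the dataset sizes discussed earlier.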

Page 39:

Grouping

Group (cluster) the records according to some similarity measure. Find clusters in data with unusual concentration of anomalies.
- (Dis)similarity measure: Hamming Distance

Likelihood ratio:
- Null Hypothesis: Attribute values belong to the training distribution.
- Alternate Hypothesis: Attribute values come from a different distribution.

$$LR_G = \frac{\prod_{T_i \in G} L_A(T_i)}{\prod_{T_i \in G} L(T_i)}$$

Include records T_i in G to maximize LR_G.

Constraints: Include all records within a distance R of the cluster center (leader).

Page 40:

Grouping

Define a collection of groups independent of data. Search over all possible groups to find unusual concentration of anomalies.
- Groups: all records with matching value(s) of particular attribute(s).
- Detection method: Fisher’s exact test on:

    M_test(A=a)    N_test(A=a)
    M_train(A=a)   N_train(A=a)

- Low values of the ratio:

$$R = \frac{M_{train} / N_{train}}{M_{test} / N_{test}}$$

Page 41:

Grouping

Detection method: Fisher’s exact test on the table of M_test(A=a), N_test(A=a), M_train(A=a) and N_train(A=a), looking for low values of the ratio:

$$R = \frac{M_{train} / N_{train}}{M_{test} / N_{test}}$$

Examples:

group no: 1; Score: 3.508139e-01
Train Total: 1065; Train Anom: 19; Test Total: 132; Test Anom: 8
1:LOS_ANGELES, 7:v4:15+, 9:v4:500000+,

group no: 2; Score: 4.409127e-01
Train Total: 985; Train Anom: 32; Test Total: 105; Test Anom: 15
0:YANTIAN, 3:MLSL, 4:CHASTINE_MAERSK,

group no: 3; Score: 4.409127e-01
Train Total: 985; Train Anom: 32; Test Total: 105; Test Anom: 15
0:YANTIAN, 4:CHASTINE_MAERSK,
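The ratio R can be evaluated on the example groups as sketched below. One assumption is made explicit: M is taken to be the "Anom" count and N the "Total" count from the listings, which the slides do not state. The function name `anomaly_ratio` is hypothetical, and the listed Score values are not reproduced here because their exact definition is not given.

```python
def anomaly_ratio(m_train, n_train, m_test, n_test):
    """R = (M_train / N_train) / (M_test / N_test).

    Assumption: M is the number of anomalous records in the group and
    N is the total number of records in the group.
    """
    return (m_train / n_train) / (m_test / n_test)


# Counts copied from the example groups above.
ratio_group_1 = anomaly_ratio(19, 1065, 8, 132)
ratio_group_2 = anomaly_ratio(32, 985, 15, 105)
```

Under this reading, both groups have R well below 1, meaning the anomalous fraction in the test set is several times higher than in the training set.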

Page 42:

Comparisons

Comparison of performance by grouping anomalies

Page 43:

Comparisons

- WSARE
- Scan Statistics: real-valued geographic coordinates.

Mtest(A=a) Ntest(A=a)

Mtrain(A=a) Ntrain(A=a)

Ntest(A=a) Ntest(A≠a)

Ntrain(A=a) Ntrain(A≠a)

Page 44:

Conditional Anomaly

$$r(a_t, b_t) = \frac{P(a_t \mid b_t)}{P(a_t)} = \frac{P(b_t \mid a_t)}{P(b_t)} = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}$$

- A small r value denotes a strong negative dependence between the occurrence of values a_t and b_t.
- When we observe a_t and b_t co-occurring in a record t and r has an unusually low value, we conclude that this is an anomaly.
- In general, we can consider sets of attributes A and B to calculate r. For example, with A = {Commodity} and B = {US Port, Country}:

$$r(a_t, b_t) = \frac{P(\text{Country}_t, \text{US Port}_t, \text{Commodity}_t)}{P(\text{Commodity}_t)\,P(\text{US Port}_t, \text{Country}_t)}$$

- We limit the number of attributes in each set to k. Hence, we consider up to 2k attributes at a time.