TRANSCRIPT
1
Anomaly Detection in Categorical Datasets
Kaustav Das, Jeff Schneider
Machine Learning Department
Carnegie Mellon University
2
Outline
  Problem Motivation/Overview
  Related Work
  Conditional Anomaly
  Marginal Anomaly
  Datasets and Results
3
Problem Motivation
  Import of Containers
  Astronomical Data
  Emergency Department
  Network Intrusion Detection
Detect anomalous records in large amounts of record-based data.
4
Problem Overview
Training Data:
  Categorical dataset – real values are categorized.
  Large number of records – 100,000 to 1 million.
  Unlabelled: a small fraction (<1-2%) can be anomalous.
  Attributes can have high arity, up to 5,000-10,000.
Test Data:
  Same properties as above; can have any fraction of anomalous records.
Goal:
  Detect records in the test set that are 'anomalous'. More generally, score each test record by its degree of anomalousness. Flag records based on the desired false-positive rate.
5
Problem Overview
FPORT USPORT COUNTRY SLINE VESSEL SHIPPER_NAME F_NAME COMMODITY SIZE MTONS VALUE
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE AMERICAN_TRI_NET_EXPRESS TRI_NET EMPTY_RACK 0 5.6 27579
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE ORDER ORDER_OF_SHIPPER USED_TIRE 2 13.43 9497
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE ORDER ORDER_OF_SHIPPER USED_TIRE 2 13.43 9497
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE AMERICAN_TRI_NET_EXPRESS TRI_NET CRUDE_IODINE_PURITY 1 17.68 251151
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE NEW_WAVE_TRANSPORT JIT PANELS_F_MODEL_98 3 39.57 65169
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE NEW_WAVE_TRANSPORT JIT PANELS_F_MODEL_98 3 39.57 65169
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE NEW_WAVE_TRANSPORT JIT PANELS_F_MODEL_98 3 39.57 65169
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE ORDER ORDER_OF_SHIPPER USED_TIRES 2 13.43 9497
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE CHINA_OCEAN_SHPG CHINA_OCEAN_SHPG_AGENCY EMPTY_CONTAINERS 0 0 0
YOKOHAMA SEATTLE JAPAN CSCO LING_YUN_HE CHINA_OCEAN_SHPG CHINA_OCEAN_SHPG_AGENCY EMPTY_CONTAINERS 0 0 0
Example Dataset – PIERS Data
6
Related Work
Likelihood-Based Methods
  Learn a probability distribution model from training data. Anomalies: test-set records having unusually low likelihood under the learnt model.
  Bayes Network:
    Network Intrusion Detection [Ye and Xu '00; Bronstein et al. '01]
    Malicious Email Detection [Shih et al. '04]
    Disease Outbreak Detection [Wong et al. '03]
  Dependency Trees [Pelleg '04]
7
Related Work
Likelihood-Based Methods
  Learn a probability distribution model from training data. Anomalies: test-set records having unusually low likelihood under the learnt model.
  Bayes Network:
    Network Intrusion Detection [Ye and Xu '00; Bronstein et al. '01]
    Malicious Email Detection [Shih et al. '04]
    Disease Outbreak Detection [Wong et al. '03]
  Dependency Trees [Pelleg '04]
Association Rule Learners
  LERAD [Chan et al. '06]: learn rules of the form X → Y; the anomaly score depends on P(¬Y|X)
  Hidden Association Rules [Banderas et al. '05]
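As a minimal illustration of the likelihood-based approach, the sketch below scores records by their log-likelihood under a fully factored (independence) model; the cited work uses Bayes networks and dependency trees instead, and the function names here are ours:

```python
from collections import Counter
from math import log

def train_independence_model(records):
    # Fully factored stand-in for the learned model: P(record) = prod_i P(x_i),
    # with one empirical marginal per attribute.
    N = len(records)
    return [{v: c / N for v, c in Counter(col).items()} for col in zip(*records)]

def log_likelihood(model, record, eps=1e-9):
    # Anomalies = records with unusually low likelihood; unseen values get a
    # small floor probability instead of zero.
    return sum(log(p.get(v, eps)) for p, v in zip(model, record))
```

Records with the lowest log-likelihood would be flagged, down to the desired false-positive rate.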
8
Outline
  Problem Motivation/Overview
  Related Work
  Conditional Anomaly
    Motivation/Definition
    Algorithm for Testing Records
    Estimating Probability Values
    Speedup Tricks
  Marginal Anomaly
  Datasets and Results
9
Conditional Anomaly
Suppose P(Commodity|Country) is a factor in the Bayes network. In test record t: Commodity = Gold, Country = China.
P(Gold|China) = 0.001; P(Gold) = 0.001
Is this an anomaly?
10
Conditional Anomaly
Suppose P(Commodity|Country) is a factor in the Bayes network. In test record t: Commodity = Gold, Country = China.
P(Gold|China) = 0.001; P(Gold) = 0.001
Normalize:
  r(Commodity_t, Country_t) = P(Gold | China) / P(Gold) = 0.001 / 0.001 = 1
11
Conditional Anomaly
Suppose P(Commodity|Country) is a factor in the Bayes network.
In test record t: Commodity = Gold, Country = China:
  P(Gold|China) = 0.001; P(Gold) = 0.001
  r(Commodity_t, Country_t) = P(Gold | China) / P(Gold) = 0.001 / 0.001 = 1
Commodity = Copper, Country = China:
  P(Copper|China) = 0.001; P(Copper) = 0.1
  r(Commodity_t, Country_t) = P(Copper | China) / P(Copper) = 0.001 / 0.1 = 0.01
12
Conditional Anomaly
The r-value is defined over two attribute values a_t and b_t of attributes A and B in test record t:
  r(a_t, b_t) = P(a_t | b_t) / P(a_t) = P(b_t | a_t) / P(b_t) = P(a_t, b_t) / (P(a_t) P(b_t))
A small r-value denotes a strong negative dependence between the occurrence of values a_t and b_t.
If a_t and b_t co-occur in t and r(a_t, b_t) << 1, then t is anomalous!
13
Conditional Anomaly
The r-value is defined over two attribute values a_t and b_t of attributes A and B in test record t:
  r(a_t, b_t) = P(a_t | b_t) / P(a_t) = P(b_t | a_t) / P(b_t) = P(a_t, b_t) / (P(a_t) P(b_t))
A small r-value denotes a strong negative dependence between the occurrence of values a_t and b_t. If a_t and b_t co-occur in t and r(a_t, b_t) << 1, then t is anomalous!
In general, we can consider sets of attributes A and B to calculate r. For example,
  A = {Commodity}, B = {US Port, Country}
  r(a_t, b_t) = P(Country_t, US Port_t, Commodity_t) / (P(Commodity_t) P(US Port_t, Country_t))
We limit the number of attributes in each set to k. Hence, we consider up to 2k attributes at a time.
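The r-value above can be computed directly from empirical counts. A minimal sketch (maximum-likelihood estimates with no smoothing; `records` is a list of tuples, A and B are tuples of attribute indices, and the function name is ours):

```python
def r_value(records, test_record, A, B):
    # r(a_t, b_t) = P(a_t, b_t) / (P(a_t) P(b_t)), using empirical probabilities.
    N = len(records)
    a_t = tuple(test_record[i] for i in A)
    b_t = tuple(test_record[i] for i in B)
    c_a = sum(1 for rec in records if tuple(rec[i] for i in A) == a_t)
    c_b = sum(1 for rec in records if tuple(rec[i] for i in B) == b_t)
    c_ab = sum(1 for rec in records
               if tuple(rec[i] for i in A) == a_t and tuple(rec[i] for i in B) == b_t)
    return (c_ab / N) / ((c_a / N) * (c_b / N))
```

Note that r is symmetric in A and B, matching the three equivalent forms of the definition.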
14
Conditional Anomaly – Algorithm
Algorithm for testing record t:
  For each mutually exclusive pair of attribute sets {A, B}, compute:
    r(a_t, b_t) = P(a_t, b_t) / (P(a_t) P(b_t))
  Score the record t based on all the r-values.
    Heuristic 1: assign the minimum r-value as the score.
    Heuristic 2 (combining evidence): combine evidence from other subsets of attributes by taking the product of r-values.
Exponential number of r-values: O(m^2k).
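A sketch of this testing loop (the exact enumeration of disjoint set pairs, and restricting Heuristic 2's product to r-values below 1, are our reading of the slide; it also assumes the test record's values appear in training, since unsmoothed zero probabilities would divide by zero):

```python
from itertools import combinations

def score_record(records, test_record, k=1, combine="min"):
    # Enumerate disjoint attribute-set pairs {A, B} with |A|, |B| <= k,
    # compute r(a_t, b_t) for each, and combine into a score
    # (smaller score = more anomalous).
    N = len(records)
    attrs = range(len(test_record))

    def prob(idx, vals):
        return sum(1 for rec in records
                   if all(rec[i] == v for i, v in zip(idx, vals))) / N

    r_values = []
    for ka in range(1, k + 1):
        for A in combinations(attrs, ka):
            rest = [i for i in attrs if i not in A]
            for kb in range(1, k + 1):
                for B in combinations(rest, kb):
                    a_t = tuple(test_record[i] for i in A)
                    b_t = tuple(test_record[i] for i in B)
                    r = prob(A + B, a_t + b_t) / (prob(A, a_t) * prob(B, b_t))
                    r_values.append(r)
    if combine == "min":
        return min(r_values)          # Heuristic 1
    score = 1.0
    for r in r_values:                # Heuristic 2: multiply the r-values below 1
        if r < 1:
            score *= r
    return score
```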
15
Estimating Probability Values
Maximum Likelihood Estimation:
  r(a_t, b_t) = P(a_t, b_t) / (P(a_t) P(b_t)) = (C(a_t, b_t) / N) / ((C(a_t) / N) (C(b_t) / N))
  C(a_t): number of training instances having A = a_t
  N: total number of training cases
Laplace Smoothing: let p = P(a_t)
  E[p] = (C(a_t) + 1) / (N + 2)
  r(a_t, b_t) = ((C(a_t, b_t) + 1) / (N + 2)) / (((C(a_t) + 1) / (N + 2)) ((C(b_t) + 1) / (N + 2)))
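The smoothed r-value as a function of raw counts is a direct transcription of the formula above (the function name is ours):

```python
def r_laplace(c_ab, c_a, c_b, N):
    # Laplace smoothing: every probability p is estimated as (count + 1) / (N + 2).
    p_ab = (c_ab + 1) / (N + 2)
    p_a = (c_a + 1) / (N + 2)
    p_b = (c_b + 1) / (N + 2)
    return p_ab / (p_a * p_b)
```

Unlike the maximum-likelihood estimate, this never divides by zero, even for value pairs unseen in training.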
16
Estimating Probability Values
Speedup Trick 1: rare values of attributes can be ignored.
  r(a_t, b_t) ≥ α whenever C(a_t) + 1 ≤ 1/α or C(b_t) + 1 ≤ 1/α
Decrease the arity of attributes: replace all rare values with a generic rare value.
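A sketch of the arity-reduction step. The pruning threshold C(a) + 1 ≤ 1/α follows the bound above (values below it can never yield r < α); the RARE token name is ours:

```python
from collections import Counter

def reduce_arity(column, alpha):
    # Replace every value with C(a) + 1 <= 1/alpha by one generic RARE token:
    # such values cannot produce an anomalous (small) r-value, so merging them
    # shrinks the attribute's arity without changing which records are flagged.
    counts = Counter(column)
    threshold = 1 / alpha - 1
    return [v if counts[v] > threshold else "RARE" for v in column]
```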
17
Estimating Probability Values
Speedup Trick 2: to estimate counts from the training data, use a very efficient caching data structure: the AD-Tree [Moore and Lee '98].
  The naïve method is O(N). The AD-Tree pre-computes the values of most queries and requires only a small computation for the rest; it is independent of N.
  Construct an AD-Tree on the reduced-arity attributes.
18
Outline
  Problem Motivation/Overview
  Related Work
  Conditional Anomaly
  Marginal Anomaly
  Datasets and Results
19
Marginal Anomaly
What about rare values?
  Import of Plutonium
  Import of $1 million worth
What is the probability of seeing something this rare or rarer? Consider attribute set A of up to k attributes:
  qval(a_t) = Σ_{a_i ∈ X} P(a_i), where X = {x : P(A = x) ≤ P(A = a_t)}
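Computed over a single attribute for concreteness, qval can be sketched as below (a minimal sketch; generalizing `column` to tuples of values over an attribute set A is straightforward, and the function name is ours):

```python
from collections import Counter

def qval(column, a_t):
    # qval(a_t): sum of P(a_i) over all values a_i occurring with probability
    # <= P(a_t), i.e. the probability of seeing something this rare or rarer.
    counts = Counter(column)
    N = len(column)
    p_t = counts[a_t] / N
    return sum(c / N for c in counts.values() if c / N <= p_t)
```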
20
Marginal Anomaly
What about rare values?
  Import of Plutonium
  Import of $1 million worth
What is the probability of seeing something this rare or rarer? Consider attribute set A of up to k attributes:
  qval(a_t) = Σ_{a_i ∈ X} P(a_i), where X = {x : P(A = x) ≤ P(A = a_t)}
[Figure: probability-of-occurrence histograms over the values of attributes A and B, illustrating qval = 0.47 and qval = 0.01]
21
Outline
  Problem Motivation/Overview
  Related Work
  Conditional Anomaly
  Marginal Anomaly
  Datasets and Results
22
Datasets PIERS Dataset
Attribute Arity
1 Country 22
2 Foreign Port 42
3 US Port 16
4 Shipping Line 4
5 Shipper Name 4218
6 Importer Name 6412
7 Commodity Description 1649
8 Size 5
9 Weight 5
10 Value 5
24
Datasets PIERS Dataset
No labeled anomalies.
Anomaly Generation – Method 1:
  Select a test record to be modified. Randomly choose an attribute. Flip the value of the chosen attribute, drawing the new value from the attribute's marginal distribution.
Anomaly Generation – Method 2:
  Insert records from a different time period.
100,000 training records and 10,000 test records; 10% of test records are generated to be anomalous.
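Method 1 can be sketched as below (the function name is ours; `columns[i]` is the list of training values of attribute i, so a uniform choice over it is a draw from the empirical marginal):

```python
import random

def inject_flip_anomaly(record, columns, rng=random):
    # Method 1: pick one attribute at random and redraw its value from that
    # attribute's empirical marginal over the training data.
    record = list(record)
    i = rng.randrange(len(record))
    record[i] = rng.choice(columns[i])
    return record
```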
25
Results: PIERS Dataset
Performance of methods for random attribute flips
The proportion of true anomalies detected: #True Positives / #Positives.
26
Results: PIERS Dataset
Performance of methods for records inserted from a different month
27
Datasets KDD Cup 99 Dataset
Records correspond to individual network sessions. Features:
  Basic features of an individual TCP connection: duration, protocol type, number of bytes transferred, etc.
  Features obtained using some domain knowledge: number of file creation operations, number of failed login attempts, etc.
  Features computed using a two-second time window: number of connections to the same service, etc.
In total there are 41 features, most of them taking continuous values. Discretized to 5 levels.
We selected six different attack types: apache2, guess password, mailbomb, neptune, snmpguess and snmpgetattack.
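The discretization step might look like the equal-frequency binning below; the slides do not specify the binning scheme, so this is an assumption, and the function name is ours:

```python
def discretize(values, levels=5):
    # Equal-frequency binning: sort indices by value, then map rank -> bin index,
    # so each of the `levels` bins receives roughly the same number of records.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * levels // len(values)
    return bins
```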
28
Results: KDD Cup 99
Comparison of performance for Conditional and Bayes Net methods on KDD Cup dataset
29
Summary
Detecting anomalies by learning a single probability distribution model and computing whole-record likelihoods is problematic:
  High arity leads to detecting rare attribute values.
  The signal in some features gets washed out in the noise of the rest of the features.
  Anomalies highlight mistakes in model learning.
We propose new approaches to solve this:
  Consider all subsets of features up to some size.
  Define r-values, which can indicate anomalies arising from the co-occurrence of highly negatively correlated values.
Empirical results on real data sets demonstrate improved anomaly detection. The time and memory requirements of our algorithms are comparable to those of the baseline methods.
30
Thank You!
Please visit poster board #39
31
Time and Memory Requirements

Table 1: Time and space requirements for the Bayes Network method
Dataset     Training Size  Test Size  #Attributes  Training Time (secs)  Testing Time (secs)  Memory (MB)
CBP         100,000        10,000     10           6.9                   4.7                  4.5
KDD Cup 99  100,000        10,000     41           297                   1.6                  152

Table 2: Time and space requirements for the Conditional and Marginal methods
Dataset     #Attributes  k  Training Time (secs)  Testing Time (secs)  Memory (MB)  Marginal Memory (MB)
CBP         10           1  7.6                   16.8                 337          334
                         2  7.8                   133                  338          340
                         3  9.3                   790                  341          489
KDD Cup 99  41           1  10.2                  15                   323          222
                         2  44                    7145                 332          2618
32
Related Work
Supervised classification-based approaches:
  Decision Trees [Lee et al. '98]
  Neural Network [Ghosh et al. '99]
  SVMs [Li et al. '03; Shon et al. '05]
  Sequence Analysis [Hofmeyr et al. '98; Helman '97]
Unsupervised approaches applied to real-valued data:
  k-NN [Yang and Liu '99]
  Clustering [Eskin '02]
  GMM [Roberts and Tarassenko '94]
33
Estimating Probability Values
Speedup Trick 1: rare values of attributes can be ignored.
  r(a_t, b_t) = ((C(a_t, b_t) + 1) / (N + 2)) / (((C(a_t) + 1) / (N + 2)) ((C(b_t) + 1) / (N + 2)))
             ≥ (C(a_t, b_t) + 1) / (C(a_t) + 1)
             ≥ 1 / (C(a_t) + 1)
             ≥ α when C(a_t) + 1 ≤ 1/α
So r(a_t, b_t) ≥ α whenever C(a_t) + 1 ≤ 1/α.
34
Marginal Anomaly Detection
In test record t, the attribute combination A has value a_t.
  Compute C(a_t), and C(a_i) for all values a_i of A that are rarer than a_t.
For each attribute set A, precompute:
  Histogram function h_A(i): number of values of A that occur i times in the training data.
  Cumulative histogram ch_A(i) = Σ_{j=1}^{i} j · h_A(j): number of records having values of A that occur i times or less in the training data.
Construct a non-reduced AD-Tree on the training data.
  pval(a_t) = ch_A(C(a_t)) / N
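The histogram precomputation can be sketched as below (function names ours; a plain Counter stands in for the AD-Tree count queries):

```python
from collections import Counter

def histograms(column):
    # h(i): number of distinct values occurring exactly i times in training;
    # ch(i) = sum_{j<=i} j*h(j): number of records whose value occurs <= i times.
    counts = Counter(column)
    h = Counter(counts.values())
    ch, running = {}, 0
    for i in range(1, max(h) + 1):
        running += i * h.get(i, 0)
        ch[i] = running
    return counts, ch

def marginal_pval(column, a_t):
    # pval(a_t) = ch(C(a_t)) / N: fraction of records whose value is as rare
    # as or rarer than a_t.
    counts, ch = histograms(column)
    return ch[counts[a_t]] / len(column)
```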
35
Marginal Anomaly Detection
Testing algorithm for record t:
  For each composite attribute A, compute:
    pval(a_t) = ch_A(C(a_t)) / N
  Assign the minimum p-value as the score of record t.
36
Grouping
Group (cluster) the records according to some similarity measure; find clusters in data with an unusual concentration of anomalies.
  k-NN
  k-means (modes)
  GDA (link detection)
Define a collection of groups independent of the data; search over all possible groups to find an unusual concentration of anomalies.
  WSARE
  Scan Stats
37
Grouping
Group (cluster) the records according to some similarity measure; find clusters in data with an unusual concentration of anomalies.
(Dis)similarity measure: Hamming Distance
Generative model: P0(d); PA(d)
[Group chart G: membership of records T1–T5 in groups G1–G6]
N × N Hamming distances (distribution of distances):
        T1  T2  T3  T4  T5
  T1     0   1   3   3   2
  T2     1   0   7   6   4
  T3     3   7   0   6   7
  T4     3   6   6   0   3
  T5     2   4   7   3   0
38
Grouping
Group (cluster) the records according to some similarity measure; find clusters in data with an unusual concentration of anomalies.
(Dis)similarity measure: Hamming Distance
Generative model: P0(d); PA(d); group membership G(i, j) is scored by maximizing P(D(T_i, T_j)) over the indicator I(T_i, T_j ∈ G).
[Group chart and N × N Hamming distance matrix repeated from the previous slide]
39
Grouping
Group (cluster) the records according to some similarity measure; find clusters in data with an unusual concentration of anomalies.
(Dis)similarity measure: Hamming Distance
Likelihood ratio –
  Null hypothesis: attribute values belong to the training distribution.
  Alternate hypothesis: attribute values come from a different distribution.
  LR_G = ∏_{T_i ∈ G} L_A(T_i) / ∏_{T_i ∈ G} L(T_i)
Include records T_i in G to maximize LR_G.
Constraints: include all records within a distance R of the cluster center (leader).
40
Grouping
Define a collection of groups independent of the data; search over all possible groups to find an unusual concentration of anomalies.
  All records with matching value(s) of particular attribute(s).
Detection method:
  Fisher's exact test on the 2×2 table (M: anomalous records, N: total records):
    M_train(A=a)  N_train(A=a)
    M_test(A=a)   N_test(A=a)
  Low values of the ratio: R = (M_train / N_train) / (M_test / N_test)
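A sketch of this detection score, with a pure-stdlib hypergeometric tail standing in for a library routine (in practice `scipy.stats.fisher_exact` performs the same test; the function names here are ours):

```python
from math import comb

def anomaly_ratio(m_train, n_train, m_test, n_test):
    # R = (M_train / N_train) / (M_test / N_test); low R flags groups whose
    # anomaly rate is unusually high in the test period relative to training.
    return (m_train / n_train) / (m_test / n_test)

def fisher_tail(m_train, n_train, m_test, n_test):
    # One-sided Fisher's exact test: probability that m_test or more of the
    # M = m_train + m_test anomalies land among the n_test test records by
    # chance (hypergeometric upper tail; math.comb(n, k) is 0 when k > n).
    M, N = m_train + m_test, n_train + n_test
    return sum(comb(n_test, k) * comb(n_train, M - k)
               for k in range(m_test, min(M, n_test) + 1)) / comb(N, M)
```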
41
Grouping
Detection method:
  Fisher's exact test on the 2×2 table (M: anomalous records, N: total records):
    M_train(A=a)  N_train(A=a)
    M_test(A=a)   N_test(A=a)
  Low values of the ratio: R = (M_train / N_train) / (M_test / N_test)
Examples:
  group no: 1; Score: 3.508139e-01; Train Total: 1065; Train Anom: 19; Test Total: 132; Test Anom: 8; 1:LOS_ANGELES, 7:v4:15+, 9:v4:500000+
  group no: 2; Score: 4.409127e-01; Train Total: 985; Train Anom: 32; Test Total: 105; Test Anom: 15; 0:YANTIAN, 3:MLSL, 4:CHASTINE_MAERSK
  group no: 3; Score: 4.409127e-01; Train Total: 985; Train Anom: 32; Test Total: 105; Test Anom: 15; 0:YANTIAN, 4:CHASTINE_MAERSK
42
Comparisons
Comparison of performance by grouping anomalies
43
Comparisons
WSARE – Fisher's exact test on the 2×2 table of counts only:
    N_train(A=a)  N_train(A≠a)
    N_test(A=a)   N_test(A≠a)
  versus our table of anomaly counts against totals:
    M_train(A=a)  N_train(A=a)
    M_test(A=a)   N_test(A=a)
Scan Statistics – real-valued geographic coordinates.