1
Selecting Features for Intrusion Detection: A Feature Relevance Analysis on KDD 99
Benchmark
H. Güneş Kayacık, Nur Zincir-Heywood, Malcolm I. Heywood
2
Motivation
• Machine learning in detection.
• Raw data → High level events
• Need a set of features
• Not “any” feature, “good” features
• How do we quantify “good”?
3
The Data
• DARPA 98 and 99 datasets.
• Simulated activity.
• Network traffic connection records
• 41 features per connection.
[Chart: connection counts for the largest classes. Labels: DoS1, DoS2, Normal. Values: 107201, 97277, 280790]
4
The Data
• 494,000 connections in dataset.
• 23 Class Labels: 22 Attacks (DoS, probe, content based), “Normal”
• 41 Features (few examples): Duration, Service, Protocol, Data transfer, Failed login attempts, FTP commands, Root shells, “Su” attempts
5
Previous IDS Work
• Decision trees, neural nets, clustering, SVM, EC
• High detection (98%), Low FP (0.5%)
• Some attacks are detected better than others.
• Our task: Substantiate the performance of detectors.
6
Information Gain
• Used in decision trees.
• Which feature leads to the purest branching?
Gain (“Temperature”) = 0.571
Gain (“Windy”) = 0.02
Gain (“Humidity”) = 0.971
From Data Mining Course at KDNuggets site [http://www.kdnuggets.com/dmcourse/data_mining_course]
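The three gain values above can be reproduced with a short sketch. It assumes the figures refer to the five "sunny" rows of the classic weather toy dataset; the slide does not say so explicitly, and the function names `entropy` and `information_gain` are illustrative, not taken from the talk.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on a discrete feature."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Assumed data: the "sunny" subset of the weather toy dataset.
temperature = ["hot", "hot", "mild", "cool", "mild"]
humidity = ["high", "high", "high", "normal", "normal"]
windy = [False, True, False, False, True]
play = ["no", "no", "no", "yes", "yes"]

for name, column in [("Temperature", temperature), ("Windy", windy), ("Humidity", humidity)]:
    print(f"Gain({name}) = {information_gain(column, play):.3f}")
# Gain(Temperature) = 0.571
# Gain(Windy) = 0.020
# Gain(Humidity) = 0.971
```

Humidity gives the purest branching here: each of its two values leaves a single class, so its gain equals the full entropy of the labels.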
7
Methodology
• Classes: 22 Attacks + 1 Normal
• Binary classification (Why?)
• 23 Info. Gains per feature (vs. 1 Info. Gain per feature)
Example, with the binary label for Class A:
Features          Class      For Class A
1, 0.5, 90, 8     Class A    1
3, 0.01, 7, 9     Class B    0
2, 0.1, 7, 10     Class A    1
5, 0.2, 10, 1     Class C    0
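The per-class scheme on this slide can be sketched as follows, using the four example rows. This is a minimal illustration, not the authors' code: it treats every feature value as a discrete symbol (the slide does not say how continuous features were handled), and `binarize` is a hypothetical helper name.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Information gain of a discrete feature with respect to the labels."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def binarize(classes, target):
    """One-vs-rest labels: 1 where the row belongs to `target`, else 0."""
    return [1 if c == target else 0 for c in classes]


# The four example rows from the slide (four features each).
rows = [(1, 0.5, 90, 8), (3, 0.01, 7, 9), (2, 0.1, 7, 10), (5, 0.2, 10, 1)]
classes = ["A", "B", "A", "C"]

columns = list(zip(*rows))  # one tuple of values per feature

# One list of per-feature gains for every class, instead of one gain per feature.
gains = {
    target: [information_gain(col, binarize(classes, target)) for col in columns]
    for target in sorted(set(classes))
}

print(binarize(classes, "A"))  # [1, 0, 1, 0]
for target, per_feature in gains.items():
    print(target, [round(g, 3) for g in per_feature])
```

With 23 classes and 41 features the same loop would yield 23 gains per feature, which is what allows relevance to be reported class by class.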
8
Max. Information Gain
• Some relevant, some not
• Features 20 and 21
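Taking the maximum gain over all classes for each feature can be sketched as below. The gain values and class names are invented placeholders for illustration only, not results from the study, and `max_gain_per_feature` is a hypothetical helper name.

```python
def max_gain_per_feature(gains):
    """For each feature index, return (class with the highest gain, that gain)."""
    n_features = len(next(iter(gains.values())))
    result = []
    for j in range(n_features):
        best_class = max(gains, key=lambda c: gains[c][j])
        result.append((best_class, gains[best_class][j]))
    return result


# Placeholder numbers (NOT from the talk): gains[class][feature_index].
gains = {
    "class_a": [0.60, 0.00, 0.10],
    "class_b": [0.05, 0.00, 0.45],
    "class_c": [0.20, 0.00, 0.02],
}

for j, (best_class, gain) in enumerate(max_gain_per_feature(gains)):
    status = "no measured relevance" if gain == 0 else f"most relevant to {best_class}"
    print(f"feature {j}: max gain {gain:.2f} ({status})")
```

A feature whose maximum gain is zero carries no information about any class under this measure.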
9
For each class…
[Chart: Info. Gain (axis from 0 to 1) for each class: back, buffer_overflow, ftp_write, guess_passwd, imap, ipsweep, land, loadmodule, multihop, neptune, nmap, normal, perl, phf, pod, portsweep, rootkit, satan, smurf, spy, teardrop, warezclient, warezmaster]
• Neptune (DoS) + smurf (DoS) + normal = 98%
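The 98% figure can be checked with a quick calculation, assuming the three counts shown on the data slide (107201, 97277, 280790) belong to neptune, smurf and normal in some order, and using the rounded total of 494,000 connections.

```python
# Counts from the data slide; which count belongs to which class does not affect the sum.
major_class_counts = [107201, 97277, 280790]
total_connections = 494_000  # rounded total quoted on the slides

share = sum(major_class_counts) / total_connections
print(f"{share:.1%}")  # 98.2%
```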
10
Relevant Classes
[Chart: values per class. Classes: normal, smurf, neptune, land, teardrop, ftp_write, back, guess_pwd, buffer_overflow, warezclient. Values as extracted: 11, 10, 10, 1, 1, 12, 1, 1, 1]
• 31/41 most relevant for 3 major classes.
• 9 features contributed very little.
• Relevant Features: Connection Size, Diff. Service Rate, Connection state
11
Conclusions
• Relevance analysis on KDD 99 dataset.
• Relevance → Information gain.
• Key Points:
Easy to classify 3 major classes.
Few features highly useful.
Few features completely useless.
• New measures and extended analysis.
12
Thank You!
• You can find more information about our research at: www.cs.dal.ca/projectx.