on the study of anomaly-based spam filtering using spam as representation of normality - ccnc - rsw...
DESCRIPTION
Presentation at CCNC's Research Student Workshop 2012 of the paper: On the Study of Anomaly-based Spam Filtering Using Spam as Representation of Normality
TRANSCRIPT
Carlos Laorden
WHAT YOU GOT, THEN?
SPAM, EGG, SPAM, SPAM, BACON AND SPAM.
SPAM, SPAM, SPAM, BAKED BEANS AND SPAM.
ANYTHING WITHOUT SPAM?
I DON’T LIKE SPAM!!
UGH!
Meet the real SPiced hAM
Monty Python’s Flying Circus
Something that repeats and repeats until it becomes annoying
It is a real problem for Information Security
Billions lost daily in productivity
Infected computers
Stolen credentials
We must fight
Anti-spam methods
Pre-sending
New protocols
Post-sending
Increase sending costs
Increase risks for spammers
sender
content
content
Usually supervised approaches
Significant labelling work is needed
But, is this possible?
I mean, is this possible...
YES
Anomaly Detection
this word has no interest
SpamAssassin
Ling Spam
[Figure: documents D1 to D11 plotted as points in a vector space with term axes t1, t2 and t3, plus two "?" labels]
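The vector space picture above (documents as points along term axes) can be sketched as follows. This is a minimal illustration, not the paper's pipeline: it uses raw term counts, since the slides do not state the weighting scheme, and the stop-word list, function names and sample messages are all hypothetical.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"this", "has", "no", "the", "a", "is"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_vocabulary(documents):
    """Sorted list of every term that survives tokenization."""
    return sorted({term for doc in documents for term in tokenize(doc)})

def vectorize(text, vocabulary):
    """Map a message to raw term counts, one dimension per vocabulary term."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

# Made-up messages standing in for e-mails from a corpus.
docs = ["cheap pills cheap offer", "meeting agenda for the offer"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
```

Each message becomes one point whose coordinates are its term counts, which is what makes distances between messages meaningful.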
Anomaly detection
Is d > threshold?
Manhattan distance
Euclidean distance
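The two distance measures named on this slide, and the comparison against a threshold, can be sketched like this. The vectors and the threshold value are made up for illustration.

```python
import math

def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two made-up term vectors of equal length.
u, v = [0, 2, 1], [3, 2, 5]

threshold = 6  # hypothetical value; the slides do not give one
exceeds_manhattan = manhattan(u, v) > threshold
exceeds_euclidean = euclidean(u, v) > threshold
```

The same pair of vectors can fall on different sides of a threshold depending on the measure, which is one reason to evaluate both.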
Anomaly detection
d?
Minimum distance
Maximum distance
Mean distance
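A new message has one distance to every known message, so a single value d has to be chosen from them. A minimal sketch of the three combination rules listed here, assuming Euclidean distance and made-up vectors (the names `COMBINERS` and `deviation` are illustrative, not from the paper):

```python
import math
from statistics import mean

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# The three combination rules named on the slide.
COMBINERS = {"minimum": min, "maximum": max, "mean": mean}

def deviation(sample, known, rule="mean", distance=euclidean):
    """Reduce the distances from `sample` to every known vector to one value."""
    distances = [distance(sample, k) for k in known]
    return COMBINERS[rule](distances)

# Made-up set of known vectors and a sample that coincides with the first one.
known = [[0, 0], [3, 4], [6, 8]]
sample = [0, 0]
```

The minimum rule asks how close the nearest known message is, the maximum rule how far the farthest one is, and the mean rule averages over the whole known set.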
Minimum distance
Maximum distance
Mean distance
Manhattan distance
Euclidean distance
10 different thresholds
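The slides mention 10 different thresholds but not how they were chosen. One possible way to build such a grid, assumed here purely for illustration, is to space the thresholds evenly between the smallest and largest observed deviation:

```python
def threshold_grid(scores, count=10):
    """Evenly spaced thresholds spanning the observed scores (an assumption)."""
    low, high = min(scores), max(scores)
    step = (high - low) / (count - 1)
    return [low + i * step for i in range(count)]

def flagged(scores, threshold):
    """True for every score that exceeds the threshold."""
    return [s > threshold for s in scores]

# Made-up deviation scores for four messages.
scores = [0.0, 1.0, 4.0, 9.0]
grid = threshold_grid(scores)
flags_per_threshold = [sum(flagged(scores, t)) for t in grid]
```

Sweeping the grid shows how the number of flagged messages shrinks as the threshold grows, which is the trade-off each configuration has to settle.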
Anomaly detection
d < threshold
d > threshold
What do we get?
Detects more than 93% of junk emails
Less than 5% of misclassified legitimate emails
Detects more than 91% of junk emails
An improvable 8% of misclassified legitimate emails
Suitable to overcome the amount of unclassified spam e-mails
More?
Minimum distance
Maximum distance
Mean distance
You have new e-mail?
Legitimate? Really?
What is the anomaly?
Anomaly
Normality
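With spam as the representation of normality, the roles flip: a message close to known spam is "normal", and a message far from it is the anomaly. A minimal sketch, assuming Manhattan distance with the minimum rule and reading the anomaly as a legitimate message (the vectors, the threshold and the `classify` name are hypothetical):

```python
def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def classify(sample, spam_vectors, threshold):
    """Label a message by its distance to the nearest known spam vector."""
    nearest = min(manhattan(sample, s) for s in spam_vectors)
    # Far from every known spam message: the anomaly, assumed legitimate here.
    return "legitimate" if nearest > threshold else "spam"

# Made-up term vectors for known spam.
spam_vectors = [[2, 0, 1], [3, 1, 0]]

close_label = classify([2, 0, 1], spam_vectors, threshold=2)
far_label = classify([0, 5, 4], spam_vectors, threshold=2)
```

Only spam samples are needed to build the known set, so no legitimate mail has to be labelled in this setup.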
Results

SpamAssassin
            Manhattan                      Euclidean
            Prec.     Rec.      F-Meas.    Prec.     Rec.      F-Meas.
Mean        3.83%     100%      7.37%      3.83%     100%      7.37%
Maximum     3.98%     67.92%    7.53%      5.23%     35.63%    9.13%
Minimum     16.48%    12.50%    14.22%     58.73%    15.42%    24.42%

Ling Spam
            Manhattan                      Euclidean
            Prec.     Rec.      F-Meas.    Prec.     Rec.      F-Meas.
Mean        8.37%     100%      15.45%     8.37%     100%      15.45%
Maximum     8.37%     100%      15.45%     20.75%    56.59%    30.37%
Minimum     56.88%    23.10%    32.86%     71.58%    40.51%    51.74%
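The F-Meas. columns are consistent with the usual harmonic mean of precision and recall. A short sketch that recomputes it from two rows of the tables; because the inputs are the rounded percentages shown, the result may differ from the printed value in the last digit:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Ling Spam, Euclidean distance, minimum rule (values taken from the table).
ling_min = f_measure(0.7158, 0.4051)

# SpamAssassin, Euclidean distance, minimum rule (values taken from the table).
sa_min = f_measure(0.5873, 0.1542)
```

A recall of 100% paired with a precision below 10%, as in the Mean rows, still gives a low F-measure because the harmonic mean is dominated by the smaller term.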
Anomaly
Normality