A False Positive Safe Neural Network for Spam Detection
Alexandru Catalin Cosoi
Does this look familiar?
Anatrim
Oh boy, it’s getting worse!!!
Oh boy, it’s getting worse!!!
Bad Bad Spammer!!!
• Databases:
– D: Random legitimate text
– D1: Different rephrases of a certain spam phrase
– D2: Different rephrases of another spam phrase
– …………………
– Dn: Different rephrases of another spam phrase
• Create spam message script:
– Choose a random phrase from D1
– Choose random text from D
– Choose a random phrase from D2
– Choose random text from D
– …………….
– Choose a random phrase from Dn
• Send message.
• 40 samples of different subjects
• 50 samples of different titles
• 30 samples of different titles (part II)
• 60000 different combinations
Appeared as a consequence of botnets
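The script on this slide can be sketched in a few lines of Python. This is only an illustration of the template idea, not the spammers' actual tool: the function names `build_spam_message` and `count_combinations` and the toy databases are hypothetical. It also checks the arithmetic behind the combination count, since 40 × 50 × 30 = 60000.

```python
import math
import random


def build_spam_message(filler_db, phrase_dbs, rng):
    """Interleave one random phrase from each of D1..Dn with random text from D."""
    parts = []
    for index, phrase_db in enumerate(phrase_dbs):
        parts.append(rng.choice(phrase_db))      # random phrase from Di
        if index < len(phrase_dbs) - 1:
            parts.append(rng.choice(filler_db))  # random legitimate text from D
    return " ".join(parts)


def count_combinations(sample_counts):
    """Number of distinct messages when one sample is picked from each database."""
    return math.prod(sample_counts)


# Toy databases (hypothetical content, for illustration only).
D = ["the weather was fine that day", "he walked slowly to the station"]
D1 = ["lose weight fast", "drop the pounds quickly"]
D2 = ["order today", "buy it now"]

print(build_spam_message(D, [D1, D2], random.Random(42)))
print(count_combinations([40, 50, 30]))
```

Because every message is assembled from a different combination, a filter that memorizes whole messages sees almost no repeats, which is why the slides move on to weak, reusable features.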
Features
• Larger time frame – KeyWord!!!!
• Weak features
– Words like “Anatrim”, “Viagra”, “Xanax”, “Stock”
– Simple word combinations like “Stock alert”, “Strong buy”
– Simple header heuristics (for both spam and ham) like: valid reply, weird message id, forged headers
• Example:
– Top 500 spammy words from a Bayesian dictionary
– Some simple header heuristics from SpamAssassin’s SARE Ninjas
– Trainer’s personal flavour
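As a rough sketch of how such weak features could be turned into a binary input vector, assuming a tiny word list and a single header check: the names `SPAMMY_WORDS`, `WORD_COMBINATIONS`, and `extract_features` are hypothetical, and the "weird message id" test below is an invented stand-in, not one of the actual SARE rules.

```python
import re

# Hypothetical, deliberately tiny feature lists taken from the examples on the slide.
SPAMMY_WORDS = ["anatrim", "viagra", "xanax", "stock"]
WORD_COMBINATIONS = ["stock alert", "strong buy"]

# Assumption: a "normal" Message-ID looks like <local@domain>.
MESSAGE_ID_PATTERN = re.compile(r"<[^<>@\s]+@[^<>@\s]+>")


def extract_features(subject, body, headers):
    """Return a list of 0/1 values, one per weak feature."""
    words = re.findall(r"[a-z0-9]+", f"{subject} {body}".lower())
    tokens = set(words)
    normalized = " " + " ".join(words) + " "

    features = [int(word in tokens) for word in SPAMMY_WORDS]
    features += [int(f" {combo} " in normalized) for combo in WORD_COMBINATIONS]

    # Header heuristic: 1 when the Message-ID is missing or malformed.
    message_id = headers.get("Message-ID", "")
    features.append(int(MESSAGE_ID_PATTERN.fullmatch(message_id) is None))
    return features


print(extract_features("Stock alert!", "Buy Anatrim today",
                       {"Message-ID": "<abc@example.com>"}))
```

No single bit is decisive here; the point of the slide is that many weak signals together give the network something stable to cluster on.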
Why ART?
• Training occurs by modifying the weights of each neuron
• For large amounts of data, forgetting important details might actually happen
• Solves the stability-plasticity dilemma
• Based on template detection
• Unlimited number of templates involves unlimited number of patterns
• 2 self-organizing neural networks + a mapping module = supervised organizing neural network
Adaptive Resonance Theory
• Similar to a clustering algorithm (as many clusters as needed)
• ARTMAP = ARTa + ARTb + MapField
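To make the ARTa + ARTb + MapField composition concrete, here is a heavily simplified sketch, not the network used in the talk. It assumes inputs are sets of active binary features, reduces ARTb to plain class labels, and stores the map field as a list from ARTa category to label. The class name `SimpleArtmap` and its parameters are hypothetical.

```python
class SimpleArtmap:
    """Toy ARTMAP-style classifier over sets of active binary features."""

    def __init__(self, base_vigilance=0.5, alpha=0.001):
        self.base_vigilance = base_vigilance
        self.alpha = alpha      # small positive constant in the choice function
        self.templates = []     # ARTa side: one feature set per category
        self.labels = []        # map field: ARTa category index -> class label

    def _ranked(self, active):
        # Order categories by the ART choice function |I & w| / (alpha + |w|).
        def choice(j):
            template = self.templates[j]
            return len(active & template) / (self.alpha + len(template))
        return sorted(range(len(self.templates)), key=choice, reverse=True)

    @staticmethod
    def _match(active, template):
        # Fraction of the input's active features that the template also has.
        return len(active & template) / len(active) if active else 0.0

    def train(self, sample, label):
        active = frozenset(sample)
        vigilance = self.base_vigilance
        for j in self._ranked(active):
            match = self._match(active, self.templates[j])
            if match < vigilance:
                continue
            if self.labels[j] == label:
                # Resonance: shrink the template to the shared features.
                self.templates[j] = active & self.templates[j]
                return j
            # Match tracking: wrong label, so demand a strictly better match.
            vigilance = match + 1e-9
        self.templates.append(active)
        self.labels.append(label)
        return len(self.templates) - 1

    def predict(self, sample):
        active = frozenset(sample)
        for j in self._ranked(active):
            if self._match(active, self.templates[j]) >= self.base_vigilance:
                return self.labels[j]
        return None  # no template is close enough


model = SimpleArtmap(base_vigilance=0.5)
model.train({"anatrim", "buy", "now"}, "spam")
model.train({"meeting", "notes", "team"}, "ham")
print(model.predict({"anatrim", "buy"}))
```

New categories are only added when no existing template both matches and agrees on the label, which is the "as many clusters as needed" behaviour mentioned above.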
ART Vigilance
Small value - Imprecise; Big value - Fragmented
• A big value: Accepts small errors; Many small clusters; High precision
• A small value: Accepts high errors; A few big clusters; Errors can appear
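The effect of vigilance can be seen with a minimal ART1-style clustering sketch. It assumes binary inputs given as sets of active feature indices and fast learning (a template shrinks to the intersection with each accepted input); the function name `art_cluster` and the toy samples are hypothetical.

```python
def art_cluster(samples, vigilance, alpha=0.001):
    """Cluster sets of active features; return (templates, labels)."""
    templates = []
    labels = []
    for sample in samples:
        active = frozenset(sample)

        # Try categories in order of the choice function |I & w| / (alpha + |w|).
        def choice(j):
            return len(active & templates[j]) / (alpha + len(templates[j]))

        for j in sorted(range(len(templates)), key=choice, reverse=True):
            overlap = active & templates[j]
            # Vigilance test: enough of the input must be covered by the template.
            if active and len(overlap) / len(active) >= vigilance:
                templates[j] = overlap
                labels.append(j)
                break
        else:
            # No template passed the test, so open a new cluster.
            templates.append(active)
            labels.append(len(templates) - 1)
    return templates, labels


samples = [{0, 1, 2}, {0, 1, 3}, {4, 5, 6}, {4, 5, 7}]
print(art_cluster(samples, vigilance=0.5)[1])  # few, broad clusters
print(art_cluster(samples, vigilance=0.9)[1])  # many, narrow clusters
```

On these four samples a vigilance of 0.5 merges each near-duplicate pair into one cluster, while 0.9 keeps every sample in its own cluster.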
ART ++
Algorithm
Corpus
• 2.5 million spam messages (sampled from waves with a high degree of variation) and around 1000 simple low relevance text heuristics (not counting the standard header heuristics).
• The first 1000 words (ordered by discrimination, but with a minimum of 10-30 hundred occurrences) from a Bayesian dictionary trained on this corpus, and also standard header heuristics.
• Almost 1 million legitimate email messages
• 75% of the message corpus was used for training the neural network, and
• 25% was used for testing the neural network.
• 1.5 days to train!!!!
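One way to pick "the first N words ordered by discrimination, with a minimum number of occurrences" is sketched below. The slides do not define the discrimination score, so this example assumes it is the distance of a Graham-style spam probability from 0.5; the function name `top_discriminating_words` and the toy messages are hypothetical.

```python
from collections import Counter


def top_discriminating_words(spam_docs, ham_docs, limit, min_count):
    """Rank words by |P(spam | word) - 0.5|, ignoring rare words."""
    # Count each word once per message (document frequency).
    spam_counts = Counter(w for doc in spam_docs for w in set(doc.lower().split()))
    ham_counts = Counter(w for doc in ham_docs for w in set(doc.lower().split()))

    scored = []
    for word in spam_counts.keys() | ham_counts.keys():
        s, h = spam_counts[word], ham_counts[word]
        if s + h < min_count:
            continue  # too rare to be trusted
        spam_rate = s / len(spam_docs)
        ham_rate = h / len(ham_docs)
        p_spam = spam_rate / (spam_rate + ham_rate)
        scored.append((abs(p_spam - 0.5), word))

    # Highest score first; ties broken alphabetically for a stable result.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [word for _, word in scored[:limit]]


spam = ["buy anatrim now", "anatrim stock alert", "hello anatrim"]
ham = ["hello team meeting", "meeting notes hello", "hello stock report"]
print(top_discriminating_words(spam, ham, limit=2, min_count=2))
```

Under this assumed score, strongly "hammy" words rank as high as strongly "spammy" ones; a spam-only list would filter on `p_spam > 0.5` as well.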
Results
• FP: 1% 0.0001%
• FN: 4% 20%
• On some corpora (TREC 2006) we had … not so great results (but current heuristics)
• FN: 35%
• FP: 2 email messages!
• At least, just a few false positives!
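For reference, the two error rates can be computed as below. The slides do not state the denominators, so this sketch assumes the common convention: FP rate is the share of legitimate messages flagged as spam, and FN rate is the share of spam messages that get through. The function name `error_rates` and the toy labels are hypothetical and are not the results reported above.

```python
def error_rates(actual, predicted):
    """Return (false_positive_rate, false_negative_rate) for 'spam'/'ham' labels."""
    pairs = list(zip(actual, predicted))
    ham_total = sum(1 for a, _ in pairs if a == "ham")
    spam_total = sum(1 for a, _ in pairs if a == "spam")
    false_positives = sum(1 for a, p in pairs if a == "ham" and p == "spam")
    false_negatives = sum(1 for a, p in pairs if a == "spam" and p == "ham")

    fp_rate = false_positives / ham_total if ham_total else 0.0
    fn_rate = false_negatives / spam_total if spam_total else 0.0
    return fp_rate, fn_rate


# Toy labels for illustration only.
actual = ["ham"] * 4 + ["spam"] * 5
predicted = ["ham", "ham", "ham", "spam", "spam", "spam", "spam", "ham", "ham"]
print(error_rates(actual, predicted))
```

A "false positive safe" filter accepts a higher FN rate in exchange for pushing the FP rate as close to zero as possible, which is the trade shown on this slide.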
Conclusions
• ART + Simple Features + Spam = Love
• ART + False Positives + Spam = OMG!!!
• (ART++) = Heuristic Filter + ARTMAP
• Must use a lot of email messages. It is highly difficult to find representative samples for individual waves.
• Can also be applied to other neural networks
• Interesting PowerPoint template…
Thanks
QUESTIONS?