A Neural Network Classifier for Junk E-Mail
Ian Stuart, Sung-Hyuk Cha, and Charles Tappert
CSIS Student/Faculty Research Day
May 7, 2004
Fighting spam
Several commercial applications exist
– Server-side: expensive
– Client-side: time-consuming
No approach is 100% effective
– Spammers are aggressive and adaptable
– Best solutions are typically hybrids of different approaches and criteria
Common approaches
Simple filters
– Common words or phrases
– Unusual punctuation or capitalization
Blacklisting: “just say NO” (if you can)
– Reject e-mail from known spammers
Whitelisting: “friends only, please”
– Accept e-mail only from known correspondents
Classifiers: examine each e-mail and decide
– Only a few publications on spam classifiers
Naïve Bayesian classifiers
Used in commercial classifiers
Assumes recognition features are independent
– Max likelihood = product of likelihoods of features
E-mail classifier: examines each word
– Training assigns a probability to each word
– Look up each word/probability in a dictionary
– If the product of the probabilities exceeds a given threshold, it is spam
Challenge: creating the “dictionary”
We compare our Neural Network against two published Naïve Bayesian classifiers
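
The dictionary-lookup idea above can be sketched as follows. This is a minimal illustration of a standard naïve Bayes formulation, not the method of either published classifier: the per-word likelihoods, the `UNSEEN` default, the prior, and the 0.9 threshold are all assumed values. Logarithms are summed rather than multiplying raw probabilities, which gives the same decision while avoiding numeric underflow on long messages.

```python
import math

# Hypothetical dictionaries learned in training: P(word | class).
# All values are illustrative only, not taken from any study.
SPAM_LIKELIHOOD = {"guarantee": 0.20, "refinance": 0.15, "tennis": 0.01, "schedule": 0.02}
LEGIT_LIKELIHOOD = {"guarantee": 0.02, "refinance": 0.01, "tennis": 0.10, "schedule": 0.12}
UNSEEN = 0.05  # assumed likelihood for words missing from the dictionary


def spam_probability(words, prior_spam=0.5):
    """Return P(spam | words) under the naive independence assumption."""
    log_spam = math.log(prior_spam)
    log_legit = math.log(1.0 - prior_spam)
    for word in words:
        # Independence: the joint likelihood is the product of word likelihoods,
        # so the log-likelihood is the sum of their logs.
        log_spam += math.log(SPAM_LIKELIHOOD.get(word, UNSEEN))
        log_legit += math.log(LEGIT_LIKELIHOOD.get(word, UNSEEN))
    # Normalize the two joint scores into a posterior probability.
    return 1.0 / (1.0 + math.exp(log_legit - log_spam))


def is_spam(words, threshold=0.9):
    """Flag the message when the posterior exceeds the (assumed) threshold."""
    return spam_probability(words) > threshold
```

The sketch also makes the issues on the next slide concrete: the result depends directly on which words are in the dictionary, on the value given to unseen words, and on the threshold.
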
Naïve Bayesian classifier issues
How many features (words), which ones?
How is degradation avoided as spammers’ vocabulary changes?
What values are assigned to new words?
What are the thresholds?
How to avoid “sabotage” of classifier?
Which one isn’t spam? (subject headers)
5 Be a mighty warrior in bed! vcrhwt ygjztyjjh
Money Back Guarantee_HGH
kindle life pddez liw mzac
v a l i u m - D i a z e p a m used to relieve anxiety
Fairfield tennis schedule
:Dramatic E,nhancement fo=r .Men = f"fumqid
,Refina'nce now. Don't wait
Spammers make patterns
The more they try to hide, the easier it is to see them
Therefore, we use common spammer patterns (instead of vocabulary) as features for classification
Learn these patterns with a Neural Network
Neural Network features
Total of 17 features
– 6 from the subject header
– 2 from priority and content-type headers
– 9 from the e-mail body
Features from subject header
1. Number of words with no vowels
2. Number of words with at least two of letters J, K, Q, X, Z
3. Number of words with at least 15 characters
4. Number of words with non-English characters, special characters such as punctuation, or digits at beginning or middle of word
5. Number of words with all letters in uppercase
6. Binary feature indicating 3 or more repeated characters
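
One way the six subject-header features might be computed is sketched below. The slides do not give exact parsing rules, so several choices here are assumptions: words are split on whitespace, "y" is not counted as a vowel, feature 2 counts occurrences (so "jj" counts as two), feature 4 ignores the last character of a word so ordinary trailing punctuation is not penalized, and feature 6 looks for three identical consecutive characters. The function name `subject_features` is hypothetical.

```python
import re

VOWELS = set("aeiou")        # assumption: "y" is not treated as a vowel
RARE_LETTERS = set("jkqxz")


def subject_features(subject):
    """Return the six subject-header features as a list of integers."""
    no_vowels = rare = long_words = odd_chars = all_caps = 0
    for word in subject.split():  # assumption: whitespace-delimited words
        letters = [c for c in word.lower() if c.isalpha()]
        # 1. Words that contain letters but no vowels.
        if letters and not VOWELS.intersection(letters):
            no_vowels += 1
        # 2. Words with at least two occurrences of J, K, Q, X, Z.
        if sum(c in RARE_LETTERS for c in letters) >= 2:
            rare += 1
        # 3. Words with at least 15 characters.
        if len(word) >= 15:
            long_words += 1
        # 4. Non-letter or non-ASCII character at the beginning or middle.
        head = word[:-1] if len(word) > 1 else word
        if any(not (c.isascii() and c.isalpha()) for c in head):
            odd_chars += 1
        # 5. Words whose letters are all uppercase.
        if word.isupper():
            all_caps += 1
    # 6. Binary: any non-space character repeated three times in a row.
    repeated = 1 if re.search(r"(\S)\1\1", subject) else 0
    return [no_vowels, rare, long_words, odd_chars, all_caps, repeated]
```

Under these assumptions, the first example subject from the earlier slide yields `[2, 1, 0, 1, 0, 0]`, while "Fairfield tennis schedule" yields all zeros.
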
Features from priority and content-type headers
1. Binary feature indicating whether the priority had been set to any level besides normal or medium
2. Binary feature indicating whether a content-type header appeared within the message headers or whether the content type had been set to “text/html”
Features from message body
1. Proportion of alphabetic words with no vowels and at least 7 characters
2. Proportion of alphabetic words with at least two of letters J, K, Q, X, Z
3. Proportion of alphabetic words at least 15 characters long
4. Binary feature indicating whether the strings “From:” and “To:” were both present
5. Number of HTML opening comment tags
6. Number of hyperlinks (“href=”)
7. Number of clickable images represented in HTML
8. Binary feature indicating whether a text color was set to white
9. Number of URLs in hyperlinks with digits or “&”, “%”, or “@”
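
A rough sketch of how some of the body features could be extracted is given below. It covers features 1 to 6 and 8 only; clickable images and URL inspection are left out. The parsing rules are assumptions: an "alphabetic word" is taken to be a maximal run of ASCII letters (so HTML tag names are counted as words too), and "white" is detected by a simple pattern that would also match a white background color. The function name and dictionary keys are hypothetical.

```python
import re

# Assumed pattern for a color set to white, e.g. color="#FFFFFF" or color: white.
WHITE_COLOR = re.compile(r"""color\s*[=:]\s*["']?\s*(?:white|#fff(?:fff)?)\b""")


def body_features(body):
    """Return a subset of the message-body features as a dictionary."""
    words = re.findall(r"[A-Za-z]+", body)  # assumption: runs of ASCII letters
    total = len(words) or 1                 # avoid dividing by zero on empty bodies
    no_vowel_long = sum(
        1 for w in words if len(w) >= 7 and not re.search(r"[aeiou]", w, re.I)
    )
    rare = sum(1 for w in words if len(re.findall(r"[jkqxz]", w, re.I)) >= 2)
    long_words = sum(1 for w in words if len(w) >= 15)
    lower = body.lower()
    return {
        "prop_no_vowel_7plus": no_vowel_long / total,
        "prop_rare_letters": rare / total,
        "prop_15plus": long_words / total,
        "has_from_and_to": int("From:" in body and "To:" in body),
        "html_comment_opens": body.count("<!--"),
        "hyperlinks": lower.count("href="),
        "white_text": int(bool(WHITE_COLOR.search(lower))),
    }
```

A production parser would strip markup before counting words and would use a real HTML parser rather than string matching; the sketch only shows the shape of the feature vector.
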
Neural Network spam classifier
3-layer, feed-forward network (Perceptron)
– 17 input units, variable # hidden layer units, 1 output unit
Data: 1,654 e-mails (854 spam, 800 legitimate)
Use half of each (spam/non-spam) for training, the other half for testing
Test with variations of hidden nodes (4 to 14) and epochs (100 to 500)
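
The network shape described above (17 inputs, one hidden layer, 1 output) can be sketched as a forward pass in plain Python. The slides do not state the activation function, weight initialization, or training procedure, so the sigmoid units, the uniform initial weights, and the class name `ThreeLayerNet` are assumptions; training (backpropagation over 100 to 500 epochs) is not shown.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


class ThreeLayerNet:
    """n_inputs -> n_hidden sigmoid units -> 1 sigmoid output (untrained)."""

    def __init__(self, n_inputs=17, n_hidden=12, seed=0):
        rng = random.Random(seed)
        self.n_inputs = n_inputs
        # Each row holds one unit's input weights plus a trailing bias term.
        self.hidden = [
            [rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
            for _ in range(n_hidden)
        ]
        self.output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, features):
        """Return a score in (0, 1); higher is read here as more spam-like."""
        if len(features) != self.n_inputs:
            raise ValueError("expected %d features" % self.n_inputs)
        hidden_out = [
            sigmoid(sum(w * x for w, x in zip(row, features)) + row[-1])
            for row in self.hidden
        ]
        total = sum(w * h for w, h in zip(self.output, hidden_out))
        return sigmoid(total + self.output[-1])
```

Varying `n_hidden` from 4 to 14 reproduces the range of architectures tested; each hidden unit adds 18 weights (17 inputs plus bias) and one output weight.
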
Definitions used for classifier success measures
$n_{SS}$ = number of spam classified as spam
$n_{SL}$ = number of spam classified as legitimate
$n_{LL}$ = number of legitimate classified as legitimate
$n_{LS}$ = number of legitimate classified as spam
Measure of success: precision
Precision: the percentage of labeled spam/legitimate e-mail correctly classified
$$\text{Spam precision (SP)} = \frac{\text{correct spams}}{\text{all labeled spam}} = \frac{n_{SS}}{n_{SS} + n_{LS}}$$
$$\text{Legitimate precision (LP)} = \frac{\text{correct legitimate}}{\text{all labeled legitimate}} = \frac{n_{LL}}{n_{LL} + n_{SL}}$$
Measure of success: accuracy
Accuracy (labeled recall in the formulas): the percentage of actual spam/legitimate e-mail correctly classified
$$\text{Spam recall (SR)} = \frac{\text{correct spams}}{\text{all actual spam}} = \frac{n_{SS}}{n_{SS} + n_{SL}}$$
$$\text{Legitimate recall (LR)} = \frac{\text{correct legitimate}}{\text{all actual legitimate}} = \frac{n_{LL}}{n_{LL} + n_{LS}}$$
Neural Network results
Best overall results with 12 hidden nodes at 500 epochs
– Spam Precision: 92.45%
– Legitimate Precision: 91.32%
– Spam Accuracy: 91.80%
– Legitimate Accuracy: 92.00%
35 spams misclassified: 8.20%
32 legitimates misclassified: 8.00%
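
As a consistency check, the four reported percentages can be recomputed from the precision and accuracy definitions. The raw counts below are inferred rather than quoted: the test set is assumed to be exactly half of each class (427 spam, 400 legitimate), and the 35 and 32 misclassifications are taken as $n_{SL}$ and $n_{LS}$. The function name `success_measures` is hypothetical.

```python
def success_measures(n_ss, n_sl, n_ll, n_ls):
    """Precision and accuracy (recall) for both classes, as fractions."""
    return {
        "spam_precision": n_ss / (n_ss + n_ls),   # correct spam / all labeled spam
        "legit_precision": n_ll / (n_ll + n_sl),  # correct legit / all labeled legit
        "spam_accuracy": n_ss / (n_ss + n_sl),    # correct spam / all actual spam
        "legit_accuracy": n_ll / (n_ll + n_ls),   # correct legit / all actual legit
    }


# Inferred counts (assumption: the test set is half of 854 spam and 800 legitimate).
n_sl, n_ls = 35, 32
n_ss = 854 // 2 - n_sl  # 392 spam correctly classified
n_ll = 800 // 2 - n_ls  # 368 legitimate correctly classified

measures = success_measures(n_ss, n_sl, n_ll, n_ls)
for name, value in measures.items():
    print(f"{name}: {value:.2%}")
```

With these counts the printed values are 92.45%, 91.32%, 91.80%, and 92.00%, matching the figures reported on the slide.
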
Misclassified e-mails
Most spam misclassified as legitimate were short in length, with few hyperlinks
Most legitimate e-mails misclassified as spam had unusual features for personal e-mail (that is, they were “spam-like” in appearance)
Comparing Neural Network and Naïve Bayesian Classifiers
Accuracy of the NN classifier is comparable to that reported for Naïve Bayesian classifiers
NN classifier required fewer features (17 versus 100 in one study and 500 in another)
NN classifier uses descriptive qualities of words and messages similar to those used by human readers
Blacklisting Experiment
Manually entered IP addresses of e-mails incorrectly tagged by the NN classifier
– Entered first (original) IP address and, when present, second IP address (e.g., mail server or ISP)
Entered them into a website that sends IP addresses to 173 working spam blacklists and returns the number of hits, http://www.declude.com/junkmail/support/ip4r.htm
Counted only hit counts greater than one as spam, since single-list hits were considered to be anomalies
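
The counting rule above amounts to a one-line decision. In this sketch, `hit_counts` holds the number of blacklists (out of 173) that listed each IP address checked for one e-mail. How the first and second addresses were combined is not stated on the slide; treating the e-mail as spam when either address has more than one hit is an assumption, as is the function name.

```python
def blacklist_says_spam(hit_counts, min_hits=2):
    """True when any checked IP address is listed on more than one blacklist.

    A single hit is ignored as a likely anomaly (min_hits=2 means "greater than one").
    """
    return any(hits >= min_hits for hits in hit_counts)
```

For example, `blacklist_says_spam([1])` is `False`, while `blacklist_says_spam([0, 3])` is `True`.
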
Blacklisting Experimental Results
Of the 32 legitimate e-mails misclassified by the NN, 53% were identified as spam by the blacklists
Of the 35 spam e-mails misclassified by the NN, 97% were identified as spam by the blacklists
These poor results (more than half of the legitimate e-mails were still flagged as spam) indicate that the blacklisting strategy, at least for these databases, is inadequate
Conclusions
NN is competitive with the Naïve Bayesian studies despite using a much smaller feature set
Room for refinement of parsing for features
Use of descriptive, more human-like features makes NN less subject to degradation than Naïve Bayesian
Conclusions (cont.)
Neural Network approach is useful and accurate, but it classifies too many legitimate e-mails as spam (legitimate -> spam)
Should be powerful when used in conjunction with a whitelist to reduce legitimate -> spam ($n_{LS}$), increasing spam precision and legitimate accuracy
Blacklisting strategy is not very helpful
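
The whitelist combination suggested above was not implemented in the study; the following is only a sketch of what such a hybrid could look like. The function name `classify`, the 0.5 score threshold, and the example addresses are assumptions. Mail from a known correspondent bypasses the network entirely, so it can never be counted in $n_{LS}$.

```python
def classify(sender, nn_score, whitelist, threshold=0.5):
    """Combine a whitelist with a neural-network spam score.

    sender: e-mail address of the sender
    nn_score: network output in [0, 1], higher meaning more spam-like
    whitelist: set of lowercase addresses of known correspondents
    """
    if sender.lower() in whitelist:
        return "legitimate"  # known correspondents are always accepted
    return "spam" if nn_score > threshold else "legitimate"


# Hypothetical usage with made-up addresses.
known = {"friend@example.org"}
print(classify("Friend@example.org", 0.97, known))    # whitelisted despite high score
print(classify("stranger@example.com", 0.97, known))  # decided by the network
```
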