A Neural Network Classifier for Junk E-Mail
Ian Stuart, Sung-Hyuk Cha, and Charles Tappert
CSIS Student/Faculty Research Day
May 7, 2004

Spam, spam, spam, …

Fighting spam

Several commercial applications exist
  – Server-side: expensive
  – Client-side: time-consuming
No approach is 100% effective
  – Spammers are aggressive and adaptable
  – Best solutions are typically hybrids of different approaches and criteria

Common approaches

Simple filters
  – Common words or phrases
  – Unusual punctuation or capitalization
Blacklisting: “just say NO” (if you can)
  – Reject e-mail from known spammers
Whitelisting: “friends only, please”
  – Accept e-mail only from known correspondents
Classifiers: examine each e-mail and decide
  – Only a few publications on spam classifiers

Naïve Bayesian classifiers

Used in commercial classifiers
Assumes recognition features are independent
  – Max likelihood = product of likelihoods of features
E-mail classifier: examines each word
  – Training assigns a probability to each word
  – Look up each word/probability in a dictionary
  – If the product of the probabilities exceeds a given threshold, it is spam
Challenge: creating the “dictionary”
We compare our Neural Network against two published Naïve Bayesian classifiers
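The lookup-and-threshold rule on this slide can be sketched roughly as follows. This is a literal reading of the bullets, not a reconstruction of either published classifier. The dictionary `spam_prob`, the `default` probability for unseen words, and the threshold are made-up values for illustration. Logarithms are summed rather than multiplying raw probabilities, which gives the same decision while avoiding numeric underflow on long messages.

```python
import math

def log_product(words, spam_prob, default=0.4):
    """Log of the product of per-word probabilities (sum of logs)."""
    # Assumption: words missing from the dictionary get a fixed default value.
    return sum(math.log(spam_prob.get(w.lower(), default)) for w in words)

def is_spam(text, spam_prob, threshold, default=0.4):
    """Flag the message when the product of word probabilities exceeds the threshold."""
    return log_product(text.split(), spam_prob, default) > math.log(threshold)

# Hypothetical trained dictionary: word -> probability assigned in training.
spam_prob = {"free": 0.9, "money": 0.8, "tennis": 0.1}

print(is_spam("free money", spam_prob, threshold=0.5))
print(is_spam("tennis schedule", spam_prob, threshold=0.5))
```

Even in this toy form, the result depends entirely on which words are in the dictionary and what value new words receive, which is where the issues on the next slide come from.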

Naïve Bayesian classifier issues

How many features (words), which ones?
How is degradation avoided as spammers’ vocabulary changes?
What values are assigned to new words?
What are the thresholds?
How to avoid “sabotage” of classifier?

Which one isn’t spam? (subject headers)

5 Be a mighty warrior in bed! vcrhwt ygjztyjjh
Money Back Guarantee_HGH
kindle life pddez liw mzac
v a l i u m - D i a z e p a m used to relieve anxiety
Fairfield tennis schedule
:Dramatic E,nhancement fo=r .Men = f"fumqid
,Refina'nce now. Don't wait

Spammers make patterns

The more they try to hide, the easier it is to see them
Therefore, we use common spammer patterns (instead of vocabulary) as features for classification
Learn these patterns with a Neural Network

Neural Network features

Total of 17 features
  – 6 from the subject header
  – 2 from priority and content-type headers
  – 9 from the e-mail body

Features from subject header

1. Number of words with no vowels
2. Number of words with at least two of letters J, K, Q, X, Z
3. Number of words with at least 15 characters
4. Number of words with non-English characters, special characters such as punctuation, or digits at beginning or middle of word
5. Number of words with all letters in uppercase
6. Binary feature indicating 3 or more repeated characters
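One possible way to compute the six subject-header features is sketched below. The slides do not give the parsing rules, so several choices here are assumptions: words are split on whitespace, "y" is not treated as a vowel, feature 2 counts letter occurrences (so "jj" qualifies), feature 4 looks at every character except the last one in a word, and feature 6 means the same character three times in a row. The function name `subject_features` is hypothetical.

```python
import re

VOWELS = set("aeiou")        # assumption: "y" is not counted as a vowel
RARE_LETTERS = set("jkqxz")

def subject_features(subject):
    """Return the six subject-header features as a list, in slide order."""
    words = subject.split()

    def letters(word):
        return [c for c in word.lower() if c.isalpha()]

    # 1. Words that contain letters but no vowels.
    no_vowels = sum(1 for w in words if letters(w) and not VOWELS & set(letters(w)))
    # 2. Words with at least two occurrences of J, K, Q, X, Z.
    rare = sum(1 for w in words if sum(c in RARE_LETTERS for c in w.lower()) >= 2)
    # 3. Words with at least 15 characters.
    long_words = sum(1 for w in words if len(w) >= 15)
    # 4. Words with a non-ASCII-letter character before the final position.
    odd_chars = sum(
        1 for w in words if any(not (c.isascii() and c.isalpha()) for c in w[:-1])
    )
    # 5. Words whose letters are all uppercase.
    all_caps = sum(1 for w in words if w.isupper())
    # 6. Binary: some non-space character repeated three times in a row.
    repeated = int(re.search(r"(\S)\1\1", subject) is not None)
    return [no_vowels, rare, long_words, odd_chars, all_caps, repeated]

print(subject_features("5 Be a mighty warrior in bed! vcrhwt ygjztyjjh"))  # [2, 1, 0, 0, 0, 0]
print(subject_features("Fairfield tennis schedule"))                       # [0, 0, 0, 0, 0, 0]
```

Under these assumptions the one legitimate subject line from the earlier slide scores zero on every feature, while the random filler words in the first subject trip features 1 and 2.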

Features from priority and content-type headers

1. Binary feature indicating whether the priority had been set to any level besides normal or medium
2. Binary feature indicating whether a content-type header appeared within the message headers or whether the content type had been set to “text/html”
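A rough sketch of the two header features, using Python's standard `email` package. The slides do not say which header carries the priority or how its values are spelled, so reading `X-Priority` or `Priority` and treating "3", "normal", and "medium" as the normal levels are assumptions. The second feature is implemented as a literal reading of the slide's wording, which in practice reduces to "a Content-Type header is present". The function name `header_features` is hypothetical.

```python
from email import message_from_string

# Assumed spellings of "normal or medium" priority.
NORMAL_PRIORITIES = {"3", "normal", "medium"}

def header_features(raw_message):
    """Return [priority_flag, content_type_flag] for a raw RFC 822 message."""
    msg = message_from_string(raw_message)

    # Assumption: priority is carried in X-Priority or Priority.
    priority = (msg.get("X-Priority") or msg.get("Priority") or "").strip().lower()
    level = priority.split()[0] if priority else ""
    priority_flag = int(bool(level) and level not in NORMAL_PRIORITIES)

    # Literal reading of the slide: header present, or type set to text/html.
    has_header = msg.get("Content-Type") is not None
    is_html = msg.get_content_type() == "text/html"
    content_type_flag = int(has_header or is_html)

    return [priority_flag, content_type_flag]

print(header_features("X-Priority: 1 (Highest)\nContent-Type: text/html\n\nhi"))
print(header_features("Subject: hello\n\nhi"))
```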

Features from message body

1. Proportion of alphabetic words with no vowels and at least 7 characters
2. Proportion of alphabetic words with at least two of letters J, K, Q, X, Z
3. Proportion of alphabetic words at least 15 characters long
4. Binary feature indicating whether the strings “From:” and “To:” were both present
5. Number of HTML opening comment tags
6. Number of hyperlinks (“href=”)
7. Number of clickable images represented in HTML
8. Binary feature indicating whether a text color was set to white
9. Number of URLs in hyperlinks with digits or “&”, “%”, or “@”
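The first six body features can be approximated with plain string operations, as in the sketch below. It covers features 1 to 6 only, because the slides do not say how clickable images, white text, or suspicious URLs were detected. Treating an "alphabetic word" as a whitespace-separated token made up entirely of letters, and not counting "y" as a vowel, are assumptions. The function name `body_features` is hypothetical.

```python
VOWELS = set("aeiou")        # assumption: "y" is not counted as a vowel
RARE_LETTERS = set("jkqxz")

def body_features(body):
    """Return body features 1-6 as a list, in slide order."""
    # Assumption: an alphabetic word is a whitespace token made only of letters.
    words = [w.lower() for w in body.split() if w.isalpha()]
    total = len(words) or 1  # avoid division by zero on empty bodies

    no_vowel_long = sum(1 for w in words if len(w) >= 7 and not VOWELS & set(w))
    rare = sum(1 for w in words if sum(c in RARE_LETTERS for c in w) >= 2)
    very_long = sum(1 for w in words if len(w) >= 15)

    return [
        no_vowel_long / total,                      # 1. proportion
        rare / total,                               # 2. proportion
        very_long / total,                          # 3. proportion
        int("From:" in body and "To:" in body),     # 4. binary
        body.count("<!--"),                         # 5. HTML opening comment tags
        body.lower().count("href="),                # 6. hyperlinks
    ]

sample = 'hello qqxxzzkk <!-- note --> <a href="http://example.com">x</a> From: a To: b'
print(body_features(sample))
```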

Neural Network spam classifier

3-layer, feed-forward network (Perceptron)
  – 17 input units, variable # hidden layer units, 1 output unit
Data: 1,654 e-mails (854 spam, 800 legitimate)
Use half of each (spam/non-spam) for training, the other half for testing
Test with variations of hidden nodes (4 to 14) and epochs (100 to 500)
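The shape of such a network (17 inputs, one hidden layer, 1 output) can be illustrated with a forward pass in plain Python. This is only a sketch: the sigmoid activation, the random weight range, and the bias handling are assumptions, since the slides do not specify them, and training (the 100 to 500 epochs) is not shown. The names `init_network` and `forward` are hypothetical.

```python
import math
import random

def init_network(n_inputs=17, n_hidden=12, seed=0):
    """Random weights for a 17 -> n_hidden -> 1 network; last weight in each row is the bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)] for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(network, features):
    """Map a 17-value feature vector to a single score in (0, 1)."""
    hidden, output = network
    # zip stops at the shorter sequence, so the trailing bias weight is added separately.
    activations = [
        sigmoid(w[-1] + sum(wi * xi for wi, xi in zip(w, features))) for w in hidden
    ]
    return sigmoid(output[-1] + sum(wi * hi for wi, hi in zip(output, activations)))

# The slides vary the hidden layer from 4 to 14 units.
for n_hidden in range(4, 15):
    net = init_network(n_hidden=n_hidden)
    score = forward(net, [0.0] * 17)
    assert 0.0 < score < 1.0
```

A score above some cut-off (0.5 would be a natural but assumed choice) would then be read as "spam".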

Definitions used for classifier success measures

$n_{SS}$ = number of spam classified as spam
$n_{SL}$ = number of spam classified as legitimate
$n_{LL}$ = number of legitimate classified as legitimate
$n_{LS}$ = number of legitimate classified as spam

Measure of success: precision

Precision: the percentage of labeled spam/legitimate e-mail correctly classified

$$\text{Spam precision } (SP) = \frac{\text{correct spam}}{\text{all labeled spam}} = \frac{n_{SS}}{n_{SS} + n_{LS}}$$

$$\text{Legitimate precision } (LP) = \frac{\text{correct legitimate}}{\text{all labeled legitimate}} = \frac{n_{LL}}{n_{LL} + n_{SL}}$$

Measure of success: accuracy

Accuracy: the percentage of actual spam/legitimate e-mail correctly classified

$$\text{Spam recall } (SR) = \frac{\text{correct spam}}{\text{all actual spam}} = \frac{n_{SS}}{n_{SS} + n_{SL}}$$

$$\text{Legitimate recall } (LR) = \frac{\text{correct legitimate}}{\text{all actual legitimate}} = \frac{n_{LL}}{n_{LL} + n_{LS}}$$

Neural Network results

Best overall results with 12 hidden nodes at 500 epochs
  – Spam Precision: 92.45%
  – Legitimate Precision: 91.32%
  – Spam Accuracy: 91.80%
  – Legitimate Accuracy: 92.00%
35 spams misclassified: 8.20%
32 legitimates misclassified: 8.00%
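As a consistency check, the four percentages can be recomputed from the counts on the slides. The sketch assumes the test set is exactly half of each class (427 spam and 400 legitimate), which follows from "half of each for testing" and matches the 8.20% and 8.00% misclassification rates. The function name `success_measures` is hypothetical.

```python
def success_measures(n_ss, n_sl, n_ll, n_ls):
    """Precision and accuracy (recall) for both classes, from the four counts."""
    return {
        "spam_precision": n_ss / (n_ss + n_ls),
        "legit_precision": n_ll / (n_ll + n_sl),
        "spam_accuracy": n_ss / (n_ss + n_sl),
        "legit_accuracy": n_ll / (n_ll + n_ls),
    }

# Test set: half of 854 spam and half of 800 legitimate e-mails (assumed exact halves).
n_spam_test, n_legit_test = 854 // 2, 800 // 2
n_sl, n_ls = 35, 32                  # misclassified spam, misclassified legitimate
n_ss = n_spam_test - n_sl            # 392 spam classified as spam
n_ll = n_legit_test - n_ls           # 368 legitimate classified as legitimate

measures = success_measures(n_ss, n_sl, n_ll, n_ls)
for name, value in measures.items():
    print(f"{name}: {value:.2%}")
# spam_precision: 92.45%
# legit_precision: 91.32%
# spam_accuracy: 91.80%
# legit_accuracy: 92.00%
```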

Misclassified e-mails

Most spam misclassified as legitimate were short in length, with few hyperlinks
Most legitimate e-mails misclassified as spam had unusual features for personal e-mail (that is, they were “spam-like” in appearance)

Comparing Neural Network and Naïve Bayesian Classifiers

Accuracy of the NN classifier is comparable to that reported for Naïve Bayesian classifiers
NN classifier required fewer features (17 versus 100 in one study and 500 in another)
NN classifier uses descriptive qualities of words and messages similar to those used by human readers

Blacklisting Experiment

Manually entered IP addresses of e-mail incorrectly tagged by NN classifier
  – Entered first (original) IP address and, when present, second IP address (e.g., mail server or ISP)
Entered them into a website that sends IP addresses to 173 working spam blacklists and returns the # hits, http://www.declude.com/junkmail/support/ip4r.htm
Counted only hit counts greater than one as spam, since single-list hits were taken to be anomalies

Blacklisting Experimental Results

Of the 32 legitimate e-mails misclassified by the NN, 53% were identified as spam
Of the 35 spam e-mails misclassified by the NN, 97% were identified as spam
These poor results indicate that the blacklisting strategy, at least for these databases, is inadequate

Conclusions

NN is competitive with Naïve Bayesian studies despite using a much smaller feature set
Room for refinement of parsing for features
Use of descriptive, more human-like features makes NN less subject to degradation than Naïve Bayesian

Conclusions (cont.)

Neural Network approach is useful and accurate, but too many legitimate -> spam
Should be powerful when used in conjunction with a whitelist to reduce legitimate -> spam ($n_{LS}$), increasing spam precision and legitimate accuracy
Blacklisting strategy is not very helpful