Copyright 2004, David D. Lewis (Naive) Bayesian Text Classification for Spam Filtering David D. Lewis, Ph.D. Ornarose, Inc. & David D. Lewis Consulting www.daviddlewis.com Presented at ASA Chicago Chapter Spring Conference, Loyola Univ., May 7, 2004.



(Naive) Bayesian Text Classification for Spam Filtering

David D. Lewis, Ph.D.

Ornarose, Inc.

& David D. Lewis Consulting

www.daviddlewis.com

Presented at ASA Chicago Chapter Spring Conference, Loyola Univ., May 7, 2004.


Menu

• Spam
• Spam Filtering
• Classification for Spam Filtering
• Classification
• Bayesian Classification
• Naive Bayesian Classification
• Naive Bayesian Text Classification
• Naive Bayesian Text Classification for Spam Filtering
• (Feature Extraction for) Spam Filtering
• Text Classification (for Marketing)
• (Better) Bayesian Classification


Spam

• Unsolicited bulk email
  – or, in practice, whatever email you don’t want

• Large fraction of all email sent
  – Brightmail est. 64%, Postini est. 77%
  – Still growing

• Est. cost to US businesses exceeded $30 billion in 2003


Approaches to Spam Control

• Economic (email pricing, ...)

• Legal (CAN-SPAM, ...)

• Societal pressure (trade groups, ...)

• Securing infrastructure (email servers, ...)

• Authentication (challenge/response,...)

• Filtering


Spam Filtering

• Intensional (feature-based) vs. Extensional (white/blacklist-based)

• Applied at sender vs. receiver

• Applied at email client vs. mail server vs. ISP


Statistical Classification

1. Define classes of objects

2. Specify probability distribution model connecting classes to observable features

3. Fit parameters of model to data

4. Observe features on inputs and compute probability of class membership

5. Assign object to a class


[Diagram: Feature Extraction, Classifier, Classifier Interpreter]


Classification for Spam Filtering

• Define classes: [image] vs. [image] vs. [image]

• Extract features from header, content

• Train classifier

• Classify message and process:
  – Block message, insert tag, put in folder, etc.


Two Classes of Classifier

• Generative: Naive Bayes, LDA, ...
  – Model joint distribution of class and features
  – Derive class probability by Bayes rule

• Discriminative: logistic regression, CART, ...
  – Model conditional distribution of class given known feature values
  – Model directly estimates class probability
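The contrast between the two families can be written out as follows. This is an illustrative sketch rather than notation from the talk: the symbols c (class), x (feature vector), and the logistic-regression coefficients \beta_0 and \beta are assumptions.

```latex
% Generative: model P(x | c) and P(c), then apply Bayes rule.
P(c \mid x) = \frac{P(x \mid c)\, P(c)}{\sum_{c'} P(x \mid c')\, P(c')}

% Discriminative (two-class logistic regression as one example):
% model P(c | x) directly.
P(c = 1 \mid x) = \frac{1}{1 + \exp\bigl(-(\beta_0 + \beta^{\top} x)\bigr)}
```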


Bayesian Classification (1)

1. Define classes

2. Specify probability model

2b. And prior distribution over parameters

3. Find posterior distribution of model parameters, given data

4. Compute class probabilities using posterior distribution (or element of it)

5. Classify object
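In symbols, the Bayesian steps on this slide can be sketched as below. The notation is assumed for illustration and does not come from the talk: \theta stands for the model parameters, D for the training data, x for the features of a new object, and c for its class.

```latex
% Step 3: posterior over parameters from likelihood and prior.
p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)

% Step 4, using the full posterior distribution:
P(c \mid x, D) = \int P(c \mid x, \theta)\, p(\theta \mid D)\, d\theta

% Step 4, using a single element of the posterior (e.g. its mode):
\hat{\theta} = \arg\max_{\theta}\, p(\theta \mid D), \qquad
P(c \mid x, D) \approx P(c \mid x, \hat{\theta})
```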


Bayesian Classification (2)

• = “Naive”/“Idiot”/“Simple” Bayes

• A particular generative model
  – Assumes independence of observable features within each class of messages
  – Bayes rule used to compute class probability

• Might or might not use a prior on model parameters
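The independence assumption and the Bayes rule step can be sketched in Python. This is a minimal illustration, not the implementation discussed in the talk. It assumes binary (present/absent) features, and the function names `train_nb` and `class_probabilities` and the pseudo-count `alpha` (one simple way of putting a prior on the parameters) are hypothetical.

```python
import math
from collections import Counter

def train_nb(messages, labels, alpha=1.0):
    """Fit a Bernoulli naive Bayes model (illustrative sketch).

    messages: list of sets of binary features (e.g. words present).
    labels:   list of class names, one per message.
    alpha:    smoothing pseudo-count (an assumption, acts like a simple prior).
    """
    vocab = set().union(*messages)
    class_counts = Counter(labels)
    feature_counts = {c: Counter() for c in class_counts}
    for feats, c in zip(messages, labels):
        feature_counts[c].update(feats)
    n = len(labels)
    model = {"vocab": vocab, "prior": {}, "p": {}}
    for c, n_c in class_counts.items():
        model["prior"][c] = n_c / n
        # P(feature present | class), smoothed so it is never 0 or 1.
        model["p"][c] = {
            f: (feature_counts[c][f] + alpha) / (n_c + 2 * alpha) for f in vocab
        }
    return model

def class_probabilities(model, feats):
    """Apply Bayes rule, treating features as independent within each class."""
    log_joint = {}
    for c, prior in model["prior"].items():
        total = math.log(prior)
        for f in model["vocab"]:
            p = model["p"][c][f]
            total += math.log(p if f in feats else 1.0 - p)
        log_joint[c] = total
    # Normalize in log space for numerical stability.
    m = max(log_joint.values())
    norm = sum(math.exp(v - m) for v in log_joint.values())
    return {c: math.exp(v - m) / norm for c, v in log_joint.items()}

# Toy usage with made-up messages.
toy_messages = [{"buy", "viagra"}, {"buy", "now"}, {"meeting", "now"}, {"lunch", "meeting"}]
toy_labels = ["spam", "spam", "ham", "ham"]
toy_model = train_nb(toy_messages, toy_labels)
print(class_probabilities(toy_model, {"buy", "viagra"}))
```

Setting `alpha` close to zero approaches plain maximum-likelihood estimates, which corresponds to the "might not use a prior" case.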


Naive Bayes for Text Classification - History

• Maron (JACM, 1961) – automated indexing

• Mosteller and Wallace (1964) – author identification

• Van Rijsbergen, Robertson, Sparck Jones, Croft, Harper (early 1970s) – search engines

• Sahami, Dumais, Heckerman, Horvitz (1998) – spam filtering


Bayesian Classification (3)

• Graham’s A Plan for Spam
  – And its mutant offspring...

• Naive Bayes-like classifier with weird parameter estimation

• Widely used in spam filters
  – Classic Naive Bayes superior when appropriately used


NB & Friends: Advantages

• Simple to implement
  – No numerical optimization, matrix algebra, etc.

• Efficient to train and use
  – Fitting = computing means of feature values
  – Easy to update with new data
  – Equivalent to linear classifier, so fast to apply

• Binary or polytomous
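Three of the points above (fitting by counting, cheap updates, and equivalence to a linear classifier) can be illustrated together. The sketch below assumes binary features and two classes named "spam" and "ham". The class `CountingNB`, its method names, and the smoothing pseudo-count `alpha` are hypothetical choices made for the example.

```python
import math
from collections import Counter

class CountingNB:
    """Two-class Bernoulli naive Bayes kept as running counts (illustrative)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # smoothing pseudo-count (assumed)
        self.n = {"spam": 0, "ham": 0}          # messages seen per class
        self.counts = {"spam": Counter(), "ham": Counter()}

    def update(self, features, label):
        """Add one labeled message; only counts change, nothing is refit."""
        self.n[label] += 1
        self.counts[label].update(set(features))

    def _p(self, f, label):
        # Smoothed fraction of messages in the class that contain feature f.
        return (self.counts[label][f] + self.alpha) / (self.n[label] + 2 * self.alpha)

    def linear_form(self):
        """Return (bias, weights) with log-odds(spam) = bias + sum of the
        weights of the features present in a message."""
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        bias = math.log((self.n["spam"] + self.alpha) / (self.n["ham"] + self.alpha))
        weights = {}
        for f in vocab:
            ps, ph = self._p(f, "spam"), self._p(f, "ham")
            bias += math.log((1 - ps) / (1 - ph))   # contribution if f is absent
            weights[f] = math.log(ps / (1 - ps)) - math.log(ph / (1 - ph))
        return bias, weights

    def log_odds(self, features):
        bias, weights = self.linear_form()
        return bias + sum(weights.get(f, 0.0) for f in set(features))

# Toy usage with made-up messages.
nb = CountingNB()
nb.update(["buy", "viagra"], "spam")
nb.update(["buy", "now"], "spam")
nb.update(["meeting", "now"], "ham")
nb.update(["lunch", "meeting"], "ham")
print(nb.log_odds(["buy", "viagra"]) > 0)   # positive log-odds means "spam"
```

Because the model is just a bias plus one weight per present feature, applying it costs time proportional to the number of features in the message.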


NB & Friends: Advantages

• Independence allows parameters to be estimated on different data sets, e.g.
  – Estimate content features from messages with headers omitted
  – Estimate header features from messages with content missing


NB & Friends: Advantages

• Generative model
  – Comparatively good effectiveness with small training sets
  – Unlabeled data can be used in parameter estimation (in theory)


NB & Friends: Disadvantages

• Independence assumption wrong
  – Absurd estimates of class probabilities
  – Threshold must be tuned, not set analytically

• Generative model
  – Generally lower effectiveness than discriminative techniques (e.g. logistic regression)
  – Improving parameter estimates can hurt classification effectiveness
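The "absurd estimates" point can be seen in a small numeric sketch. Suppose one piece of evidence shows up as several perfectly correlated features, and naive Bayes counts each copy as independent evidence. The per-feature likelihood ratio of 3 and the even prior are made-up values, and the function name `nb_posterior` is hypothetical.

```python
def nb_posterior(likelihood_ratio, copies, prior_spam=0.5):
    """Posterior P(spam) when naive Bayes treats `copies` perfectly
    correlated features as if they were independent evidence."""
    odds = prior_spam / (1.0 - prior_spam) * likelihood_ratio ** copies
    return odds / (1.0 + odds)

# The same underlying evidence, counted 1, 5, and 10 times.
for k in (1, 5, 10):
    print(k, nb_posterior(3.0, k))
```

With a single feature the estimate is 0.75. With ten copies of the same evidence it rises above 0.9999, even though no new information was added. The ranking of messages may still be useful, which is why the decision threshold is tuned rather than set analytically.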


Feature Extraction

• Convert message to feature vector

• Header: sender, recipient, routing, …
  – Possibly break up domain names

• Text
  – Words, phrases, character strings
  – Become binary or numeric features

• URLs, HTML tags, images, …
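A rough sketch of this kind of feature extraction, using Python's standard `email` and `re` modules, is shown below. The feature naming scheme (`from_domain:`, `subject:`, `body:`, `has_url`, `has_img_tag`), the function name `extract_features`, and the restriction to single-part plain-text messages are all assumptions made for the example.

```python
import re
from email import message_from_string

def extract_features(raw_message):
    """Turn a raw email message into a set of binary features (illustrative)."""
    msg = message_from_string(raw_message)
    features = set()

    # Header features: sender domain, also broken up into its parts.
    sender = msg.get("From", "")
    match = re.search(r"@([\w.-]+)", sender)
    if match:
        domain = match.group(1).lower()
        features.add("from_domain:" + domain)
        for part in domain.split("."):
            features.add("from_domain_part:" + part)

    # Subject words.
    for word in re.findall(r"[a-z0-9']+", msg.get("Subject", "").lower()):
        features.add("subject:" + word)

    # Body features; this sketch only handles single-part text bodies.
    body = msg.get_payload()
    if isinstance(body, str):
        for word in re.findall(r"[a-z0-9']+", body.lower()):
            features.add("body:" + word)
        if re.search(r"https?://", body, re.IGNORECASE):
            features.add("has_url")
        if re.search(r"<img\b", body, re.IGNORECASE):
            features.add("has_img_tag")
    return features

# Made-up example message.
raw = (
    "From: Sam <sam@mail.example.com>\n"
    "To: you@example.org\n"
    "Subject: Buy now\n"
    "\n"
    "Cheap pills at http://example.com/x\n"
)
print(sorted(extract_features(raw)))
```

The resulting set can be used directly as binary features, or replaced with counts to obtain numeric features.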


From: Sam Elegy <[email protected]>
To: [email protected]
Subject: you can buy V!@gra

• Spamlike content in image form

• Irrelevant legit content; doubles as hash buster

• Typographic variations

• Randomly generated name and email


Defeating Feature Extraction

• Misspellings, character set choice, HTML games: mislead extraction of words

• Put content in images

• Forge headers (to avoid identification, but also interferes with classification)

• Innocuous content to mimic distribution in nonspam

• Hashbusters (zyArh73Gf) clog dictionaries
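A small sketch shows why typographic variations mislead word extraction. It assumes a deliberately simple tokenizer (lowercase alphabetic runs only); real filters tokenize in many different ways, and the function name `words` is hypothetical.

```python
import re

def words(text):
    """Naive word extraction: lowercase alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())

print(words("you can buy viagra"))   # ['you', 'can', 'buy', 'viagra']
print(words("you can buy V!@gra"))   # ['you', 'can', 'buy', 'v', 'gra']
```

The obfuscated spelling never produces the token the classifier has learned to associate with spam, so that evidence is lost before any modeling takes place.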


Survival of the Fittest

• Filter designers get to see spam

• Spammers use spam filters

• Unprecedented arms race for a statistical field

• Countermeasures mostly target feature extraction, not modeling assumptions


Miscellany

1. Getting legitimate bulk mail past spam filters

2. Other uses of text classification in marketing

3. Frontiers in Bayesian classification


Getting Legit Bulk Email Past Filters

• Test email against several filters
  – Send to accounts on multiple ISPs
  – Multiple client-based filters if particularly concerned

• Coherent content, correctly spelled

• Non-tricky headers and markup

• Avoid spam keywords where possible

• Don’t use spammer tricks


Text Classification in Marketing

• Routing incoming email
  – Responses to promotions
  – Detect opportunities for selling
  – (Automated response sometimes possible)

• Analysis of text/mixed data on customers
  – e.g. customer or CSR comments

• Content analysis
  – Focus groups, email, chat, blogs, news, …


Better Bayesian Classification

• Discriminative
  – Logistic regression with informative priors
  – Sharing strength across related problems
  – Calibration and confidence of predictions

• Generative
  – Bayesian networks/graphical models
  – Use of unlabeled and partially labeled data

• Hybrid


BBR

• Logistic regression w/ informative priors
  – Gaussian = ridge logistic regression
  – Laplace = lasso logistic regression

• Sparse data structures & fast optimizer
  – 10^4 cases, 10^5 predictors, few seconds!

• Accuracy competitive with SVMs

• Free for research use
  – www.stat.rutgers.edu/~madigan/BBR/

• Joint work w/ Madigan & Genkin (Rutgers)
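The correspondence between the priors and the ridge and lasso penalties can be sketched as a maximum a posteriori (MAP) objective. The notation is assumed for illustration and is not taken from the talk or from the BBR software: labels y_i in {-1, +1}, feature vectors x_i, coefficients \beta, and prior hyperparameters \tau (Gaussian variance) and \lambda (Laplace rate).

```latex
% MAP estimate for logistic regression with a prior p(\beta):
\hat{\beta} = \arg\max_{\beta}
  \left[ -\sum_{i} \log\bigl(1 + \exp(-y_i\, \beta^{\top} x_i)\bigr)
         + \log p(\beta) \right]

% Gaussian prior, \beta_j \sim N(0, \tau): quadratic (ridge) penalty.
\log p(\beta) = -\frac{1}{2\tau} \sum_{j} \beta_j^{2} + \text{const}

% Laplace prior with rate \lambda: absolute-value (lasso) penalty.
\log p(\beta) = -\lambda \sum_{j} \lvert \beta_j \rvert + \text{const}
```

The Laplace penalty tends to drive many coefficients exactly to zero at the MAP estimate, which gives sparse models.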


Gaussian vs. Laplace Prior

[Figure: two panels labeled Gaussian and Laplace]


Future of Spam Filtering

• More attention to training data selection, personalization

• Image processing

• Robustness against word variations

• More linguistic sophistication

• Replacing naive Bayes with better learners

• Keep hoping for economic cure


Summary

• By volume, spam filtering is easily the biggest application of text classification
  – Possibly of supervised learning

• Filters have helped a lot
  – Naive Bayes is just a starting point

• Other interesting applications of Bayesian classification