
Using Naive Bayes to Detect Spammy Names in Social Networks

David Mandell Freeman
LinkedIn Corporation
2029 Stierlin Ct., Mountain View, CA 94043 USA
[email protected]

ABSTRACT

Many social networks are predicated on the assumption that a member’s online information reflects his or her real identity. In such networks, members who fill their name fields with fictitious identities, company names, phone numbers, or just gibberish are violating the terms of service, polluting search results, and degrading the value of the site to real members. Finding and removing these accounts on the basis of their spammy names can both improve the site experience for real members and prevent further abusive activity.

In this paper we describe a set of features that can be used by a Naive Bayes classifier to find accounts whose names do not represent real people. The model can detect both automated and human abusers and can be used at registration time, before other signals such as social graph or clickstream history are present. We use member data from LinkedIn to train and validate our model and to choose parameters. Our best-scoring model achieves AUC 0.85 on a sequestered test set.

We ran the algorithm on live LinkedIn data for one month in parallel with our previous name scoring algorithm based on regular expressions. The false positive rate of our new algorithm (3.3%) was less than half that of the previous algorithm (7.0%).

When the algorithm is run on email usernames as well as user-entered first and last names, it provides an effective way to catch not only bad human actors but also bots that have poor name and email generation algorithms.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval — Spam; I.2.6 [Artificial Intelligence]: Learning

Keywords

Social networks, spam detection, Naive Bayes classifier

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
AISec’13, November 4, 2013, Berlin, Germany.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2488-5/13/11 ...$15.00.
http://dx.doi.org/10.1145/2517312.2517314

1. INTRODUCTION

One of the fundamental characteristics of many social networks is that a user’s online identity is assumed to be an accurate reflection of his or her real identity. For example, the terms of service of Facebook [13], LinkedIn [20], and Google+ [15] all require users to identify themselves by their real names. However, many users still sign up for these social networks with fake names. Often these users intend to engage in abusive activity, such as sending spam messages or posting malicious links. These users may use automation to create accounts, in which case fake names are generated from a predefined dictionary or even as random strings of characters.

In addition to the dedicated spammers, many users with no intention of abuse will fill in their personal information with something other than (or in addition to) their real name. Some users simply cannot be bothered to take the time to enter their personal information, and fill in name fields with strings such as “asdf,” while others may choose to mask their identities out of privacy concerns. Still others may use the name fields to enter job titles, phone numbers, email addresses, or other non-name information, either out of hopes that this information will be surfaced more easily in searches or out of misunderstanding of the field labels. These accounts, even if not actively abusing the site, are still undesirable, as they may pollute search results and degrade the value of the site to members who abide by the terms of service.

Many previous authors have considered the problem of detecting spam and fake accounts in social networks, using signals such as clickstream patterns [30, 29], message-sending activity and content [3, 14], and properties of the social graph [8, 7, 6]. However, to our knowledge none have considered the problem of detecting fake accounts from name information only.

An approach using only name data has several appealing features:

• It can be used to detect both automated and human abusers and both malicious and non-malicious fakes.

• It can be used at registration time, when other signals such as clickstream history and connections graph are not present.

• It can detect fakes preemptively rather than waiting for the first round of abuse to occur.

Furthermore, finding fake names can be used as a complementary strategy to existing spam-detection methods and can be used as part of a co-training process [5] to create a broad model for fake accounts.

1.1 Our contribution

We propose a robust, language-agnostic algorithm for spam detection in user-entered names and other very small texts. Specifically, we describe a set of features that can be extracted from such texts and used in a Naive Bayes classifier applied to these texts, as well as some algorithmic tweaks that improve the classifier’s performance. We demonstrate the effectiveness of our classifier by running the algorithm on LinkedIn members’ name data, including both a labeled validation set and unlabeled data from production, and computing several performance metrics. Implementing the algorithm on production data cut the false positive rate by more than half as compared with LinkedIn’s previous method of name spam detection. We also show that running the algorithm on email usernames can supplement the name spam detection by correctly classifying some borderline cases.

1.2 Our Approach

We build our name-spam detection algorithm using a Naive Bayes classifier. This classifier has a long and successful history of dealing with the problem of email spam [25, 2, 16, 26, 27, 1]. However, all proposed uses of Naive Bayes for spam classification use individual words or groups of words as features. Using individual names as features would not be an effective means of detecting name spam. For example, if each name consisted of a single word, the Naive Bayes classifier would be equivalent to determining whether the percentage of fake accounts with a given name was above or below some predetermined threshold. Even more problematically, a whole-name-based classifier would be powerless to classify a name it had not seen before; all unique names would have to share the same classification, a significant problem on a global social network such as LinkedIn, where more than one third of all accounts have unique names [18].

Instead of using whole words, we use as our base feature set the n-grams of letters that comprise these words. Specifically, for various small values of $n$ we extract from an $m$-character name the $(m - n + 1)$ substrings of $n$ consecutive characters. For example, with $n = 3$ the name David breaks into the three n-grams (Dav, avi, vid). We then augment the feature set by adding phantom “start-of-word” and “end-of-word” characters (represented by \^ and \$, respectively) to each name and incorporating these into our n-grams; under this transformation the 3-grams for David are (\^Da, Dav, avi, vid, id\$).

1.3 Missing Features

It may occur that a name we wish to classify contains an n-gram that does not appear in our training set. There are two standard approaches to dealing with missing features in Naive Bayes: to ignore them, or to assign a value to a “missing feature” feature. We consider both options. To assign a value to the “missing feature” feature we borrow a technique from natural language processing: we randomly divide the training set in two and add up the category counts for features that appear in only one half of the set. Contrary to the results of Kohavi, Becker, and Sommerfield [17], we find that adding the “missing feature” feature significantly improves our results. More generally, we can apply the “missing feature” feature to any feature with fewer than $c$ instances in the training set, with the optimal value of $c$ determined experimentally on a validation set.

To further improve our treatment of missing features, we observe that for an n-gram that does not appear in the training set, it may be the case that the two component $(n-1)$-grams do appear in the training set. In this case we can replace the missing n-gram feature with the two $(n-1)$-gram features; if one or both of these features are missing we can iterate recursively, up to the point where we are considering 1-grams. We find that this treatment gives a significant performance boost.

1.4 Experimental Results

We trained our classifier on a representative sample of 60 million LinkedIn accounts, with spam labels provided by the LinkedIn Security team. After optimizing parameters on a validation set of approximately 100,000 accounts, we measured the performance of two different versions of our classifier on a held-out test set of approximately 100,000 accounts. Both the validation and test sets were designed to consist of roughly half spam accounts.

We tested two versions of our algorithm: a “full” version using 5-grams and a “lightweight” version using 3-grams that uses less than 10% as much memory as the full version. After optimizing our algorithm’s parameters, for the full algorithm we obtained an AUC of 0.85 and a max $F_{1/8}$-score of 0.92. For the lightweight version we obtained an AUC of 0.80 and a max $F_{1/8}$-score of 0.93. Further statistics can be found in Section 5. We consider the $F_{1/8}$-score instead of the standard $F_1$-score because the cost of a false positive (blocking a good member) is significantly higher than the cost of a false negative (leaving a fake account alone).

We also used our algorithm to find several hundred false negatives (i.e., spam accounts that had not been caught) in our test set.

Finally, we ran our lightweight algorithm on the primary email addresses associated with accounts in our training and test sets, and found a noticeable improvement when the name and email features were combined.

2. MULTINOMIAL NAIVE BAYES

We begin by reviewing the Naive Bayes classifier [12, 23]. The task of this classifier is to approximate an unknown function $f : \mathcal{F} \to \mathcal{L}$, where $\mathcal{F} = F_1 \times \cdots \times F_m$ is the feature space and $\mathcal{L}$ is the label space. In our application $\mathcal{L}$ is the binary set $\{0, 1\}$. If $X = (X_1, \ldots, X_m)$ is a random variable taking values in $\mathcal{F}$ and $Y$ is a random variable taking values in $\mathcal{L}$, then $f(\vec{x})$ can be estimated by computing $p_y = p(Y = y \mid X = \vec{x})$ and outputting the value of $y$ for which $p_y$ is maximized.

By Bayes’ theorem and the law of total probability we have

$$ p(Y = y \mid X = \vec{x}) = \frac{p(Y = y) \cdot p(X = \vec{x} \mid Y = y)}{\sum_{y'} p(Y = y') \cdot p(X = \vec{x} \mid Y = y')}. \qquad (2.1) $$

Now suppose we are trying to classify a text document that is represented as an ordered sequence of words $(x_1, \ldots, x_m)$; i.e., the feature $F_i$ represents the word in the $i$th position. The multinomial Naive Bayes classifier makes the assumption that each word is generated independently from a multinomial distribution.¹ Under this assumption, if we let $\theta_{wy}$ be the probability that the word $w$ appears in class $y$, then we have

$$ p(X = \vec{x} \mid Y = y) = \frac{\left(\sum_w f_w\right)!}{\prod_w f_w!} \prod_w (\theta_{wy})^{f_w}, \qquad (2.2) $$

where $f_w$ denotes the number of instances of word $w$ in the vector $(x_1, \ldots, x_m)$ and the products are taken over all words $w$ in the vocabulary.

¹We choose this model over others such as multivariate Bernoulli and multivariate Gauss [22] because results from the literature [26, 21] suggest that this model typically performs best on text classification.

If the classification is binary and $Y$ takes values 0 and 1, then we can divide the numerator and denominator of (2.1) by the terms with $y = 0$ to get

$$ s(\vec{x}) := p(Y = 1 \mid X = \vec{x}) = \frac{\frac{p(Y=1)}{p(Y=0)}\, e^{R(\vec{x})}}{1 + \frac{p(Y=1)}{p(Y=0)}\, e^{R(\vec{x})}}, \qquad (2.3) $$

where

$$ R(\vec{x}) = \log\left(\frac{p(X = \vec{x} \mid Y = 1)}{p(X = \vec{x} \mid Y = 0)}\right) $$

is the log of the conditional probability ratio for the feature vector $\vec{x}$. Substituting (2.2) into the formula for $R(\vec{x})$ and taking logarithms gives

$$ R(\vec{x}) = \sum_w f_w \cdot \log\left(\frac{\theta_{w1}}{\theta_{w0}}\right). \qquad (2.4) $$

(Observe that in a short document almost all of the $f_w$ will be equal to zero.)

Performing the classification task now comes down to computing estimates for the values $p(Y = y)$ (the class priors) and the parameter estimates $\theta_{wy}$. We estimate $\theta_{wy}$ from training data as

$$ \theta_{wy} = \frac{N_{w,y} + \alpha_{w,y}}{N_y + \sum_w \alpha_{w,y}}, \qquad (2.5) $$

where $N_{w,y}$ indicates the number of instances of word $w$ occurring in class $y$ in the training set and $N_y$ is the sum of the $N_{w,y}$ over all features. The parameter $\alpha_{w,y}$ is a smoothing parameter and corresponds to adding $\alpha_{w,y}$ imagined instances of feature $w$ in category $y$ to the training data. Setting a nonzero smoothing parameter prevents a feature with examples in only one class from forcing the probability estimate $s(\vec{x})$ to be 0 or 1. Standard Laplace smoothing sets $\alpha_{w,y} = 1$ for all $w$ and $y$; other techniques are to set $\alpha_{w,y}$ to a small real number $\delta$ independent of $w$, or to set $\alpha_{w,y} = \delta/N_{w,y}$. This last method is called interpolated smoothing by Chen and Goodman [9].

The Naive Bayes algorithm has demonstrated great success in classification problems (see, e.g., [11, 24, 21]) even though the value $s(\vec{x})$ of (2.3) is often highly inaccurate as an estimate of probability [4, 31, 10]. However, if we interpret $s(\vec{x})$ as a score for the feature vector $\vec{x}$, we can then use a validation set to choose a classification cutoff that depends on the relative cost of misclassifying in each class. (See Section 5.1 for further discussion.)

Implementation of the algorithm, once a suitable training set has been obtained, is straightforward. As long as the number of features in the training set is no more than a few million, the data required for classifying new instances can be kept in main memory on an ordinary PC or run on a distributed system such as Hadoop. Thus once the data is loaded, the algorithm can quickly classify many instances. We observe that computing $R(\vec{x})$ rather than $e^{R(\vec{x})}$ speeds up computation and eliminates underflow errors caused by multiplying together many small values.
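As a concrete illustration, log-space scoring might look like the following minimal Python sketch (function and variable names are ours, not the paper's implementation; it assumes a precomputed table mapping each n-gram $w$ to $\log(\theta_{w1}/\theta_{w0})$):

    import math

    def score(features, log_ratios, log_prior_ratio):
        """Estimate s(x) = p(Y = 1 | X = x) for a bag of n-gram features.

        log_ratios      -- dict mapping n-gram w to log(theta_w1 / theta_w0)
        log_prior_ratio -- log(p(Y = 1) / p(Y = 0))
        """
        # R(x) as in (2.4): iterating over the extracted n-grams counts
        # each feature f_w times. Unseen n-grams are ignored here
        # (log-ratio 0); Section 4.2 refines this choice.
        r = sum(log_ratios.get(w, 0.0) for w in features)
        # s(x) of (2.3) is a logistic function of log_prior_ratio + R(x);
        # staying in log space avoids underflow from tiny products.
        t = log_prior_ratio + r
        return 1.0 / (1.0 + math.exp(-t)) if t > -700.0 else 0.0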

3. THE BASIC ALGORITHM

3.1 N-grams of Letters

In typical email spam classifiers using Naive Bayes (e.g., [25, 26, 1]) the features $w$ consist of words or phrases in a document, with separation indicated by whitespace characters or punctuation. For short texts such as names these features are clearly inadequate — most names will only have two features! Thus instead of entire words, we use n-grams of letters as our main features. Formally, we have the following:

Definition 3.1. Let $W = c_1 c_2 \cdots c_m$ be an $m$-character string. We define $\text{n-grams}(W)$ to be the $(m - n + 1)$ substrings of $n$ consecutive characters in $W$:

$$ \text{n-grams}(W) = (c_1 c_2 \cdots c_n,\; c_2 c_3 \cdots c_{n+1},\; \ldots,\; c_{m-n+1} c_{m-n+2} \cdots c_m). $$

For example, with $n = 3$ the name “David” breaks into the three n-grams (Dav, avi, vid).
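In Python, Definition 3.1 amounts to a one-line list comprehension; the small helper below (our naming) is reused in later sketches:

    def ngrams(word, n):
        """Return the (m - n + 1) substrings of n consecutive characters."""
        return [word[i:i + n] for i in range(len(word) - n + 1)]

    ngrams("David", 3)  # ['Dav', 'avi', 'vid']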

Our choice of overlapping n-grams as features blatantly violates the Naive Bayes assumption that the features used in classification are independent. However, as observed by numerous previous authors (e.g., [10]), the classifier may still have good performance despite the dependencies among features; see Section 5.2 for further discussion.

Since names on LinkedIn consist of distinct first name and last name fields (neither of which is allowed to be empty), we have a choice in labeling our features: we can consider an n-gram coming from a first name to be either identical to or distinct from an n-gram coming from a last name. We considered both approaches in our initial trials and came to the conclusion that using distinct labels for first and last name n-grams performs best for all values of $n$. (See Figure 1.)

3.2 Dataset and Metrics

We trained our classifier on a sample of roughly 60 million LinkedIn accounts that were either (a) in good standing as of June 14, 2013 or (b) had been flagged by the LinkedIn Security team as fake and/or abusive at some point before that date. We labeled accounts in (a) as good and those in (b) as spam. We sampled accounts not in our training set to create a validation/test set of roughly 200,000 accounts. Since the incidence of spam accounts in our data set is very low, we biased our validation set to contain roughly equal numbers of spam and non-spam accounts. This bias allows us both to test the robustness of our algorithm to variations in the types of names on spam accounts and to speed up our testing process.

We began with a preprocessing step that computed tables of n-gram frequencies from raw name data for $n = 1, \ldots, 5$. Our algorithms are fully Unicode-aware, which allows us to classify international names but leads to a large number of n-grams. To reduce memory usage, we ignored n-grams that had only a single instance in the training set. Table 1 shows the number of distinct n-grams for each value of $n$ and the corresponding amount of memory required to store the table of tuples $(w, \log(\theta_{w1}/\theta_{w0}))$ for all n-grams $w$.

         first/last distinct        first/last combined
  n      n-grams      memory        n-grams      memory
  1       15,598       25 MB          8,235       24 MB
  2      136,952       52 MB         86,224       45 MB
  3      321,273      110 MB        252,626      108 MB
  4    1,177,675      354 MB        799,985      335 MB
  5    3,252,407      974 MB      2,289,191      803 MB

Table 1: Number of distinct n-grams in our training set for $n = 1, \ldots, 5$ and the amount of memory required to store the tables in our Python implementation. Data on the left considers the sets of n-grams from first and last names independently, while data on the right considers the sets cumulatively.
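A hypothetical sketch of this preprocessing step, reusing the ngrams helper above and simplifying away the first/last field distinction:

    from collections import Counter

    def ngram_counts(names, labels, n):
        """Per-class n-gram counts N_{w,y}, with labels y in {0, 1};
        n-grams having only a single instance overall are dropped."""
        counts = [Counter(), Counter()]
        for name, y in zip(names, labels):
            counts[y].update(ngrams(name, n))
        total = counts[0] + counts[1]
        for w, k in total.items():
            if k < 2:
                for c in counts:
                    c.pop(w, None)
        return counts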

3.3 Initial results

In our experiments we use the probability estimate $s(\vec{x})$ of (2.3) to assign to each test case a real-number score between 0 and 1. When comparing different models we use area under the ROC curve (AUC) as our primary metric to compare the outcomes of two trials. We favor this metric because it does not require us to make a decision on a threshold for spam labeling and/or the relative weights of false positives and false negatives. In addition, AUC is insensitive to bias in our validation set. Specifically, since our validation set is constructed as a sample of some fraction $x$ of spam accounts and a different fraction $y$ of non-spam accounts, changing the values of $x$ and $y$ should have no impact on the expected value of AUC.
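Given per-account scores, the metric itself is a one-liner with standard tooling such as scikit-learn (our choice; the paper does not name its evaluation tooling):

    from sklearn.metrics import roc_auc_score

    # scores: s(x) per account; y: 1 for the class whose probability
    # s(x) estimates, 0 otherwise.
    auc = roc_auc_score(y, scores)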

In testing variants and parameters, we use a validation set of 97,611 accounts (48,461 spam and 49,150 not spam). Figure 1 shows results for our initial scoring, with values of $n$ ranging from 1 to 5. We ran the classifier using two different feature sets: one where first name and last name n-gram data are combined, and one where they are distinct. We see that using distinct features is superior for all values of $n$. All subsequent tests were conducted using distinct features.

It appears from the data that increasing $n$ improves the classification only up to a certain point — the results for $n = 5$ are worse than those for $n = 4$. We conjecture that the number of 5-grams that appear in the validation set but not in the training set offsets the improvement from increased granularity. See Section 4.2 for further discussion and ways to ameliorate the situation.

We also considered various smoothing parameters: Laplace smoothing $\alpha_{w,y} = \delta$ and interpolated smoothing $\alpha_{w,y} = \delta/N_{w,y}$, each for $\delta \in \{0.01, 0.1, 1, 10, 100\}$. To our surprise, and contrary to the results of Kohavi, Becker, and Sommerfield [17], we found that a constant $\alpha_{w,y} = 0.1$ performs best in our experiments, slightly outperforming interpolated smoothing with $\delta = 10$ (see Figure 2). We use this smoothing parameter for all subsequent experiments.


Figure 1: AUC for basic algorithm, with first and last name features distinct or combined.


Figure 2: AUC for basic algorithm with various smoothing parameters. We omit the plots for $n = 1$ and $n = 2$ since all parameters perform approximately the same. We omit the plots for $\delta = 100$ since the results are significantly worse than those plotted.


Figure 3: AUC for basic algorithm, with and without initial and terminal clusters.

4. IMPROVING THE FEATURE SET

4.1 Initial and terminal clusters

To augment the feature set, we observe that certain n-grams may be more or less likely to correspond to fake accounts if they start or end a name. For example, in our training set the proportion of spam accounts with the 2-gram zz appearing at the start of a name is 13 times the proportion of spam accounts with zz appearing anywhere in the name. We thus augment the feature set by adding phantom “start-of-word” and “end-of-word” characters to each name before splitting it into n-grams.

Formally, we add two new symbols \^ and \$ to our dictionary of letters. For an $m$-character name $W = c_1 c_2 \cdots c_m$ we set $c_0 = $ \^, $c_{m+1} = $ \$, $W' = c_0 c_1 \cdots c_m c_{m+1}$, and compute $\text{n-grams}(W')$ using Definition 3.1. (Clearly this only adds information if $n > 1$.) With these phantom characters added, the name David breaks into 3-grams as (\^Da, Dav, avi, vid, id\$).
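Continuing the earlier sketch, the padding step can reuse the same ngrams helper (the marker characters are our stand-ins for the paper's \^ and \$):

    START, END = "\x02", "\x03"  # any symbols outside the name alphabet work

    def padded_ngrams(word, n):
        """n-grams with phantom start-of-word and end-of-word characters
        attached; this adds information only when n > 1."""
        return ngrams(START + word + END, n)

    padded_ngrams("David", 3)  # ['\x02Da', 'Dav', 'avi', 'vid', 'id\x03']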

We recomputed our training data with initial and terminal clusters; the results appear in Figure 3. The magnitude of the improvement increases with $n$, and for $n = 5$ the improvement is large enough to make $n = 5$ better than $n = 4$ (AUC 0.843 for $n = 5$ vs. 0.836 for $n = 4$).

4.2 Including missing features

It may happen that a name in our validation set contains an n-gram $w$ that does not appear in our training set. There are two standard ways of dealing with this occurrence [17]: either ignore the feature $w$ entirely (which is equivalent to assuming that $\theta_{w0} = \theta_{w1}$, i.e., the probabilities of $w$ appearing in each class are identical), or estimate values for $\theta_{wy}$. Until this point we have chosen the former; we now investigate the latter.

Intuitively, we expect an n-gram that is so rare as to not appear in our training set to be more likely to belong to a spammer than an n-gram selected at random. Indeed, this is what the data from the validation set show: the proportion of missing features that belong to spam accounts ranges between 76% and 87% as $n$ varies. The data also support our hypothesis that the reason $n = 5$ does not outperform $n = 4$ is that many more features are missing (8.8% vs. 3.0%) and thus were ignored in our initial analysis. The proportions of missing features for each $n$ can be seen in Figure 4.

Figure 4: Proportion of missing features in validation set, broken down by spam label.

To estimate parameters for missing features, we take the approach of considering all missing features (for a fixed value of $n$) as a single n-gram $\tilde{w}$. We compute $\theta_{\tilde{w}y}$ by splitting the data in half and labeling n-grams that appear only in one half as “missing.” More precisely, we do the following:

Algorithm 4.1. Input: a set $\chi$ of training samples $W$, labeled with classes $y$, and a fixed integer $n$. Output: parameter estimates for missing values.

1. Randomly assign each training sample $W \in \chi$ to one of two sets, $A$ or $B$, each with probability 1/2.

2. Define $g(A) = \bigcup_{W \in A} \text{n-grams}(W)$ and $g(B) = \bigcup_{W \in B} \text{n-grams}(W)$.

3. For each class $y$, define
$$ N_{\tilde{w},y} = \sum_{w \in g(A),\, w \notin g(B)} N_{w,y} \;+\; \sum_{w \in g(B),\, w \notin g(A)} N_{w,y}. $$

4. Compute $\theta_{\tilde{w}y}$ using (2.5).
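A sketch of Algorithm 4.1 under the same assumptions as the earlier snippets (it reuses our ngram_counts helper, which also drops singletons, a simplification):

    import random

    def missing_feature_counts(names, labels, n, seed=0):
        """Counts N_{w,y} for the single 'missing feature' pseudo-n-gram:
        halve the training set, then sum per class the counts of n-grams
        appearing in only one half (steps 1-3 of Algorithm 4.1)."""
        rng = random.Random(seed)
        halves = [([], []), ([], [])]             # (names, labels) for A, B
        for name, y in zip(names, labels):
            part = halves[rng.randrange(2)]
            part[0].append(name)
            part[1].append(y)
        counts = [ngram_counts(ns, ys, n) for ns, ys in halves]
        seen = [set(c[0]) | set(c[1]) for c in counts]   # g(A), g(B)
        result = [0, 0]
        for here, there in ((0, 1), (1, 0)):
            for w in seen[here] - seen[there]:
                for y in (0, 1):
                    result[y] += counts[here][y][w]
        return result   # plug into (2.5) in place of N_{w,y}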

When we include this “missing feature” feature, the AUC values for $n \le 4$ do not change noticeably, but the AUC for $n = 5$ increases from 0.843 to 0.849.

4.3 Recursively iterating the classifier

The above method incorporates a “one-size-fits-all” approach to estimating parameters for missing features. We can do better by observing that even if a given n-gram is not in our training set, the two component $(n-1)$-grams may appear in the training set. If one or both of these $(n-1)$-grams are missing, we can go to $(n-2)$-grams, and so on. We thus propose the following recursive definition for estimating the parameter $\theta_{wy}$:

Definition 4.2. Let $w$ be a character string of length $n$. Let $\tilde{\theta}_{wy}$ be a parameter estimate for missing values of 1-grams (as computed by Algorithm 4.1 with $n = 1$), and let $\alpha_{w,y}$ be a smoothing parameter. We define $\hat{\theta}_{wy}$ as follows:

$$ \hat{\theta}_{wy} = \begin{cases} \dfrac{N_{w,y} + \alpha_{w,y}}{N_y + \sum_w \alpha_{w,y}} & \text{if } N_{w,y} > 0, \\[2ex] \prod_{w' \in (n-1)\text{-grams}(w)} \hat{\theta}_{w'y} & \text{if } N_{w,y} = 0 \text{ and } n > 1, \\[2ex] \tilde{\theta}_{wy} & \text{if } N_{w,y} = 0 \text{ and } n = 1. \end{cases} $$

We observe that under this formulation, if two consecutive n-grams are missing, then the overlapping $(n-1)$-gram appears twice in the product defining $\hat{\theta}_{wy}$. At the extreme, if a particular character does not appear in the training set, the “missing feature” parameter $\tilde{\theta}_{wy}$ is incorporated $2^{n-1}$ times in the product.
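Under the same assumptions as the earlier sketches, the recursive estimate might be implemented as follows (our code; math.prod requires Python 3.8+):

    import math

    def theta_hat(w, y, counts, totals, theta_missing, alpha=0.1):
        """Definition 4.2 sketch: estimate theta_{wy}, backing off to the
        two component (n-1)-grams when the n-gram w is unseen in training.

        counts[y]        -- Counter of n-gram occurrences in class y
        totals[y]        -- N_y, the total feature count for class y
        theta_missing[y] -- Algorithm 4.1 estimate for missing 1-grams
        """
        if counts[y][w] > 0:
            # Smoothed estimate (2.5) with constant alpha; the sum over
            # the vocabulary is approximated as alpha * len(counts[y]).
            return (counts[y][w] + alpha) / (totals[y] + alpha * len(counts[y]))
        if len(w) == 1:
            return theta_missing[y]
        # The two component (n-1)-grams of w are w[:-1] and w[1:].
        return math.prod(theta_hat(v, y, counts, totals, theta_missing, alpha)
                         for v in (w[:-1], w[1:]))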

When we use the formula of Definition 4.2 to compute the parameters $\hat{\theta}_{wy}$ for our classifier, the results for $n \le 4$ change negligibly, but the AUC for $n = 5$ improves from 0.849 to 0.854.

5. EVALUATING PERFORMANCE

5.1 Results on Test Set

We ran our algorithm on a sequestered test set of 97,965 accounts, of which 48,726 were labeled as spam and 49,239 were labeled as good. We tried two different sets of parameters: a “full” version using 5-grams that is informed by the above discussion, and a “lightweight” version using 3-grams that is designed to be faster and use less memory. In light of the above discussion we used the following parameters:

                       Full Version       Lightweight Version
  n                    5                  3
  α_{w,y}              0.1                0.1
  initial/terminal     yes                yes
  missing n-grams      (n-1)-grams        fixed estimate
                       (Definition 4.2)   (Algorithm 4.1)
  AUC                  0.852              0.803

Figure 5 gives histograms of scores for the two runs. As expected for the Naive Bayes classifier, scores cluster very close to 0 or 1. We see by comparing the histograms that scores computed by the lightweight algorithm are weighted more towards the high end than those computed by the full algorithm. For example, we have the following statistics for scores below 0.05 and above 0.95 (the leftmost and rightmost bars in the histograms, respectively):

  Score + Label           Full           Lightweight
  score < 0.05, spam      16,191 (33%)   12,525 (26%)
  score < 0.05, good      928 (1.9%)     433 (0.9%)
  score > 0.95, spam      19,259 (40%)   25,844 (53%)
  score > 0.95, good      44,966 (91%)   45,753 (93%)

(The percentages indicate the proportion of all samples with that label appearing in the given histogram bucket.) As a result, the lightweight version finds fewer false positive spam names and more false negatives.

Figure 5: Histograms of scores for full and lightweight models, broken down by spam label.

Precision and Recall.

In our setup, for a given score cutoff $c$, the precision function $\text{prec}(c)$ denotes the fraction of correct labels out of all accounts labeled as spam by our classifier, while the recall function $\text{rec}(c)$ denotes the fraction of correct labels out of all spam accounts. The solid lines in Figure 6 give precision–recall curves for the two versions of our algorithm when run against the sequestered test set. We observe that the algorithms perform similarly at the low end, where most samples are spam. The full algorithm differentiates itself in the middle of the range: as recall increases above 0.35 (corresponding to a score cutoff of about 0.1 for the full algorithm and 0.5 for the lightweight algorithm), the precision of the lightweight algorithm declines much more quickly than that of the full algorithm.

Figure 6: Precision–recall curves for full and lightweight models. Solid lines indicate original test set; dashed lines indicate test set corrected for mislabeled samples.
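Computing such curves is routine with standard tooling (again our choice of library, not the paper's):

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    # Low scores indicate spam in this paper's convention (Section 5.1),
    # so we use 1 - s(x) as the spam score; y_spam: 1 = spam, 0 = good.
    precision, recall, cutoffs = precision_recall_curve(
        y_spam, 1.0 - np.asarray(scores))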

F-score.

When trying to detect spammy names, it is typically the case that the cost of a false positive in spam detection (blocking a good user) is significantly higher than the cost of a false negative (leaving a fake account alone). A good user who is blocked represents lost engagement with the site — and potentially lost revenue — while a bad user who is not blocked based on a spammy name may be later caught and blocked based on some other signal. Furthermore, false positives are magnified by the fact that the vast majority of users are not spammers.

Thus to determine a reasonable score cutoff for our classifier, we consider an F-score that weights precision more heavily than recall. Specifically, given a parameter $\beta$ that indicates the ratio of importance of recall to importance of precision, we compute

$$ \max F_\beta = \max_{c \in [0,1]} \,(1 + \beta^2) \left( \frac{1}{\text{prec}(c)} + \frac{\beta^2}{\text{rec}(c)} \right)^{-1}. $$

(See [28, Ch. 7].) We computed $\max F_\beta$ scores for $\beta = 1, 1/2, 1/4, \ldots, 1/64$ for the full and lightweight algorithms and obtained the following results:

Full algorithm:

  β        max F_β   score cutoff   precision   recall
  1        0.769     0.999          0.727       0.817
  2^{-1}   0.803     0.966          0.862       0.629
  2^{-2}   0.876     0.646          0.921       0.490
  2^{-3}   0.921     0.127          0.943       0.370
  2^{-4}   0.954     3.49e-05       0.966       0.230
  2^{-5}   0.975     8.89e-11       0.980       0.155
  2^{-6}   0.975     1.21e-08       0.976       0.179

Lightweight algorithm:

  β        max F_β   score cutoff   precision   recall
  1        0.729     0.999          0.643       0.842
  2^{-1}   0.745     0.966          0.849       0.499
  2^{-2}   0.862     0.670          0.936       0.380
  2^{-3}   0.928     0.147          0.961       0.291
  2^{-4}   0.964     1.19e-02       0.978       0.201
  2^{-5}   0.978     5.75e-05       0.983       0.163
  2^{-6}   0.981     6.34e-05       0.983       0.164

We see from these computations that if precision is weighted much higher than recall ($\beta \le 1/8$), the two algorithms perform roughly identically. As recall gains in weight, we see the full algorithm begin to outperform the lightweight version.
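A direct sketch of this search over cutoffs, using the formula above (our code, with spam treated as the positive class and scores oriented as in the previous snippet):

    import numpy as np

    def max_f_beta(y_spam, spam_scores, beta):
        """Maximum over cutoffs of F_beta = (1 + beta^2) /
        (1/prec + beta^2/rec); beta < 1 weights precision more heavily."""
        y_spam = np.asarray(y_spam)
        spam_scores = np.asarray(spam_scores)
        n_spam = (y_spam == 1).sum()
        best = 0.0
        for c in np.unique(spam_scores):
            predicted = spam_scores >= c
            tp = np.sum(predicted & (y_spam == 1))
            if tp == 0:
                continue
            prec = tp / predicted.sum()
            rec = tp / n_spam
            best = max(best, (1 + beta**2) / (1 / prec + beta**2 / rec))
        return best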

5.2 Relative Feature Weights

In general, all it takes is one spammy n-gram to make a name spammy; for example, we would consider the name “XXXDavid” to be spam even though the substring “David” looks good. Thus for our classifier to perform well, bad features should be weighted more heavily than good features.

Our analysis suggests that this property holds intrinsically for our data set, and we believe it results from the nature of our training data. Specifically, the “spam” labels in our training set consist of all accounts that have been flagged as spam, not just those that have spammy names. Thus it is far more likely for a “good” n-gram such as Dav to be found on a spam account than it is for a “spammy” n-gram such as XXX to be found on a good account. In terms of the parameters that comprise the score $s(\vec{x})$, the absolute value of $\log(\theta_{\text{XXX},1}/\theta_{\text{XXX},0})$ will tend to be much larger than the absolute value of $\log(\theta_{\text{Dav},1}/\theta_{\text{Dav},0})$. As a result, even one spammy n-gram will not be counterbalanced by any number of good n-grams.

To evaluate this concept quantitatively, we bucketed scores produced by our lightweight algorithm by $\lfloor R(\vec{x}) \rfloor$ (see (2.4)) and computed the following quantities:

• the percentage of samples in a bucket with a “very bad” feature, defined as an n-gram $w$ for which $\log(\theta_{w1}/\theta_{w0}) < -1$,

• the percentage of samples in a bucket with a “very good” feature, defined as an n-gram $w$ for which $\log(\theta_{w1}/\theta_{w0}) > 1$.

The results appear in Figure 7. We see that as scores approach the middle from the extremes, the rate of very bad features stays high while the rate of very good features starts to decline sooner. For example, when $\lfloor R(\vec{x}) \rfloor = 5$, 75% of samples have a “very good” feature with $\log(\theta_{w1}/\theta_{w0}) > 1$. On the other hand, when $\lfloor R(\vec{x}) \rfloor = -5$, 94% of samples have a “very bad” feature with $\log(\theta_{w1}/\theta_{w0}) < -1$. We conclude that the “very bad” features contribute more to the ultimate score than the “very good” features.

Figure 7: Percentage of samples with at least one very good or very bad feature, bucketed by $\lfloor R(\vec{x}) \rfloor$.

5.3 False Positives

We manually reviewed all 928 names in our test set that were labeled “good” but were given a score of 0.05 or less by our full algorithm. We found that a majority of these samples (552, or 59%) were in fact spammy names that had not been labeled as such. Labeling these alleged false positive samples correctly significantly improves precision at the low end of the score spectrum. The dashed lines in Figure 6 show precision–recall plots for the test set with the labels corrected.

Among the manually reviewed false positives that we determined not to be spammy names, we observed a few recurring patterns:

• Mixed-language names. For example, a Chinese person may enter his or her name in Chinese characters and English transliteration. While LinkedIn does support names in multiple languages [19], users may not be aware of this feature or may want both languages to show simultaneously.

• Users interchanging the first and last name fields. Since our algorithm uses different features for first and last names, a name that looks good when entered in the correct fields may come up as spam when the fields are interchanged.

• Users entering readable names with extra accents and/or unusual characters. For example, a user might enter “David” as “∂aν1δ.” We speculate that these users are attempting to maintain a legitimate presence on the site while making it more difficult to search for their names.

• Users entering non-name information in the name fields, such as degrees, professional credentials, or military ranks.

In Section 7 we discuss some possible approaches to filtering out false positives in these categories.

5.4 False Negatives

Recall that the “spam” accounts in our training and test sets consist of accounts that had been flagged as fraudulent and/or abusive for any reason, not just for having a spammy name. Many of the spam accounts on LinkedIn have names that do not appear spammy, so we would not expect our classifier to detect these accounts. The histograms of Figure 5 support this assertion: while more than 90% of good accounts have scores greater than 0.95, either 40% (full algorithm) or 53% (lightweight algorithm) of the spam accounts in our test set have scores above that point and would thus be labeled as false negatives.

To further support this assertion, we manually reviewed a sample of 100 accounts in our test set that were labeled as spam but were given scores above 0.646 by the full algorithm and above 0.670 by the lightweight algorithm. (These cutoffs were chosen to maximize the $F_{1/4}$-score as discussed in Section 5.1.) Only seven of these accounts had names that were spammy in our judgment.

If we extrapolate this 93% rate of label errors among high-scoring accounts to the entire validation set above the cutoffs, then at these cutoffs we obtain recall 0.93 for the full algorithm and 0.90 for the lightweight algorithm (compared with 0.49 and 0.38, respectively, using the initial labels).

5.5 Running on Live Data

We implemented the lightweight version of our algorithm in Python and trained it on a subset of the dataset discussed in Section 3.2. Using Hadoop streaming, we ran the algorithm daily on new LinkedIn registrations throughout the month of May 2013. The lowest-scoring accounts were flagged as spam and sent to the Trust and Safety team for appropriate action. We ran our new algorithm in parallel with LinkedIn’s previous name-spam detection procedure, which was based upon regular expressions.

As of early July, 3.3% of the accounts labeled as spam by our new algorithm had had this label removed, while 7.0% of the accounts labeled as spam by the old algorithm had had their spam labels removed. LinkedIn deemed the new algorithm so effective that the old algorithm was retired by mid-June.

6. SCORING EMAIL ADDRESSES

Our algorithm can be applied to detect spam not only in names but also in other short strings. Email usernames in particular are well suited to this kind of scoring. While there is nothing about an email address that violates terms of service or degrades the value of a social network, some email addresses are still “spammier” than others. For example, a real user’s email address will often look like a name or a word in some language, while a user who just wants to get past a registration screen may sign up with “[email protected],” and a bot generating a series of accounts may sign up with email addresses consisting of random strings of characters at some email provider.

We experimented with email data by incorporating it into our algorithm in two ways: running the algorithm on email usernames alone, and running the algorithm on email usernames along with first and last names. (That is, we added n-grams for the email username to the feature set that included n-grams for first and last names.) For the latter input data, the max $F_{1/8}$-score is 0.936 (precision 0.966, recall 0.312), at score cutoff 0.013. Precision–recall curves appear in Figure 8.

Figure 8: Precision–recall curves for the algorithm on email usernames, on first/last names, and on all three feature sets.

By itself, our algorithm applied to email usernames is not too helpful in finding spammers. Precision drops off very quickly, and even at the very low end we have worse results than our name scoring algorithm: of the 6,149 accounts with email scores less than 0.05, we find 5,788 spam accounts and 361 good accounts. Correcting for the mislabeled accounts we found in Section 5.3 shifts only 28 labels from good to spam.

However, we find that our results improve when we use email scoring to supplement name scoring; the email features provide a sort of “tiebreaker” for scores that are otherwise borderline. In particular, adding email data improves results in the middle of the precision–recall range. Further work is needed to assess the magnitude of this effect and to investigate other ways of combining email scoring with name scoring.

7. CONCLUSIONS AND FURTHER WORK

We have presented a Naive Bayes classifier that detects spammy names based on the n-grams (substrings of $n$ characters) appearing in the name, and discussed techniques for improving and optimizing the algorithm. We tested “full” and “lightweight” versions of the algorithm on data from LinkedIn and showed that both were effective at catching spam names at high precision levels. The lightweight version becomes less effective more quickly than the full version as the score cutoff increases.

One important direction for further work is to reduce the algorithm’s false positive rate. In Section 5.3 we discussed some categories of false positives returned by the algorithm; here we present some possible approaches to detecting and eliminating them.

• Mixed-language names: if multiple languages are detected, one could try to parse each name field into language blocks and score each block separately. If the name field contains a real name in each language then both scores will be high.

• Switching name fields: one could try to score the first name as a last name and vice versa. If the name is actually spam then both permutations should score badly, while if the name is legitimate but the fields are switched, the scores of the permuted names should be high. The relative weighting of the permuted scores vs. the original scores would have to be determined experimentally; we don’t want to negate the advantage gained by scoring first and last names independently (cf. Figure 1).

• Unusual characters: one could try scoring a “reduced” name obtained by mapping Unicode characters to ASCII strings. Again we would want to ensure that this score was appropriately weighted so as not to negate the advantage of scoring on the entire Unicode alphabet.

• Non-name information: one could try to detect appropriate non-name information such as degrees or qualifications (as opposed to irrelevant information such as company names or job titles) by matching against a list. A better approach would be to have a separate field for such information, so the name field could be reserved for the name only.

Another direction for further work is to strengthen the adversarial model. Specifically, we have trained our classifier under the assumption that the distribution of spammy names is constant over time, while in practice it is likely that malicious spammers will change their name patterns when they find themselves getting blocked due to our classification. We expect that if this adaptation occurs at a significant scale, then the offenders will be caught by LinkedIn’s other spam-detection systems, and the name-spam model can then be retrained using the most recent data. To investigate these effects we plan to retrain the model at regular intervals and track the success rate of each iteration.

Acknowledgments.

The author thanks John DeNero for fruitful discussions and Sam Shah, Vicente Silveira, and the anonymous referees for helpful feedback on initial versions of this paper.

8. REFERENCES

[1] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, and C. D. Spyropoulos. An experimental comparison of Naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’00, pages 160–167, New York, NY, USA, 2000. ACM.

[2] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C. D. Spyropoulos, and P. Stamatopoulos. Learning to filter spam e-mail: A comparison of a naive Bayesian and a memory-based approach. In Proceedings of the Workshop “Machine Learning and Textual Information Access”, European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pages 1–13, 2000.

[3] F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida. Detecting spammers on Twitter. In CEAS 2010 — Seventh Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference, 2010.

[4] P. N. Bennett. Assessing the calibration of Naive Bayes posterior estimates. Technical report, DTIC Document, 2000. Available at http://www.dtic.mil/dtic/tr/fulltext/u2/a385120.pdf.

[5] A. Blum and T. M. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, pages 92–100, 1998.

[6] G. Brown, T. Howe, M. Ihbe, A. Prakash, and K. Borders. Social networks and context-aware spam. In Proceedings of the 2008 ACM Conference on Computer Supported Cooperative Work, CSCW ’08, pages 403–412, New York, NY, USA, 2008. ACM.

[7] Q. Cao, M. Sirivianos, X. Yang, and T. Pregueiro. Aiding the detection of fake accounts in large scale social online services. In NSDI, 2012.

[8] Q. Cao and X. Yang. SybilFence: Improving social-graph-based sybil defenses with user negative feedback. Duke CS Technical Report CS-TR-2012-05, available at http://arxiv.org/abs/1304.3819, 2013.

[9] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, ACL ’96, pages 310–318, Stroudsburg, PA, USA, 1996. Association for Computational Linguistics.

[10] P. Domingos and M. Pazzani. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In Machine Learning, pages 105–112. Morgan Kaufmann, 1996.

[11] P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2–3):103–130, 1997.

[12] R. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley and Sons, Inc., 1973.

[13] Facebook Inc. Statement of rights and responsibilities. http://www.facebook.com/legal/terms, accessed 17 Jul 2013.

[14] H. Gao, J. Hu, C. Wilson, Z. Li, Y. Chen, and B. Y. Zhao. Detecting and characterizing social spam campaigns. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, IMC ’10, pages 35–47, New York, NY, USA, 2010. ACM.

[15] Google Inc. User content and conduct policy. http://www.google.com/intl/en/+/policy/content.html, accessed 17 Jul 2013.

[16] J. Hovold. Naive Bayes spam filtering using word-position-based attributes. In CEAS, 2005.

[17] R. Kohavi, B. Becker, and D. Sommerfield. Improving simple Bayes. In Proceedings of the European Conference on Machine Learning, 1997.

[18] LinkedIn Corporation. Internal data, accurate as of 1 Jun 2013.

[19] LinkedIn Corporation. Creating or deleting a profile in another language. http://help.linkedin.com/app/answers/detail/a_id/1717, accessed 22 Jul 2013.

[20] LinkedIn Corporation. User agreement. http://www.linkedin.com/legal/user-agreement, accessed 17 Jul 2013.

[21] A. McCallum and K. Nigam. A comparison of event models for Naive Bayes text classification. In Proceedings of AAAI, 1998.

[22] V. Metsis, I. Androutsopoulos, and G. Paliouras. Spam filtering with Naive Bayes — which Naive Bayes? In CEAS, 2006.

[23] T. Mitchell. Machine Learning. McGraw-Hill, 1997.

[24] I. Rish. An empirical study of the Naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, volume 3, pages 41–46, 2001.

[25] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the AAAI Workshop, Madison, Wisconsin, Technical Report WS-98-05, pages 55–62. AAAI Press, 1998.

[26] K.-M. Schneider. A comparison of event models for Naive Bayes anti-spam e-mail filtering. In EACL, pages 307–314, 2003.

[27] A. K. Seewald. An evaluation of Naive Bayes variants in content-based learning for spam filtering. Intelligent Data Analysis, 11(5):497–524, 2007.

[28] C. J. van Rijsbergen. Information Retrieval. Butterworth-Heinemann, 2nd edition, 1979.

[29] G. Wang, T. Konolige, C. Wilson, X. Wang, H. Zheng, and B. Y. Zhao. You are how you click: Clickstream analysis for sybil detection. In Proceedings of the 22nd USENIX Security Symposium, pages 241–256, 2013.

[30] C. M. Zhang and V. Paxson. Detecting and analyzing automated activity on Twitter. In PAM, pages 102–111, 2011.

[31] H. Zhang. The optimality of Naive Bayes. In FLAIRS Conference, pages 562–567, 2004.