aframeworkforreal-timespamdetectionintwitter · aframeworkforreal-timespamdetectionintwitter himank...

1
A Framework for Real-Time Spam Detection in Twitter Himank Gupta, Mohd Saalim Jamal, Sreekanth Madisetty and Maunendra Sankar Desarkar Department of Computer Science and Engineering, Indian Institute of Technology Hyderabad, India Email: [cs16mtech01001, cs16mtech11024, cs15resch11006, maunendra]@iith.ac.in 1. Introduction Due to the increasing popularity of Twitter, this plat- form influences more number of spammers to gen- erate spam tweets. Twitter platform experienced a 355% growth of social spam during last 4 years. Figure 1: Growth of users on Twitter platform in last 7 years a a https://www.statista.com/statistics/282087/number-of-monthly-active-twitter- users 2. Motivation 1. It is crucial to detect Twitter spam as soon as possible in real-time because 90% of users might visit a new spam link before it gets blocked by the blacklist [1]. 2. We need to choose lightweight features that should be feasible to process a large number of tweets in very less time. -3 -2.5 -2 -1.5 -1 -0.5 1st Principal Component 10 4 -1 0 1 2 3 4 5 6 7 2nd Principal Component 10 4 Spam Non-Spam Figure 2: Scatter plot of dataset [2] showing distribution of two classes namely, spam(x) and non-spam(.) Figure 3: Use of Twitter platform on different devices a a https://www.statista.com/chart/1520/number-of-monthly-active-twitter-users/ 3. Extracted Features Feature Name Description account age The age (days) of an accounl since its creation until the time of sending the most recent tweet no_followers The number of followers of this Twitter user no_followee The number of followee/friends of this Twitter user no_userfavourites The number of favourites this Twitter user received no_lists The number of lists this Twitter user added no_tweets The number of tweets this Twitter user sent no_retweets The number of retweets this tweet no_hashtags The number of hashtags included in this tweet no_usermentions The number of user mentions included in this tweet no_urls The number of URLs included in this tweet no_chars The number of characters in this tweet no_digits The number of digits in this tweet no_non-ASCII_characters The number of non-ASCII characters in this tweet 4. Cumulative Distribution Functions of Features We investigated each feature’s characteristics of differentiating spam and non-spam tweet. Following figures shows the Cumulative Distribution Function (CDF) of three user based features and three message based features. Analysis of these features has showed us their discriminative power to detect Twitter spam. 10 0 10 1 10 2 10 3 10 4 10 5 10 6 No of Tweets 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 CDF Non-Spam Spam 10 0 10 1 10 2 10 3 10 4 10 5 10 6 No of Followings 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 CDF Non-Spam Spam 10 0 10 1 10 2 10 3 10 4 10 5 No of Lists 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 CDF Non-Spam Spam 0 20 40 60 80 100 120 140 No of Characters Per Tweet 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 CDF Non-Spam Spam 0 5 10 15 20 25 30 35 40 45 50 No of Digits per Tweet 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 CDF Non-Spam Spam 0 2 4 6 8 10 12 No of Hashtags per Tweet 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 CDF Non-Spam Spam 5. Proposed Work U S = Collection of unique words in the spam tweets’ text. U NS = Collection of unique words in the non-spam tweets’ text. For each word T in U S and U NS we calculate the following probablity values: P(T |U S )= # of Spam tweets that contain T total # of Spam tweets P(T |U NS )= # of Non-Spam tweets that contain T total # of Non-Spam tweets We calculate the information gain γ T for each word T as follows: γ T = P(T |U S ) P(T |U NS ) × log 10 P(T |U S ) P(T |U NS ) We use top-30 words based on information gain γ T along with the lightweight features described in section 3. Table 1: Top 10 Words Spam Words Non-Spam Words harvested rain tribez asleep coins rather collected college unfollower fell openfollow follback inspi dinos build bullshit smurf child brainy couch Table 2: Classification of Example Tweets S.No Sample Tweet with Feature Set With Bag-of-Words Model Without Bag-of-Words Model 1. I’ve collected 12,293 gold coins! http://t.co/MXyllUOlZa #android, #androidgames, #gameinsight (1944,11,19,0,0,13134,0,3,0,1,21,5,1) Spam Spam 2. Also, please go out and vote for your local councillors today #publicserviceannouncement (3419,923,753,674,33,29769,0,1,0,0,50,0,0) Non-Spam Non-Spam 3. Get Unlimited Followers!!!! https://t.co/EKqMEvvmGF (5,0,9,0,0,1,0,0,0,1,21,0,4) Spam Non-Spam 6. Results Table 3: Sampled Features Set Feature-Set Sampling Method Ratio (Spam:Non-Spam) 1 Use 43 features to train a model 1:2 2 Use Bag-of-Word to select features in libsvm format 1:2 3 Use Chao Chen’s [2] feature-set for comparison 1:1 Table 4: Performance evaluation on different Feature set Feature-set- 1 Feature-set- 2 Feature-set- 3 Classifier F-Measure Accuracy F-Measure Accuracy F-Measure Accuracy SVM with Kernel 86.18 85.95 84.28 83.88 79.9 79.1 Neural Network 90.56 91.65 - - 71.25 72.15 Gradient Boosting 75.81 85.84 - - 81.26 82.69 Random Forest 75.39 86.25 - - 93.6 92.9 7. Conclusions & Future Work 1. We present a novel framework for real-time spam detection in Twitter using top-30 words along with the lightweight features that outperformed the existing work [2] by 18%. 2. We will keep on updating our Bag-of-Words model based on new spam tweets to mitigate the "Spam-Drift" problem. 3. We are developing smart phone application and browser extension to detect Twitter Spam in real-time as 80% of Twitter users access Twitter via their mobile devices. 4. We observe in our dataset that 79% of spam tweets contain a malicious link. So we will also perform the URL crawl mechanism along with Frequent Pattern Mining of tweets’ text to distinguish Twitter spam in real-time. 8. References [1] Kurt Thomas, Chris Grier, Dawn Song, and Vern Paxson. Suspended accounts in retrospect: An analysis of twitter spam. In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference, IMC ’11, pages 243–258, New York, NY, USA, 2011. ACM. [2] C. Chen, J. Zhang, X. Chen, Y. Xiang, and W. Zhou. 6 million spam tweets: A large ground truth for timely twitter spam detection. In 2015 IEEE International Conference on Communications (ICC), pages 7065–7070, June 2015.

Upload: others

Post on 30-Jul-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: AFrameworkforReal-TimeSpamDetectioninTwitter · AFrameworkforReal-TimeSpamDetectioninTwitter Himank Gupta, Mohd Saalim Jamal, Sreekanth Madisetty and Maunendra Sankar Desarkar Department

AFramework forReal-TimeSpamDetection inTwitterHimank Gupta, Mohd Saalim Jamal, Sreekanth Madisetty and Maunendra Sankar Desarkar

Department of Computer Science and Engineering, Indian Institute of Technology Hyderabad, IndiaEmail: [cs16mtech01001, cs16mtech11024, cs15resch11006, maunendra]@iith.ac.in

1. IntroductionDue to the increasing popularity of Twitter, this plat-form influences more number of spammers to gen-erate spam tweets. Twitter platform experienced a355% growth of social spam during last 4 years.

Figure 1: Growth of users on Twitter platform in last 7 years a

ahttps://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users

2. Motivation1. It is crucial to detect Twitter spam as soon as possible

in real-time because 90% of users might visit a newspam link before it gets blocked by the blacklist [1].

2. We need to choose lightweight features that shouldbe feasible to process a large number of tweets in veryless time.

-3 -2.5 -2 -1.5 -1 -0.5

1st Principal Component 10 4

-1

0

1

2

3

4

5

6

7

2n

d P

rin

cip

al C

om

po

ne

nt

10 4

Spam

Non-Spam

Figure 2: Scatter plot of dataset [2] showing distribution of twoclasses namely, spam(x) and non-spam(.)

Figure 3: Use of Twitter platform on different devices a

ahttps://www.statista.com/chart/1520/number-of-monthly-active-twitter-users/

3. Extracted FeaturesFeature Name Description

account age The age (days) of an accounl since its creation until the time of sending the most recent tweetno_followers The number of followers of this Twitter userno_followee The number of followee/friends of this Twitter user

no_userfavourites The number of favourites this Twitter user receivedno_lists The number of lists this Twitter user added

no_tweets The number of tweets this Twitter user sentno_retweets The number of retweets this tweetno_hashtags The number of hashtags included in this tweet

no_usermentions The number of user mentions included in this tweetno_urls The number of URLs included in this tweet

no_chars The number of characters in this tweetno_digits The number of digits in this tweet

no_non-ASCII_characters The number of non-ASCII characters in this tweet

4. Cumulative Distribution Functions of FeaturesWe investigated each feature’s characteristics of differentiating spam and non-spam tweet.Following figures shows the Cumulative Distribution Function (CDF) of three user basedfeatures and three message based features. Analysis of these features has showed us theirdiscriminative power to detect Twitter spam.

10 0 10 1 10 2 10 3 10 4 10 5 10 6

No of Tweets

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

CD

F

Non-Spam

Spam

10 0 10 1 10 2 10 3 10 4 10 5 10 6

No of Followings

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

CD

F

Non-Spam

Spam

10 0 10 1 10 2 10 3 10 4 10 5

No of Lists

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

CD

F

Non-Spam

Spam

0 20 40 60 80 100 120 140

No of Characters Per Tweet

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

CD

F

Non-Spam

Spam

0 5 10 15 20 25 30 35 40 45 50

No of Digits per Tweet

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

CD

F

Non-Spam

Spam

0 2 4 6 8 10 12

No of Hashtags per Tweet

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

CD

F

Non-Spam

Spam

5. Proposed WorkUS = Collection of unique words in the spam tweets’ text. UN S = Collection of uniquewords in the non-spam tweets’ text.For each word T in US and UN S we calculate the following probablity values:

P(T |US) =# of Spam tweets that contain T

total # of Spam tweetsP(T |UN S) =

# of Non-Spam tweets that contain Ttotal # of Non-Spam tweets

We calculate the information gain γT for each word T asfollows:

γT =

∣∣∣∣ P(T |US)P(T |UN S)

× log10P(T |US)

P(T |UN S)

∣∣∣∣We use top-30 words based on information gain γT alongwith the lightweight features described in section 3.

Table 1: Top 10 WordsSpam Words Non-Spam Words

harvested raintribez asleepcoins rather

collected collegeunfollower fellopenfollow follback

inspi dinosbuild bullshitsmurf childbrainy couch

Table 2: Classification of Example TweetsS.No Sample Tweet with Feature Set With Bag-of-Words Model Without Bag-of-Words Model1. I’ve collected 12,293 gold coins! http://t.co/MXyllUOlZa #android, #androidgames, #gameinsight

(1944,11,19,0,0,13134,0,3,0,1,21,5,1) Spam Spam2. Also, please go out and vote for your local councillors today #publicserviceannouncement

(3419,923,753,674,33,29769,0,1,0,0,50,0,0) Non-Spam Non-Spam3. Get Unlimited Followers!!!! https://t.co/EKqMEvvmGF

(5,0,9,0,0,1,0,0,0,1,21,0,4) Spam Non-Spam

6. ResultsTable 3: Sampled Features Set

Feature-Set Sampling Method Ratio (Spam:Non-Spam)1 Use 43 features to train a model 1:22 Use Bag-of-Word to select features in libsvm format 1:23 Use Chao Chen’s [2] feature-set for comparison 1:1

Table 4: Performance evaluation on different Feature set

Feature-set- 1 Feature-set- 2 Feature-set- 3Classifier F-Measure Accuracy F-Measure Accuracy F-Measure AccuracySVM with Kernel 86.18 85.95 84.28 83.88 79.9 79.1Neural Network 90.56 91.65 - - 71.25 72.15Gradient Boosting 75.81 85.84 - - 81.26 82.69Random Forest 75.39 86.25 - - 93.6 92.9

7. Conclusions & Future Work1. We present a novel framework for real-time spam detection in Twitter using top-30 words

along with the lightweight features that outperformed the existing work [2] by 18%.2. We will keep on updating our Bag-of-Words model based on new spam tweets to mitigate

the "Spam-Drift" problem.3. We are developing smart phone application and browser extension to detect Twitter

Spam in real-time as 80% of Twitter users access Twitter via their mobile devices.4. We observe in our dataset that 79% of spam tweets contain a malicious link. So we will also

perform the URL crawl mechanism along with Frequent Pattern Mining of tweets’text to distinguish Twitter spam in real-time.

8. References[1] Kurt Thomas, Chris Grier, Dawn Song, and Vern Paxson. Suspended accounts in

retrospect: An analysis of twitter spam. In Proceedings of the 2011 ACM SIGCOMMConference on Internet Measurement Conference, IMC ’11, pages 243–258, New York,NY, USA, 2011. ACM.

[2] C. Chen, J. Zhang, X. Chen, Y. Xiang, and W. Zhou. 6 million spam tweets: Alarge ground truth for timely twitter spam detection. In 2015 IEEE InternationalConference on Communications (ICC), pages 7065–7070, June 2015.