BotGraph: Large Scale Spamming Botnet Detection


Post on 07-Jan-2016

Page 1: BotGraph: Large Scale Spamming Botnet Detection

1

BotGraph: Large Scale Spamming Botnet Detection

Yao Zhao

EECS Department

Northwestern University

Page 2: BotGraph: Large Scale Spamming Botnet Detection

2

Outline

• Motivation and Problem Definition

• BotGraph Algorithms
  – History based algorithm on Signup detection
  – Graph-based algorithm on login detection

• Parallel Implementation on DryadLINQ

• Detection Results

• Discussion

• Conclusion

Page 3: BotGraph: Large Scale Spamming Botnet Detection

3

Web-Account Abuse Attack

[Figure: web-account abuse attack. Labels: zombie, Server, Captcha solver, CAPTCHA text "RDSXXTD3", User/Pwd]

Page 4: BotGraph: Large Scale Spamming Botnet Detection

4

Problems and Challenges

• Web-account Abuse
  – Signup abuse
  – Spam sending

• Challenges
  – Accuracy requirement
  – Stealthy (in terms of spam sending)
  – Large scale attack and huge Hotmail log data

• Our Behavior-based Solutions
  – Correlate bot-users by their activities and identify the group properties
  – Design parallel algorithms on DryadLINQ to efficiently process large data

Page 5: BotGraph: Large Scale Spamming Botnet Detection

5

System Architecture

[Figure: system architecture.
Login pipeline: Login data => Graph generation => Login graph => Random graph based clustering => Suspicious clusters => Verification & prune (with Sendmail data) => Spamming botnets.
Signup pipeline: Signup data => EWMA based change detection => Aggressive signups => Verification & prune => Signup botnets.
Annotations: "Run on DryadLINQ clusters", "Output locally".]

Page 6: BotGraph: Large Scale Spamming Botnet Detection

6

Outline

• Motivation and Problem Definition

• BotGraph Algorithms
  – History based algorithm on Signup detection
  – Graph-based algorithm on login detection

• Parallel Implementation on DryadLINQ

• Detection Results

• Discussion

• Conclusion

Page 7: BotGraph: Large Scale Spamming Botnet Detection

7

History Based Change Detection

[Figure: time series annotated "Large prediction error" and "Back to normal"]

Page 8: BotGraph: Large Scale Spamming Botnet Detection

8

EWMA based Change Detection

• EWMA (Exponentially Weighted Moving Average)
  – $Y_t$: observation at time $t$; $S_t$: prediction at time $t$
  – $S_t = \alpha \cdot Y_{t-1} + (1 - \alpha) \cdot S_{t-1}$

• Large Prediction Error Implies Change (or Abnormal)
  – $E_t = Y_t - S_t$ (prediction error)
  – $R_t = Y_t / \max(S_t, \varepsilon)$ (relative prediction error)

• Apply EWMA based Change Detection to Signup Time Series of Each IP Address
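The EWMA recurrence and the two error measures on this slide can be sketched in a few lines of Python. This is only an illustration: the function name, the seeding of the first prediction with the first observation, and the values of alpha, epsilon, and both thresholds are assumptions, since the slide does not give them.

```python
def ewma_changes(counts, alpha=0.5, eps=1.0, err_threshold=10.0, rel_threshold=5.0):
    """Return the time indices whose signup count deviates sharply from the EWMA prediction.

    counts: per-interval signup counts for ONE IP address (Y_0, Y_1, ...).
    alpha, eps and both thresholds are illustrative values, not taken from the slides.
    """
    flagged = []
    s = counts[0]  # assumption: seed the prediction S_0 with the first observation
    for t in range(1, len(counts)):
        s = alpha * counts[t - 1] + (1 - alpha) * s  # S_t = a*Y_{t-1} + (1-a)*S_{t-1}
        err = counts[t] - s                          # E_t = Y_t - S_t
        rel = counts[t] / max(s, eps)                # R_t = Y_t / max(S_t, eps)
        if err > err_threshold and rel > rel_threshold:
            flagged.append(t)
    return flagged


# Hypothetical daily signup counts for a single IP address.
daily_signups = [2, 3, 2, 3, 2, 50, 3, 2]
print(ewma_changes(daily_signups))
```

Requiring both a large absolute error and a large relative error is one possible way to combine the two measures; it avoids flagging busy IPs on small relative swings and quiet IPs on tiny absolute ones.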

Page 9: BotGraph: Large Scale Spamming Botnet Detection

9

Outline

• Motivation and Problem Definition

• BotGraph Algorithms
  – History based algorithm on Signup detection
  – Graph-based algorithm on login detection

• Parallel Implementation on DryadLINQ

• Detection Results

• Discussion

• Conclusion

Page 10: BotGraph: Large Scale Spamming Botnet Detection

10

Normal and Bot-user Behaviors

• General Behaviors of Normal Users
  – Log in to Hotmail from home and/or office
  – The account shares IPs in one AS with others if dynamic IP is used

• General Behaviors of Bot-users
  – A pool of bots (e.g. thousands) and a pool of bot-users (e.g. hundreds of thousands)
    • Each bot hosts multiple bot-users
  – Each bot-user is assigned to different random bots every day
    • Fixed binding is not adopted now
  – A pair of bot-users has a large chance of sharing several different IPs in different ASes

Page 11: BotGraph: Large Scale Spamming Botnet Detection

11

User-user Graph

• Graph Model
  – A Hotmail account => a node
  – A pair of accounts sharing IPs => an edge

• Edge weight = number of different ASes the shared IPs belong to

• Consider only edges with weight > 1

• Key Observations
  – Bot-users form a giant connected component
  – Normal users do not form large connected components
  – Explained by random graph theory
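The graph model above can be illustrated with a small single-machine sketch. The record layout `(user, ip, asn)` and the function name are assumptions made for the example; the slides do not specify the log format.

```python
from collections import defaultdict
from itertools import combinations


def build_user_graph(logins, min_weight=2):
    """Build the weighted user-user graph from (user, ip, asn) login records.

    Edge weight = number of distinct ASes among the IPs two users share.
    Only edges with weight >= min_weight (i.e. weight > 1) are kept.
    """
    users_by_ip = defaultdict(set)
    as_of_ip = {}
    for user, ip, asn in logins:
        users_by_ip[ip].add(user)
        as_of_ip[ip] = asn

    shared_ases = defaultdict(set)  # (user_a, user_b) -> ASes of their shared IPs
    for ip, users in users_by_ip.items():
        for u, v in combinations(sorted(users), 2):
            shared_ases[(u, v)].add(as_of_ip[ip])

    return {pair: len(ases) for pair, ases in shared_ases.items()
            if len(ases) >= min_weight}


# Hypothetical records: a and b share IPs in two ASes; c shares only one IP.
logins = [
    ("a", "1.1.1.1", 100), ("b", "1.1.1.1", 100), ("c", "1.1.1.1", 100),
    ("a", "2.2.2.2", 200), ("b", "2.2.2.2", 200),
    ("a", "1.1.1.2", 100), ("b", "1.1.1.2", 100),
]
print(build_user_graph(logins))
```

Counting ASes rather than IPs means that two users behind the same dynamic-IP pool (several IPs, one AS) still get weight 1 and are dropped.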

Page 12: BotGraph: Large Scale Spamming Botnet Detection

12

Random Graph Theory

• Random Graph G(n, p)
  – n nodes; each pair of nodes has an edge with probability p

• Theorem
  – A graph generated by G(n, p) has average degree d = n · p.
  – If d < 1, then with high probability the largest component in the graph has size at most O(log n).
  – If d > 1, then with high probability the graph contains a giant component with size on the order of O(n).
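The threshold behavior in the theorem is easy to observe empirically. The following sketch samples G(n, p) directly and measures the largest component with a union-find structure; the function name, the seed, and the choice n = 1000 are assumptions for illustration only.

```python
import random


def largest_component_size(n, p, seed=0):
    """Sample a G(n, p) random graph and return the size of its largest component."""
    rng = random.Random(seed)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:  # each pair gets an edge with probability p
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj

    sizes = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values())


n = 1000
for d in (0.5, 2.0):  # average degree d = n * p, below and above the threshold
    print(d, largest_component_size(n, d / n))
```

For d = 0.5 the largest component should stay small relative to n, while for d = 2.0 it should cover a large constant fraction of the nodes.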

Page 13: BotGraph: Large Scale Spamming Botnet Detection

13

Typical Bot-user Graphs

• Strategy 1
  – Bot-user accounts are randomly assigned to bots.

• Strategy 2
  – Keeps a queue of the bot-users.
  – A bot comes online and gets the top k available (currently not used) bot-users in the queue.

• Strategy 3
  – Similar to the second case, except that there is no limit on the number of bot-users a bot can request for one day.
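Strategy 1 can be simulated to see why bot-users end up in one giant component. The sketch below is a rough illustration under assumptions not stated in the slides: each bot sits in its own AS (so the number of distinct shared bots stands in for the edge weight), every bot-user logs in once per day, and the population sizes are scaled down.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations


def simulate_random_assignment(n_users=1000, n_bots=50, days=10, seed=0):
    """Strategy 1: each day every bot-user is assigned to a uniformly random bot.

    Returns the size of the largest connected component among edges of weight >= 2.
    """
    rng = random.Random(seed)
    shared_bots = defaultdict(set)  # (u, v) with u < v -> bots both accounts used
    for _ in range(days):
        users_on_bot = defaultdict(list)
        for user in range(n_users):
            users_on_bot[rng.randrange(n_bots)].append(user)
        for bot, users in users_on_bot.items():
            for u, v in combinations(users, 2):  # users are appended in increasing order
                shared_bots[(u, v)].add(bot)

    parent = list(range(n_users))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (u, v), bots in shared_bots.items():
        if len(bots) >= 2:  # keep only edges with weight > 1
            parent[find(u)] = find(v)

    sizes = Counter(find(u) for u in range(n_users))
    return max(sizes.values())


print(simulate_random_assignment())
```

With these parameters a pair of bot-users shares two or more bots with probability of roughly 1 to 2 percent, which already gives an average degree far above 1, so nearly all bot-users fall into a single component.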

Page 14: BotGraph: Large Scale Spamming Botnet Detection

14

Typical Bot-user Graphs

– 10000 bot-users, 10-day activity, k = 20

Page 15: BotGraph: Large Scale Spamming Botnet Detection

15

Bot-user Detection Algorithm

• Issues
  – Different bot-user groups may be connected (in the graph with weight threshold 2)
    • Shared bots
    • Shared bot-users
  – No fixed weight threshold T
  – Exceptions: large connected components formed by normal users exist

• Detection Algorithm
  – Hierarchical algorithm to extract connected components
  – Pruning
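One way to read the hierarchical extraction step is as a recursion over the weight threshold T: find components at T = 2, then re-split each component using only edges of weight T + 1 or more, and so on. The sketch below follows that reading; `min_size` and `max_threshold` are hypothetical stopping parameters, and the dictionary-based tree is only one possible representation.

```python
def components(nodes, edges, threshold):
    """Connected components among `nodes`, using only edges with weight >= threshold."""
    adj = {n: set() for n in nodes}
    for (u, v), w in edges.items():
        if w >= threshold and u in adj and v in adj:
            adj[u].add(v)
            adj[v].add(u)
    seen, comps = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:  # iterative depth-first search
            x = stack.pop()
            comp.add(x)
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        comps.append(comp)
    return comps


def extract_hierarchy(nodes, edges, threshold=2, min_size=3, max_threshold=10):
    """Recursively split components by raising the weight threshold T."""
    tree = []
    for comp in components(nodes, edges, threshold):
        if len(comp) < min_size:  # assumption: ignore very small components
            continue
        children = []
        if threshold < max_threshold:
            children = extract_hierarchy(comp, edges, threshold + 1, min_size, max_threshold)
        tree.append({"threshold": threshold, "members": comp, "children": children})
    return tree


# Two tightly linked groups (weight 3) joined by one weaker edge (weight 2).
edges = {("a", "b"): 3, ("b", "c"): 3, ("a", "c"): 3,
         ("d", "e"): 3, ("e", "f"): 3, ("d", "f"): 3,
         ("c", "d"): 2}
for node in extract_hierarchy(set("abcdef"), edges):
    print(node["threshold"], sorted(node["members"]),
          [sorted(c["members"]) for c in node["children"]])
```

In this toy input the two groups are merged at T = 2 and separate at T = 3, which mirrors the "different bot-user groups may be connected" issue listed above.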

Page 16: BotGraph: Large Scale Spamming Botnet Detection

16

Hierarchical Connected Component Extraction

[Figure: tree of connected components. Root G; components A and B at T=2; C and D at T=3; E at T=4]

Page 17: BotGraph: Large Scale Spamming Botnet Detection

17

Exceptions: Connected Subgraphs of Normal Users

• Potential Reasons
  – Some web service providers log in to Hotmail accounts on behalf of users (e.g. Facebook, LinkedIn)
  – National proxies
  – Cell phones (e.g. iPhone)
  – Tor (onion routing)

• Solutions
  – Filter out some IPs
  – Prune potentially good connected components

Page 18: BotGraph: Large Scale Spamming Botnet Detection

18

Prune Good Groups

• Email Sending Frequency
  – Normal users: generally don't send many emails on average
  – Bot-users: to be efficient, send several spam emails every day

• Email Size
  – Normal users: random size
  – Bot-users: (currently) similar email size

Page 19: BotGraph: Large Scale Spamming Botnet Detection

19

Prune Good Groups

[Figure: two panels labeled "Bad" and "Good"]

Page 20: BotGraph: Large Scale Spamming Botnet Detection

20

Prune Good Groups

• Metrics
  – s1: the percentage of users who send more than 3 emails per day
  – s2: the percentage of users who send out emails with similar size (peak detection)

• Pruning
  – Threshold of s1 is 0.8 (conservative and wide margins around 0.8)
  – s2 is used in validation
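The s1-based pruning rule can be sketched as follows. The input layout (group name mapped to per-user average emails per day), the function names, and the use of `>=` at the 0.8 boundary are assumptions; only the "more than 3 emails per day" cutoff and the 0.8 threshold come from the slide.

```python
def s1_score(emails_per_day, min_emails=3):
    """Fraction of users in a group who send more than `min_emails` emails per day."""
    if not emails_per_day:
        return 0.0
    heavy = sum(1 for rate in emails_per_day.values() if rate > min_emails)
    return heavy / len(emails_per_day)


def prune_good_groups(groups, s1_threshold=0.8):
    """Keep only groups whose s1 reaches the threshold; the rest are treated as normal users."""
    return {name: users for name, users in groups.items()
            if s1_score(users) >= s1_threshold}


# Hypothetical candidate groups: user -> average emails sent per day.
groups = {
    "g1": {"u1": 10, "u2": 8, "u3": 5, "u4": 4, "u5": 6, "u6": 1},
    "g2": {"v1": 0.2, "v2": 1, "v3": 5},
}
print(sorted(prune_good_groups(groups)))
```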

Page 21: BotGraph: Large Scale Spamming Botnet Detection

21

Outline

• Motivation and Problem Definition

• BotGraph Algorithms
  – History based algorithm on Signup detection
  – Graph-based algorithm on login detection

• Parallel Implementation on DryadLINQ

• Detection Results

• Discussion

• Conclusion

Page 22: BotGraph: Large Scale Spamming Botnet Detection

22

Parallel Implementation on DryadLINQ

• EWMA Algorithm of Signup Abuse Detection
  – Partition data by IP (straightforward)

• Graph Construction
  – Two algorithms

• Connected Component Extraction
  – Divide and conquer

Page 23: BotGraph: Large Scale Spamming Botnet Detection

23

Connected Component Extraction

• Partitions of Edges
  – (User1, User2, weight)

(A, B) (D, G)

(B, C) (C, E)

(C, D) (E, F)

(B, G) (G, D)

Page 24: BotGraph: Large Scale Spamming Botnet Detection

24

Connected Component Extraction

Local algo (each row is one partition, before => after):

(A, B) (D, G)  =>  (A, B) (D, G)

(B, C) (C, E)  =>  (B, C) (B, E)

(C, D) (E, F)  =>  (C, D) (E, F)

(B, G) (G, D)  =>  (B, G) (B, D)

Page 25: BotGraph: Large Scale Spamming Botnet Detection

25

Connected Component Extraction

Merge and local algo (pairs of partitions are merged, then reduced again):

(A, B) (D, G) + (B, C) (B, E)  =>  (A, B), (A, C), (A, E), (D, G)

(C, D) (E, F) + (B, G) (B, D)  =>  (B, D), (B, C), (B, G), (E, F)

Page 26: BotGraph: Large Scale Spamming Botnet Detection

26

Connected Component Extraction

(A, B), (A, C), (A, E), (D, G) + (B, D), (B, C), (B, G), (E, F)  =>  (A, B), (A, C), (A, D), (A, E), (A, F), (A, G)
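The divide-and-conquer steps on the preceding slides can be reproduced on one machine. The sketch below is an interpretation of the example, not the DryadLINQ implementation: the "local algo" is taken to replace a partition's edges with star edges from the smallest-labeled node of each component, partitions are merged pairwise, and edge weights are omitted as in the slide example. The function names are hypothetical.

```python
def local_reduce(edges):
    """Replace a partition's edges with star edges (root, member) per component.

    The root of each component is its smallest node label.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            # Always attach the larger root under the smaller one.
            if ru < rv:
                parent[rv] = ru
            else:
                parent[ru] = rv
    return sorted((find(x), x) for x in list(parent) if find(x) != x)


def extract_components(partitions):
    """Reduce each partition locally, then merge partitions pairwise until one is left."""
    parts = [local_reduce(p) for p in partitions]
    while len(parts) > 1:
        merged = []
        for i in range(0, len(parts), 2):
            pair = parts[i] + (parts[i + 1] if i + 1 < len(parts) else [])
            merged.append(local_reduce(pair))
        parts = merged
    return parts[0]


# The four edge partitions from the slide example.
partitions = [
    [("A", "B"), ("D", "G")],
    [("B", "C"), ("C", "E")],
    [("C", "D"), ("E", "F")],
    [("B", "G"), ("G", "D")],
]
print(extract_components(partitions))
```

Because each reduced partition holds at most one edge per node, a partition never grows beyond the number of users, and M partitions need about log(M) merge rounds.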

Page 27: BotGraph: Large Scale Spamming Botnet Detection

27

Connected Component Extraction

• Analysis
  – M partitions and log(M) steps
  – Partition size ≤ N (number of users)
  – Overall communication overhead
    • O(N·log(M))
  – Computational overhead

Page 28: BotGraph: Large Scale Spamming Botnet Detection

28

Outline

• Motivation and Problem Definition

• BotGraph Algorithms
  – History based algorithm on Signup detection
  – Graph-based algorithm on login detection

• Parallel Implementation on DryadLINQ

• Detection Results

• Discussion

• Conclusion

Page 29: BotGraph: Large Scale Spamming Botnet Detection

29

Detection of Signup Abuse

Page 30: BotGraph: Large Scale Spamming Botnet Detection

30

Detection by User-user Graph

Page 31: BotGraph: Large Scale Spamming Botnet Detection

31

Validations

• Manual Check
  – Verified by the Hotmail group

• Comparison with Known Spamming Users
  – Hotmail accounts that received complaints

• Email Sending Patterns
  – Email size

• False Positive Estimation
  – Naming pattern
  – Signup time

Page 32: BotGraph: Large Scale Spamming Botnet Detection

32

Comparison with Complained Users

• Ks : known spammer accounts signed up in the studied month

• H : set of bot-users detected by EWMA

Page 33: BotGraph: Large Scale Spamming Botnet Detection

33

Comparison with Complained Users

• Ks : known spammer accounts that log in from at least 2 ASes

• L : set of bot-users detected by user-user graph

Page 34: BotGraph: Large Scale Spamming Botnet Detection

34

Validation of Sending Pattern

Page 35: BotGraph: Large Scale Spamming Botnet Detection

35

False Positive Estimation (1)

• Naming Pattern
  – Clear pattern in names of (current) bot-users
    • E.g. w9168d4dc8c5c25f9
  – Naming pattern score
    • The largest fraction of users that follow a single naming template from a regular expression pool
    • The regular expressions don't quite match normal user names
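The naming pattern score can be computed as below. The regular expressions in the pool are invented for the example (the first one is merely shaped after the sample name on the slide); the actual template pool is not given in the slides.

```python
import re


def naming_pattern_score(usernames, templates):
    """Largest fraction of usernames that fully match a single template in the pool."""
    if not usernames:
        return 0.0
    best = 0
    for pattern in templates:
        regex = re.compile(pattern)
        matches = sum(1 for name in usernames if regex.fullmatch(name))
        best = max(best, matches)
    return best / len(usernames)


# Hypothetical template pool and group; only the first name appears on the slide.
templates = [r"w[0-9a-f]{16}", r"[a-z]+\d{4}"]
group = ["w9168d4dc8c5c25f9", "w0123456789abcdef", "w00ff00ff00ff00ff", "alice.smith"]
print(naming_pattern_score(group, templates))
```

A group whose score is close to 1 is dominated by one template; members that do not match it are the candidates counted toward the false positive bound.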

Page 36: BotGraph: Large Scale Spamming Botnet Detection

36

False Positive Estimation (1)

• Naming Score

Page 37: BotGraph: Large Scale Spamming Botnet Detection

37

False Positive Estimation (1)

• Naming Score
  – A majority of the bot-user groups have naming pattern scores close to 1
  – A few small bot-user groups have scores lower than 95%
  – In total, 0.44% of identified bot-users do not strictly follow the naming templates of their corresponding groups.
  – Take this 0.44% as the false positive bound

Page 38: BotGraph: Large Scale Spamming Botnet Detection

38

False Positive Estimation (2)

• Signup dates of the detected bot-users
  – Conservatively treat all the accounts signed up before 2007 as legitimate
  – 0.08% of the detected bot-users were signed up before 2007
  – Among all the accounts in the 2008 dataset, about 59.1% were signed up before 2007
  – False positive rate
    • Assuming normal users' behaviors don't change
    • 0.08% / 59.1% = 0.13%
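The reasoning behind this estimate can be written out as a short derivation. It assumes, as the slide does, that accounts signed up before 2007 are legitimate and that a normal user is equally likely to be misclassified regardless of signup date; the symbol FP is introduced here for the overall false positive fraction among detected bot-users.

```latex
\[
\underbrace{\mathrm{FP} \times 59.1\%}_{\text{false positives signed up before 2007}} \approx 0.08\%
\quad\Longrightarrow\quad
\mathrm{FP} \approx \frac{0.08\%}{59.1\%} = \frac{0.0008}{0.591} \approx 0.00135
\]
```

That is roughly 0.135%, which the slide reports as 0.13%.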

Page 39: BotGraph: Large Scale Spamming Botnet Detection

39

Outline

• Motivation and Problem Definition

• BotGraph Algorithms
  – History based algorithm on Signup detection
  – Graph-based algorithm on login detection

• Parallel Implementation on DryadLINQ

• Detection Results

• Discussion

• Conclusion

Page 40: BotGraph: Large Scale Spamming Botnet Detection

40

Evasion

• Signup detection
  – Be stealthy

• Login detection
  – Fixed binding
    • Low utilization rate
    • Bot-accounts bound to one host are easy to group
  – Fixed AS assignment
    • Redefine the edge weight to consider IP prefix
    • Similar to fixed binding
  – Be stealthy (sending as few emails as a normal user)

Page 41: BotGraph: Large Scale Spamming Botnet Detection

41

Related Work

• Botnet Detection
  – Hard in general
  – HoneyNet

• Content-based Spam Detection
  – Bayesian filtering, AutoRE
  – Countermeasures: good words, image

• Behavior-based Spam Detection
  – SpamTracker

Page 42: BotGraph: Large Scale Spamming Botnet Detection

42

Conclusions

• BotGraph
  – History-based change detection on signup
  – Graph-based component to detect stealthy bot-user logins

• Parallel Algorithms on DryadLINQ
  – Fast processing of the huge Hotmail log

• Detection
  – Detected more than 26M bot-accounts in a two-month log
  – Low false positive rate

Page 43: BotGraph: Large Scale Spamming Botnet Detection

43

Q & A?

Thanks!