
Bayesian Bot Detection Based on DNS Traffic Similarity

Ricardo Villamarín-Salomón, José Carlos Brustoloni

Department of Computer Science, University of Pittsburgh

SAC '09, Proceedings of the 2009 ACM Symposium on Applied Computing

陳怡寧

Outline

• Introduction
• Bayesian method
• Methodology
• Experimental results
• Discussion and limitations
• Conclusion

Introduction -- Problem

• Many botnets have centralized command and control (C&C) servers with fixed IP addresses or domain names.

• In such botnets, bots can be detected by their communication with hosts whose IP address or domain name is that of a known C&C server.

• To evade detection, botmasters are increasingly obfuscating C&C communication, e.g., using fast-flux or P2P.

Introduction -- Goal

• Hypothesis:
– Regardless of obfuscation, commands tend to cause similar activities in bots belonging to the same botnet.
– Through these activities, bots can be distinguished from other hosts.

• Assume at least one bot in a botnet is known.

• Then use the Bayesian approach to find other hosts with similar DNS traffic.

[Figure: diagram not recovered. Labeled parties: normal host, normal DNS server, B1 (name servers), B2 (web servers), M (mothership). Labeled steps: (1) Query FQDN; (2) Ask B1; (3) Query B1; (4) Ask M how to answer; (5) Answer B2; (6) Answer B2; (7) HTTP GET; (8) GET redirection; (9) Response malicious website; (10) Download website. Annotation: Analyze domains queried.]

Bayesian method (1/4)

• B: blacklist (domain names of known C&C servers)

• DI: domain names queried by hosts in Hbl (hosts that queried names in the blacklist B)

• DN: domain names queried by hosts in H − Hbl

• Host sets shown in the diagram: Hbl; Hu (uninfected hosts); Hsq (infected hosts that are not in Hbl)

Bayesian method (2/4)

1. Assign a score to every query q ∈ Q indicating the probability that a host making it is infected.

2. Assign to each host a score that combines the scores of all the queries it made.

Bayesian method (3/4)

• qj: a query

• Ihi: whether the host hi is infected

• The probability that a host hi will send query qj
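
The formula on this slide did not survive extraction. A plausible reconstruction, assuming the per-query score Sh(qj) is the posterior probability that a host issuing qj is infected, is Bayes' rule. The likelihood terms P(qj | Ihi) are notation assumed here, not taken from the slide:

```latex
S_h(q_j) = P(I_{h_i}=1 \mid q_j)
         = \frac{P(q_j \mid I_{h_i}=1)\,P(I_{h_i}=1)}
                {P(q_j \mid I_{h_i}=1)\,P(I_{h_i}=1) + P(q_j \mid I_{h_i}=0)\,P(I_{h_i}=0)}
```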

Bayesian method (4/4)

• Assume P(Ihi=1) = 0.5

• An extreme case
– If the only host querying a given domain belongs to Hbl, Sh(qj) will be 1 (and 0 if that host does not belong to Hbl).
– So this value needs to be tuned…
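
The extreme case can be checked numerically. The sketch below assumes the score is the Bayes posterior with a prior of 0.5, so the prior terms cancel and only the two likelihoods matter. The function name and the way the likelihoods are supplied (for example, as the fraction of hosts in each group that issued the query) are illustrative assumptions, not details from the slides.

```python
def raw_query_score(p_query_given_infected: float,
                    p_query_given_clean: float,
                    prior_infected: float = 0.5) -> float:
    """Posterior probability that a host issuing the query is infected.

    Hypothetical helper: the likelihood estimates are assumed inputs.
    """
    numerator = p_query_given_infected * prior_infected
    denominator = numerator + p_query_given_clean * (1.0 - prior_infected)
    if denominator == 0.0:
        raise ValueError("query was never observed in either group")
    return numerator / denominator


# Extreme case: only a host in the blacklisted group issued the query.
only_blacklisted = raw_query_score(1 / 3, 0.0)
# Mirror case: only a host outside the blacklisted group issued it.
only_other = raw_query_score(0.0, 1 / 89)
# A query equally likely in both groups carries no information.
uninformative = raw_query_score(0.2, 0.2)
```

A single observation is enough to push the score to 1 or 0, which is why the value needs tuning before it is used.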

Beta distribution (1/2)

• The Beta distribution is a continuous probability distribution defined on the interval (0, 1) and parameterized by two positive shape parameters, α and β.

• The tuning calculation is based on
– Observed DNS traffic
– x: the a priori belief that a domain name that was never queried before will be queried by an infected host.
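
For reference, the standard Bayesian update behind this kind of tuning can be written out. With a Beta(α, β) prior on a success probability and s successes observed in n trials, the posterior is again a Beta distribution, and its mean blends the prior with the data. The symbol θ is introduced here only for illustration:

```latex
\theta \sim \mathrm{Beta}(\alpha, \beta), \qquad
\theta \mid s, n \sim \mathrm{Beta}(\alpha + s,\ \beta + n - s), \qquad
\mathbb{E}[\theta \mid s, n] = \frac{\alpha + s}{\alpha + \beta + n}
```

With no observations (n = 0) the estimate equals the prior mean α / (α + β).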

Beta distribution (2/2)

• n: the number of trials

• s: the number of successes involving q

• Nqj: the total number of times a query qj has been made during the traffic monitoring period.

• f = α + β, a constant interpreted as the strength we want to give to x.

• α = f · x

• With f = 1, x = 0.5, and Nqj = 0, the result is 0.5, which avoids extreme values.
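
The adjusted-score formula itself is missing from the slide. A minimal sketch, assuming it follows the adjustment used in Robinson's spam-filtering work, is the Beta posterior mean with α = f · x, n = Nqj, and s = Nqj · S(qj), which gives (f · x + Nqj · S(qj)) / (f + Nqj). The function and parameter names below are hypothetical.

```python
def adjusted_query_score(raw_score: float, n_queries: int,
                         strength: float = 1.0, prior: float = 0.5) -> float:
    """Shrink a raw query score toward the prior belief x.

    Assumed form: (f*x + N*S) / (f + N), where f is `strength`,
    x is `prior`, N is `n_queries`, and S is `raw_score`.
    """
    if strength <= 0:
        raise ValueError("strength f must be positive")
    return (strength * prior + n_queries * raw_score) / (strength + n_queries)


never_seen = adjusted_query_score(raw_score=1.0, n_queries=0)   # prior only
seen_once = adjusted_query_score(raw_score=1.0, n_queries=1)    # pulled halfway
seen_often = adjusted_query_score(raw_score=1.0, n_queries=99)  # data dominates
```

Under this form, a name seen only once no longer scores exactly 0 or 1, and the pull toward x fades as Nqj grows.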

Select indicators

• Previous studies [14][15] show that robust indicators are obtained by taking the geometric mean of the host’s most extreme S’h(q) values (closest to 0 and 1).

[14] Gary Robinson, “Spam Detection,” [Online] http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
[15] Greg Louis, “Bogofilter Calculations: Comparing Geometric Mean with Fisher’s Method for Combining Probabilities,” [Online] http://www.bgl.nu/bogofilter/fisher.html

Combined score

• N(h) and I(h) indicate how likely it is that a host is infected or non-infected, respectively.

• Combined score definition: C(h)

• Modify C(h) to obtain a score between 0 and 1.

• P(h) indicates our degree of belief that a host is infected.
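
The formulas for this slide were lost in extraction. The sketch below assumes the geometric-mean combination from Robinson's spam-filtering method: two evidence measures are built from the host's extreme adjusted scores, their normalized difference plays the role of C(h), and (1 + C) / 2 maps it to a score in [0, 1] that plays the role of P(h). The variable names `infected` and `clean`, the function name, and the default lower threshold of 0.1 are placeholders; only the upper threshold 0.95 appears in the slides.

```python
import math


def combined_host_score(scores, t_low=0.1, t_high=0.95):
    """Combine a host's per-query scores into one value in [0, 1].

    Assumes every score lies strictly between 0 and 1, which holds
    after the scores have been shrunk toward a prior.
    """
    # Keep only the most extreme indicators (close to 0 or close to 1).
    extreme = [s for s in scores if s <= t_low or s >= t_high]
    if not extreme:
        return 0.5  # no strong evidence either way
    n = len(extreme)
    # Geometric means are computed in log space for numerical stability.
    infected = 1.0 - math.exp(sum(math.log(1.0 - s) for s in extreme) / n)
    clean = 1.0 - math.exp(sum(math.log(s) for s in extreme) / n)
    c = (infected - clean) / (infected + clean)  # lies in [-1, 1]
    return (1.0 + c) / 2.0                       # rescaled to [0, 1]


likely_bot = combined_host_score([0.99, 0.98, 0.5])
likely_clean = combined_host_score([0.01, 0.02, 0.5])
no_evidence = combined_host_score([0.5, 0.6])
```

A host would then be flagged when its combined score exceeds a chosen threshold.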

Methodology

• In these experiments, the authors use two sets of hosts:
(1) computers known with certainty to be infected (variants of the same bot are run on computers under the authors' control to collect the DNS traffic of infected hosts);
(2) hosts confidently known to be uninfected.

• In the infected host set, we altered the traces of some hosts so that those hosts are masked (hosts whose traces are unmodified are unmasked hosts).

• We apply the Bayesian method to the merged traces and observe
(1) which uninfected hosts were classified as such, and
(2) which masked hosts were identified as infected, based on non-blacklisted names that both masked and unmasked hosts queried.

Blacklist and Bot Specimens

• Malware samples: MWCollect

• Blacklist of C&C servers: Shadowserver

• Bot selection
– Had the same name in both VirusTotal and Kaspersky antivirus
– Contacted the same known C&C server
– Had distinct MD5 signatures

• Backdoor.Win32.SdBot.cmz

• Net-Worm.Win32.Bobic.k

DNS Data Collection

• Uninfected hosts
– CSL-1: 89 PCs in instructional laboratories of the University of Pittsburgh, February 13-14, 2008
– CSL-2: 89 PCs in instructional laboratories of the University of Pittsburgh, February 14-15, 2008

• Infected hosts
– sandnet + a DNS server + bot specimens

Test Traces

• Altered traces: blacklisted names are obfuscated by appending a non-existent ccTLD (.nv) to each of them.

• SdBot-V1-1-T : the traces of all infected hosts except SdBot-V1-1 are altered.
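
The masking step can be sketched as a simple rewrite of each host's query trace. The function name, the list-of-names trace format, and the example domains are hypothetical; only the ".nv" suffix comes from the slide.

```python
def mask_trace(queried_names, blacklist, suffix=".nv"):
    """Return a copy of a host's query trace with blacklisted names obfuscated."""
    return [name + suffix if name in blacklist else name
            for name in queried_names]


# Hypothetical example data, not taken from the experiments.
blacklist = {"cc.example.org"}
trace = ["cc.example.org", "update.example.com", "cc.example.org"]
masked = mask_trace(trace, blacklist)
```

After masking, the host no longer matches the blacklist, so it can only be found through the non-blacklisted names it shares with unmasked hosts.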

Evaluation Metrics

• Recall, or True Positive Rate (TPR)

• False Positive Rate
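
The formulas on this slide were not recovered. The standard definitions, with TP, FN, FP, and TN denoting the counts of true positives, false negatives, false positives, and true negatives, are:

```latex
\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad
\mathrm{FPR} = \frac{FP}{FP + TN}
```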

Experimental Results

• We wanted to find parameters that could yield good classification results with trace CSL-1-SdBot-T, and then see if these same parameters were effective in trace CSL-2-Bobic.k-T.

• We set Th=0.95, P(Ih)=0.5, and threshold of P(h) to be 0.9.

• How about Tl?

Selecting Tl

FPR & TPR

True Positive

• TP is caused by the name ad.doubleclick.net, which was queried by 0.87% of the uninfected hosts and by the only misclassified masked host.

CSL-2-Bobic.k-T

Discussion and Limitations

• FP occurs:
– If the parameters are not well tuned
– If a domain name is queried only by infected hosts and one or a few of the uninfected hosts.

• FN occurs:
– If the parameters are not well tuned
– When domain names that are very popular during a time period are queried by both infected and uninfected hosts.

Conclusion

• Proposed and evaluated a Bayesian method for botnet detection.

• In this study, we found that the technique successfully recognized C&C servers with multiple domain names, while at the same time generating few or no false positives.

Comments

• The sample size of DNS traffic of infected hosts is too small.

• Are the parameters of the Bayesian method really suitable for all kinds of bots?

• We can use the bots found by M8000 as seeds and collect DNS traffic to find other unspecified infected hosts.