Measurement and Classification of Humans and Bots in Internet Chat

Jhih-sin Jheng, 2009/09/01, Machine Learning and Bioinformatics Laboratory



Reference
  Steven Gianvecchio, Mengjun Xie, Zhenyu Wu, and Haining Wang. "Measurement and Classification of Humans and Bots in Internet Chat." Department of Computer Science, The College of William and Mary. USENIX Security, 2008.

Outline
  Background
  Measurement
  Classification System
  Experimental Evaluation
  Conclusion


Chat Bots vs. BotNets
  BotNets – networks of compromised machines
    some use chat systems (IRC) for C&C; others use P2P, HTTP, etc.
    abuse various systems
  Chat Bots – automated chat programs
    some are helpful, e.g., chat loggers
    can abuse chat systems and their users
      send spam, spread malicious software, mount phishing attacks
  Our focus is on the Yahoo! Chat system.


Measurement
  August-November 2007 – we collect data
  August 2007 – Yahoo! adds CAPTCHA
    very few chat bots
  October 2007 – bots are back

Measurement
  August and November 2007
    many chat bots
    1,440 hours of chat logs
    147 chat logs
    21 chat rooms

Measurement
  To create our dataset, we read and label the chat users as human, bot, or ambiguous
  In total, we recognized 14 different types of chat bots
    different triggering mechanisms
    different text generation techniques

Types of Chat Bots
  Periodic Bots – send messages based on periodic timers
  Random Bots – send messages based on random timers
  Responder Bots – respond to messages of other users
  Replay Bots – replay messages of other users

Humans
  inter-message delay – evidence of a heavy tail
  message size – well fit by Exponential (λ=0.034)
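
As a rough illustration of what "well fit by Exponential (λ=0.034)" implies: for an exponential distribution the maximum-likelihood rate is the reciprocal of the sample mean, so λ=0.034 corresponds to a mean message size of about 29.4 (1/0.034). The sketch below recovers the rate from synthetic sizes. The function name and the synthetic data are assumptions for illustration; this is not the original measurement code or data.

```python
import random


def fit_exponential_rate(sizes):
    """Maximum-likelihood rate of an exponential fit: 1 / sample mean."""
    if not sizes:
        raise ValueError("need at least one message size")
    return len(sizes) / sum(sizes)


# Synthetic stand-in for human message sizes (NOT the Yahoo! chat data).
rng = random.Random(42)
synthetic_sizes = [rng.expovariate(0.034) for _ in range(20000)]

rate_hat = fit_exponential_rate(synthetic_sizes)
print(f"estimated rate: {rate_hat:.4f}")
```

With enough samples the estimate should land close to the rate used to generate the data, which is the sense in which a single parameter summarizes the human message-size distribution.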

Periodic Bots
  inter-message delay – several clusters with high probabilities
  message size – messages built from templates approximate a normal distribution

Random Bots
  inter-message delay – Equilikely distribution at 40, 64, and 88; Uniform distribution 45-125
  message size – messages selected from a small database
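
The two random-timer behaviours described for these bots can be sketched as follows. The delay values 40, 64, 88 and the range 45-125 come from the slide (which does not state the time unit); the function names and the seed are assumptions made only for this illustration.

```python
import random

EQUILIKELY_DELAYS = (40, 64, 88)   # values quoted on the slide
UNIFORM_RANGE = (45, 125)          # range quoted on the slide


def equilikely_delays(n, rng):
    """Each delay is one of the listed values, chosen with equal probability."""
    return [rng.choice(EQUILIKELY_DELAYS) for _ in range(n)]


def uniform_delays(n, rng):
    """Each delay is drawn uniformly from the quoted range."""
    return [rng.uniform(*UNIFORM_RANGE) for _ in range(n)]


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
print(equilikely_delays(5, rng))
print(uniform_delays(5, rng))
```

Either generator produces delays that look irregular at first glance, yet the equilikely variant only ever emits three distinct values, which is the kind of regularity a timing-based test can pick up.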

Responder Bots
  inter-message delay – human-like timing
  message size – multiple templates of different lengths

Replay Bots
  inter-message delay – cluster with high probabilities (replay bots are periodic)
  message size – human-like size, well fit by Exponential (λ=0.028)


Classification System
  Entropy Classifier
    detects abnormal behavior
    based on message sizes and inter-message delays
    accurate but slow
  Machine Learning Classifier
    detects “learned” patterns
    based on message content
    fast but must be trained

Entropy Classifier
  Observation – chat bots are less complex than humans, and thus, lower in entropy
    exploits the low entropy of chat bots
  Corrected Conditional Entropy Test (CCE)
    estimates higher-order entropy
  Entropy Test (EN)
    estimates first-order entropy
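
A minimal sketch of the two tests, under stated assumptions. The slides do not give the binning scheme or pattern length, so the five equal-width bins and the pattern length of 2 used here are assumptions. The CCE formula follows the commonly cited definition: the conditional entropy of length-m patterns plus a correction term equal to the fraction of patterns seen only once times the first-order entropy. All function names and both delay sequences are made up for illustration.

```python
import math
import random
from collections import Counter


def bin_values(values, num_bins=5):
    """Map raw values to equal-width bin indices 0..num_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / num_bins
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


def pattern_counts(symbols, m):
    """Count every length-m window (pattern) in the symbol sequence."""
    return Counter(tuple(symbols[i:i + m]) for i in range(len(symbols) - m + 1))


def pattern_entropy(symbols, m):
    """Shannon entropy, in bits, of the length-m patterns."""
    counts = pattern_counts(symbols, m)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def first_order_entropy(symbols):
    """EN: entropy of single symbols."""
    return pattern_entropy(symbols, 1)


def corrected_conditional_entropy(symbols, m):
    """CCE: conditional entropy plus a correction for patterns seen once."""
    if m < 2:
        return first_order_entropy(symbols)
    counts = pattern_counts(symbols, m)
    total = sum(counts.values())
    unique_fraction = sum(1 for c in counts.values() if c == 1) / total
    conditional = pattern_entropy(symbols, m) - pattern_entropy(symbols, m - 1)
    return conditional + unique_fraction * first_order_entropy(symbols)


# Made-up delays: a strictly alternating timer versus widely varying delays.
rng = random.Random(7)
timer_symbols = bin_values([60.0, 90.0] * 100)
varied_symbols = bin_values([rng.uniform(1, 120) for _ in range(200)])

for name, symbols in (("timer", timer_symbols), ("varied", varied_symbols)):
    en = first_order_entropy(symbols)
    cce = corrected_conditional_entropy(symbols, 2)
    print(f"{name}: EN={en:.3f} CCE={cce:.3f}")
```

The alternating timer scores lower on both measures than the varied sequence, and its CCE is close to zero because each delay is fully predictable from the previous one. The slides do not state decision thresholds, so none are assumed here.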

Machine Learning Classifier
  Observation – chat spam, like email spam, is a text classification problem
    exploits message content of chat bots
  CRM114
    a powerful text classification system
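
To make the "text classification problem" framing concrete, here is a tiny multinomial naive Bayes classifier with add-one smoothing. This is a stand-in chosen for brevity: it is not CRM114 and does not reproduce CRM114's algorithms or interface. The class name, labels, and all training messages are invented for illustration.

```python
import math
from collections import Counter


class NaiveBayesTextClassifier:
    """Tiny multinomial naive Bayes (a stand-in, not CRM114).

    Both labels must be trained at least once before classify() is called.
    """

    LABELS = ("bot", "human")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def _log_score(self, words, label):
        counts = self.word_counts[label]
        vocab = set(self.word_counts["bot"]) | set(self.word_counts["human"])
        total = sum(counts.values())
        prior = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        # Add-one smoothing keeps unseen words from zeroing out a label.
        return prior + sum(
            math.log((counts[w] + 1) / (total + len(vocab))) for w in words
        )

    def classify(self, text):
        words = text.lower().split()
        return max(self.LABELS, key=lambda label: self._log_score(words, label))


clf = NaiveBayesTextClassifier()
clf.train("free webcam click here http://example.invalid", "bot")
clf.train("click here for free pics", "bot")
clf.train("hey how are you doing today", "human")
clf.train("lol that movie was great", "human")

print(clf.classify("click here for free webcam"))
print(clf.classify("how are you"))
```

The point of the sketch is the dependence on training data: a content-based classifier of this kind can only recognize wording it has already seen labeled as bot or human.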

Hybrid Classification System
  entropy classifier builds and maintains the bot corpus
  machine learning classifier uses the bot and human corpora
  [Diagram: INPUT, ENTROPY CLASSIFIER, MACHINE LEARNING CLASSIFIER, BOT CORPUS, HUMAN CORPUS, CLASSIFY AS CHAT BOT, CLASSIFY AS HUMAN]
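
One possible reading of the hybrid arrangement, sketched with pluggable classifiers. The slide states only that the entropy classifier builds and maintains the bot corpus and that the machine learning classifier uses both corpora; the exact decision order below (content check first, entropy check second) is an assumption, as are the class name, the callable interfaces, and the toy stub classifiers.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class HybridClassifier:
    # Hypothetical interfaces: each callable returns True for "bot".
    entropy_is_bot: Callable[[List[str]], bool]
    ml_is_bot: Callable[[List[str], List[str], List[str]], bool]
    bot_corpus: List[str] = field(default_factory=list)
    human_corpus: List[str] = field(default_factory=list)

    def classify(self, messages: List[str]) -> str:
        # Known bots: caught from message content via the corpora.
        if self.ml_is_bot(messages, self.bot_corpus, self.human_corpus):
            return "bot"
        # Unknown bots: caught by the entropy test, then added to the
        # bot corpus so the content-based classifier knows them next time.
        if self.entropy_is_bot(messages):
            self.bot_corpus.extend(messages)
            return "bot"
        return "human"


# Toy stubs, for illustration only.
def stub_entropy(messages):
    return len(messages) > 1 and len(set(messages)) == 1


def stub_ml(messages, bot_corpus, human_corpus):
    return any(m in bot_corpus for m in messages)


hybrid = HybridClassifier(entropy_is_bot=stub_entropy, ml_is_bot=stub_ml)
first = hybrid.classify(["buy now"] * 3)        # unknown bot, entropy path
second = hybrid.classify(["buy now", "hello"])  # now known, content path
third = hybrid.classify(["hi", "how are you"])
print(first, second, third)
```

In this arrangement the entropy path supplies labeled bot text without manual labeling, and the content path then handles repeat offenders.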


Experimental Evaluation
  Types of Chat Bots
    Periodic Bots
    Random Bots
    Responder Bots
    Replay Bots
  Classifiers
    entropy classifier – 100 messages
    machine learning classifier – 25 messages

Experimental Evaluation
  Classification Tests
    Ent – entropy classifier
    SupML – fully-supervised ML classifier, trained on AUG BOTS
    SupMLre – fully-supervised ML classifier, retrained on NOV BOTS
    EntML – entropy-trained ML on AUG BOTS

Entropy Classifier
            AUG BOTS                      NOV BOTS
  test      periodic  random    respond   periodic  random    replay    human
            TP        TP        TP        TP        TP        TP        FP
  EN(imd)   121/121   68/68     1/30      51/51     109/109   40/40     7/1713
  CCE(imd)  121/121   49/68     4/30      51/51     109/109   40/40     11/1713
  EN(ms)    92/121    7/68      8/30      46/51     34/109    0/40      7/1713
  CCE(ms)   77/121    8/68      30/30     51/51     6/109     0/40      11/1713
  OVERALL   121/121   68/68     30/30     51/51     109/109   40/40     17/1713

  EN – entropy
  CCE – corrected conditional entropy
  (imd) – inter-message delay
  (ms) – message size


EN(imd) and CCE(imd)
  problems against responder bots
  detect most other chat bots


EN(ms) and CCE(ms)
  problems against random and replay bots
  detect most other chat bots


OVERALL
  detects all chat bots
  false positive rate is ~0.01
  100 messages
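
The rates quoted for the entropy classifier can be checked directly from the counts in the table. The sketch below does that arithmetic; the dictionary layout and function names are assumptions, and the pooled detection rate across all six bot groups is an extra summary that does not appear on the slides.

```python
# (hits, total) per bot group, in table order:
# AUG periodic, AUG random, AUG respond, NOV periodic, NOV random, NOV replay.
# "fp" is (humans flagged as bots, total humans).
ENTROPY_RESULTS = {
    "EN(imd)":  {"tp": [(121, 121), (68, 68), (1, 30), (51, 51), (109, 109), (40, 40)], "fp": (7, 1713)},
    "CCE(imd)": {"tp": [(121, 121), (49, 68), (4, 30), (51, 51), (109, 109), (40, 40)], "fp": (11, 1713)},
    "EN(ms)":   {"tp": [(92, 121), (7, 68), (8, 30), (46, 51), (34, 109), (0, 40)], "fp": (7, 1713)},
    "CCE(ms)":  {"tp": [(77, 121), (8, 68), (30, 30), (51, 51), (6, 109), (0, 40)], "fp": (11, 1713)},
    "OVERALL":  {"tp": [(121, 121), (68, 68), (30, 30), (51, 51), (109, 109), (40, 40)], "fp": (17, 1713)},
}


def detection_rate(test):
    """Pooled true positive rate over all bot groups for one test."""
    hits = sum(h for h, _ in ENTROPY_RESULTS[test]["tp"])
    total = sum(t for _, t in ENTROPY_RESULTS[test]["tp"])
    return hits / total


def false_positive_rate(test):
    """Fraction of humans that the test flagged as bots."""
    flagged, humans = ENTROPY_RESULTS[test]["fp"]
    return flagged / humans


for name in ENTROPY_RESULTS:
    print(f"{name:9s} detection={detection_rate(name):.3f} "
          f"fp={false_positive_rate(name):.4f}")
```

For the OVERALL row this gives 17/1713, which rounds to the ~0.01 false positive rate stated above.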

Entropy and Machine Learning Classifiers
            AUG BOTS                      NOV BOTS
  test      periodic  random    respond   periodic  random    replay    human
            TP        TP        TP        TP        TP        TP        FP
  Ent       121/121   68/68     30/30     51/51     109/109   40/40     17/1713
  SupML     121/121   68/68     30/30     14/51     104/109   1/40      0/1713
  SupMLre   121/121   68/68     30/30     51/51     109/109   40/40     0/1713
  EntML     121/121   68/68     30/30     51/51     109/109   40/40     1/1713

  Ent – entropy classifier (from last slide)
  SupML – fully-supervised ML classifier, trained on AUG BOTS
  SupMLre – fully-supervised ML classifier, retrained on NOV BOTS
  EntML – entropy-trained ML on AUG BOTS


Ent
  same as the OVERALL row of the entropy classifier results


SupML
  has problems against November bots
  needs to be retrained for new bots
SupMLre
  detects all bots


EntML
  false positive rate is ~0.0005 (Ent is ~0.01)
  25 messages


Conclusion
  Measurements
    overall, chat bots are less complex than humans
    some chat bots are more human-like
  Classification System
    exploits benefits of both classifiers
    quickly classifies known chat bots
    accurately classifies unknown chat bots

Thank you!