Post on 11-Mar-2018


Using Machine Learning in Security

Sadia Afroz

University of California, Berkeley

My Research

• Stylometry

• Malware Analysis

• Internet Freedom

Impact

• ACM SIGSAC Dissertation award, 2014 (Runner up)

• Outstanding research in Privacy (PET) Award, 2013

• Best paper award, PETS 2012

• Free software: JStylo, Anonymouth, Doppelgänger Finder

• My work on Stylometry has been used by the FBI

• My work on malware analysis is being deployed at McAfee

This Talk

• Application of Machine Learning in Security

• Machine Learning under Attack

• Machine Learning in Noisy Environment

Application of Machine Learning: A case study on Stylometry

• Can we link different identities using writing style?

Underground Forums

Understand the adversaries:

• How does information flow in a cybercriminal forum?

• How is cybercrime done?

Underground Forums

• Buying malware

• Selling email/password

• Buying crypter

Writing style analysis: Stylometry

• Regional differences: Couch vs Sofa

• Similar meaning but different words: Although vs Though

Machine Learning Overview

Training documents:

• Cormac McCarthy: "What's the bravest thing you ever did? He spat in the road a bloody phlegm. Getting up this morning, he said."

• Ernest Hemingway: "He no longer dreamed of storms, nor of women, nor of great occurrences, nor of great fish, nor fights, nor contests of strength, nor of his wife."

Extract features from each author's documents:

• Frequency of function words

• Frequency of punctuation

Train a model on the extracted features.

Test document:

• "Just remember that the things you put into your head are there forever, he said. You might want to think about that."

• Model: Who wrote this? Cormac McCarthy

Important Concepts

• Feature extraction

• Ground truth
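The overview above (extract features, train a model, ask who wrote a test document) can be sketched in a few lines of Python. This is only an illustrative sketch, not the classifier used in the talk: the short `FUNCTION_WORDS` and `PUNCTUATION` lists, the function names, and the nearest-neighbor rule are all assumptions chosen to keep the example small.

```python
import math
import string
from collections import Counter

# Hypothetical, deliberately tiny feature lists (assumptions for illustration).
FUNCTION_WORDS = ["the", "of", "nor", "in", "a", "he", "you", "that", "this", "to"]
PUNCTUATION = [".", ",", "?", "'"]

def extract_features(text):
    """Relative frequencies of function words and punctuation marks."""
    lowered = text.lower()
    words = [w.strip(string.punctuation) for w in lowered.split()]
    words = [w for w in words if w]
    word_counts = Counter(words)
    char_counts = Counter(lowered)
    n_words = max(len(words), 1)
    n_chars = max(len(lowered), 1)
    features = [word_counts[w] / n_words for w in FUNCTION_WORDS]
    features += [char_counts[p] / n_chars for p in PUNCTUATION]
    return features

def train(labeled_docs):
    """The 'model' here is just one feature vector per labeled document."""
    return [(author, extract_features(text)) for author, text in labeled_docs]

def predict(model, text):
    """Return the author whose training vector is closest (Euclidean distance)."""
    target = extract_features(text)
    return min(model, key=lambda pair: math.dist(pair[1], target))[0]

# Ground truth: documents whose authors are known.
TRAINING = [
    ("Cormac McCarthy",
     "What's the bravest thing you ever did? He spat in the road a bloody "
     "phlegm. Getting up this morning, he said."),
    ("Ernest Hemingway",
     "He no longer dreamed of storms, nor of women, nor of great occurrences, "
     "nor of great fish, nor fights, nor contests of strength, nor of his wife."),
]

model = train(TRAINING)
guess = predict(model, "Just remember that the things you put into your head "
                       "are there forever, he said. You might want to think about that.")
```

With only two short training passages the guess is not reliable; real stylometry needs far more text per author and a much richer feature set.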

Underground forums

4 leaked forums:

• Antichat (Russian): password cracking, 41k users

• BlackhatWorld (English): blackhat SEO, 8k users

• Carders (German): accounts, 8k users

• L33tCrew (German): accounts, 12k users

Analyzing writing style is challenging!

• Feature extraction is hard:

• Foreign language, slang, l33tsp3ak, bad spelling

• Example post: 1337 down? **Neh, die Lösung!** Ne klappt nit, denke mal eher das sie mal wieder DNS probleme haben (roughly: "1337 down? Nah, the solution! Nope, doesn't work, I rather think they are having DNS problems again")

• No ground truth

Doppelgänger Finder: Cluster accounts that belong to a user

• For each pair of authors A and B, compute P(A wrote B’s doc) and P(B wrote A’s doc)

• Combined score, T = P(A wrote Bd) * P(B wrote Ad)

• A and B are the same author if T > threshold
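The combined score and threshold test can be written down directly. The sketch below assumes the pairwise probabilities have already been computed and stored in a nested dictionary `prob`, where `prob[x][y]` stands for P(x wrote y's documents); the dictionary layout, the function names, and the numbers are made up for illustration.

```python
from itertools import combinations

def combined_scores(prob):
    """Return {(a, b): T} with T = P(a wrote b's docs) * P(b wrote a's docs)."""
    authors = sorted(prob)
    return {(a, b): prob[a][b] * prob[b][a] for a, b in combinations(authors, 2)}

def same_author_pairs(prob, threshold):
    """Pairs whose combined score T exceeds the threshold."""
    return sorted(pair for pair, t in combined_scores(prob).items() if t > threshold)

# Made-up probabilities for three hypothetical accounts.
example = {
    "A": {"B": 0.8, "C": 0.1},
    "B": {"A": 0.7, "C": 0.2},
    "C": {"A": 0.05, "B": 0.1},
}

pairs = same_author_pairs(example, threshold=0.5)
```

Multiplying the two directions means a pair scores highly only when each account looks like the author of the other's documents, so a one-sided resemblance is penalized.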

The goal is to find the most probable author when the original author is not present.

• For author A, train a model on the documents of all the other authors: B, C, D, E

• Ask the model who wrote A's documents (Ad): P(B wrote Ad), P(C wrote Ad), P(D wrote Ad), P(E wrote Ad)

• Repeat for every author
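The "hold one author out, train on the rest, repeat for every author" loop can be sketched as follows. This is a sketch under stated assumptions, not the released Doppelgänger Finder code: the stand-in "classifier" is a normalized cosine similarity over bag-of-words counts, and the function names and toy posts are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words counts (a stand-in for real stylometric features)."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def leave_one_out_probabilities(docs_by_author):
    """Return prob[x][y]: estimated P(x wrote y's documents), with y held out."""
    profiles = {a: vectorize(" ".join(docs)) for a, docs in docs_by_author.items()}
    prob = {a: {} for a in docs_by_author}
    for held_out, target in profiles.items():
        # The held-out author is not a candidate, so the model must pick someone else.
        candidates = [a for a in profiles if a != held_out]
        sims = {a: cosine(profiles[a], target) for a in candidates}
        total = sum(sims.values())
        for a in candidates:
            # Normalize so the scores for one held-out author sum to 1.
            prob[a][held_out] = sims[a] / total if total else 1.0 / len(candidates)
    return prob

# Made-up posts for three hypothetical accounts.
forum_posts = {
    "A": ["selling fresh cc dumps icq me"],
    "B": ["selling fresh cc dumps icq me now"],
    "C": ["looking for seo backlinks and proxies"],
}

prob = leave_one_out_probabilities(forum_posts)
```

The resulting `prob[x][y]` values are the per-direction probabilities that feed the combined score T for each pair of accounts.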

Data with ground truth

• 100 English blogs

• Written by 50 authors, 2 blogs per author

• Collected by crawling Google+ profiles of the authors (Narayanan et al. 2012)

Probability scores of true pairs > false pairs

[Figure: probability scores of author pairs, with a marker labeled "No true pair after this"]

• Best threshold: True pair = 48, False pair = 5
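When ground truth is available, a threshold can be picked by counting how many true and false pairs score above each candidate value. The sketch below shows that counting on made-up scores; the function names and data are assumptions and do not reproduce the blog-dataset numbers from the talk.

```python
def count_pairs_above(scored_pairs, threshold):
    """scored_pairs: list of (score, is_true_pair). Returns (true_count, false_count)."""
    true_count = sum(1 for s, is_true in scored_pairs if s > threshold and is_true)
    false_count = sum(1 for s, is_true in scored_pairs if s > threshold and not is_true)
    return true_count, false_count

def sweep_thresholds(scored_pairs):
    """Try every observed score as a threshold: [(threshold, true_count, false_count)]."""
    thresholds = sorted({s for s, _ in scored_pairs})
    return [(t, *count_pairs_above(scored_pairs, t)) for t in thresholds]

# Made-up combined scores with known labels (True = same author).
labeled = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.1, False)]

table = sweep_thresholds(labeled)
```

Reading down `table` shows the trade-off: a lower threshold recovers more true pairs but admits more false ones.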

Score on Carders

• Our chosen threshold: 21 pairs

Manual analysis criteria

• To verify, we looked at:

• Similar or not? : Username, ICQ, Signature, Contact information, Account information, Topics.

• Do they talk with each other?

• Others:

• Do they acknowledge their other accounts?

• Do they have common properties with some other users?

• Were they banned from the forum?

Combined probability score on Carders

• Usernames: Kan**, deb**: Same ICQ

• Usernames: per**, Smi**: Acknowledge, same ICQ, sell weed

• Usernames: Pri**, Lou**: Same ICQ, Topics

• Usernames: Mr.**, Fle**: Talk with each other

• Usernames: puT**, pol**: Nothing matches

Use of duplicate accounts

• Sockpuppet:

• Create fake demand for products

• Example: one account posts "I want to sell x" and the other replies "OMG!!! I’ll buy them all"

• Accounts for sale:

• Normal accounts: 10 €, 2nd level: 25 €, 3rd level: 50 €

Summary: Doppelgänger Finder

• Code: https://github.com/sheetal57/doppelganger-finder

• Paper: Doppelgänger Finder: Taking Stylometry To The Underground. IEEE S&P 2014

• Future:

• DARPA Memex Challenge

Machine Learning under Attack: How to Evade Stylometry

• Write less

• Write differently

Evading Stylometry: write differently

• We studied two ways to change writing style:

• Imitation: imitate Cormac McCarthy

• Obfuscation: writing in a different way

• We collected regular and deceptive documents using Amazon Mechanical Turk

Accuracy in regular documents is high

[Figure: accuracy (0 to 1) vs. number of authors (5 to 40) for 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), and Random]

Accuracy in obfuscated writing

[Figure: accuracy (0 to 1) vs. number of authors (5 to 40) for 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), and Random]

Accuracy in imitated writing

[Figure: accuracy (0 to 1) vs. number of authors (5 to 40) for 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), and Random]

Evading Stylometry

• Evading stylometry is possible

• Maintaining a consistent fake writing style is hard

Real World Example: A Gay Girl in Damascus

“Amina Arraf”

I live in Damascus, Syria. It's a repressive police state. Most LGBT people are still deep in the closet or staying as invisible as possible. But I have set up a blog announcing my sexuality, with my name and my photo. Am I crazy? Maybe.

• Fake picture (copied from Facebook)

• The real “Amina”: Thomas MacMaster, a 40-year-old American male

Can Thomas fool Machine Learning?

• Training classes: Thomas (as himself), Thomas (as Amina), Random

• Test documents: A Gay Girl in Damascus

• Model: Who wrote this?

• 54% posts

• 43% posts

Summary: Evading Stylometry

• Machine learning methods perform poorly under attack

• Need to understand the cost to the adversary

Machine Learning in Noisy Environment: Website Fingerprinting Attack

• Where is Alice going?

How does it work?

• Extract features from the traffic of known pages and train a model

• Some page: the model answers "What is this page?"

Why is WF so important?

● Tor as the most advanced anonymity network

● Allows an adversary to discover the browsing history

● Series of successful attacks

● Low cost to the adversary

Number of top conference publications on WF (25)

How practical is this attack?

• Visit sites

• Collect packets

• Train model

• Test model
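The visit, collect, train, test loop can be sketched on toy data. Everything here is an assumption made for illustration: a trace is a list of signed packet sizes (positive for outgoing, negative for incoming), the four features are hypothetical, and the classifier is a plain nearest-neighbor rule rather than any attack from the literature.

```python
import math

def trace_features(trace):
    """Counts and byte totals per direction for one packet trace."""
    outgoing = [p for p in trace if p > 0]
    incoming = [-p for p in trace if p < 0]
    return [len(outgoing), len(incoming), sum(outgoing), sum(incoming)]

def train(labeled_traces):
    """Store one feature vector per collected (site, trace) pair."""
    return [(site, trace_features(t)) for site, t in labeled_traces]

def classify(model, trace):
    """Label a trace with the site of its nearest training vector."""
    feats = trace_features(trace)
    return min(model, key=lambda pair: math.dist(pair[1], feats))[0]

def accuracy(model, labeled_traces):
    hits = sum(classify(model, t) == site for site, t in labeled_traces)
    return hits / len(labeled_traces)

# Made-up traces: "visit sites" and "collect packets" happen outside this sketch.
training_traces = [
    ("site-a", [600, -1500, -1500, -1500]),
    ("site-b", [600, -300, 600, -300]),
]
test_traces = [
    ("site-a", [600, -1500, -1500, -1400]),
    ("site-b", [500, -300, 600, -300]),
]

wf_model = train(training_traces)
score = accuracy(wf_model, test_traces)
```

The toy test traces were collected under the same conditions as the training traces, which is exactly the kind of assumption that may not hold in practice.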

Unrealistic assumptions

• Adversary: e.g., replicability

Website fingerprinting attack with multi-tab browsing

[Figure: bandwidth (BW) over time for Tab 1 and Tab 2]

• Control: 77.08%

• Test (0.5s): 9.8%

• Test (3s): 7.9%

• Test (5s): 8.23%

Website fingerprinting attack with different Tor versions

● Coexisting Tor Browser Bundle (TBB) versions

● Versions: 2.4.7, 3.5 and 3.5.2.1 (changes in RP, etc.)

[Figure: bar chart for Control (3.5.2.1), Test (2.4.7), and Test (3.5); values shown: 79.58%, 66.75%, 6.51%]

Unrealistic assumptions

• Web: e.g., staleness

Website Staleness

[Figure: accuracy (%) vs. time (days)]

• Accuracy is less than 50% after 9 days.

Theoretical Accuracy vs. Practical Accuracy

A Critical Evaluation of Website Fingerprinting Attacks. ACM CCS 2014

Importance of the Result

• Need for the right evaluation

• Focus on the important problems

Summary

• Application of Machine Learning in Security

• Machine Learning under Attack

• Machine Learning in Noisy Environment

Future Work

• Machine Learning in Security: Adversarially robust features

• Machine Learning in analyzing cybercriminal network
