Using Machine Learning in Security
Sadia Afroz
University of California, Berkeley
My Research
• Stylometry
• Malware Analysis
• Internet Freedom
Impact
• ACM SIGSAC Dissertation Award, 2014 (runner-up)
• Outstanding research in Privacy (PET) Award, 2013
• Best paper award, PETS 2012
• Free software: JStylo, Anonymouth, Doppelgänger Finder
• My work on stylometry has been used by the FBI
• My work on malware analysis is being deployed at McAfee
This Talk
• Application of Machine Learning in Security
• Machine Learning under Attack
• Machine Learning in Noisy Environment
Application of Machine Learning: A case study on Stylometry
• Can we link different identities using writing style?
Underground Forums
Understand the adversaries:
• How does information flow in a cybercriminal forum?
• How is cybercrime done?
Underground Forums
Buying malware
Selling email/password
Buying crypter
Writing style analysis: Stylometry
Regional differences: Couch vs Sofa
Similar meaning but different words: Although vs Though
Machine Learning Overview
Cormac McCarthy: What's the bravest thing you ever did? He spat in the road a bloody phlegm. Getting up this morning, he said.
Ernest Hemingway: He no longer dreamed of storms, nor of women, nor of great occurrences, nor of great fish, nor fights, nor contests of strength, nor of his wife.
Extract features
• Freq of function words
• Freq of punctuation
Model
Machine Learning Overview
Test document: Just remember that the things you put into your head are there forever, he said. You might want to think about that.
Who wrote this?
Model: Cormac McCarthy
Important Concepts
• Feature extraction
• Ground truth
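A minimal sketch of the two concepts, using the quotes from the overview slides as ground truth. The function-word list, the punctuation set, and the nearest-centroid classifier are illustrative assumptions, not the features or model used in the talk.

```python
import math
import string
from collections import Counter

# Assumed, deliberately tiny feature sets (real stylometry uses far more).
FUNCTION_WORDS = ["the", "of", "nor", "a", "in", "that", "you", "he", "to", "and"]
PUNCTUATION = [".", ",", "?", "!", ";"]

def extract_features(text):
    """Feature extraction: relative frequency of function words and punctuation."""
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    tokens = [t for t in tokens if t]
    word_counts = Counter(tokens)
    n_words = max(len(tokens), 1)
    n_chars = max(len(text), 1)
    word_feats = [word_counts[w] / n_words for w in FUNCTION_WORDS]
    punct_feats = [text.count(p) / n_chars for p in PUNCTUATION]
    return word_feats + punct_feats

def train(labeled_docs):
    """Ground truth: (author, text) pairs. The model is one mean vector per author."""
    grouped = {}
    for author, text in labeled_docs:
        grouped.setdefault(author, []).append(extract_features(text))
    return {a: [sum(col) / len(col) for col in zip(*vecs)]
            for a, vecs in grouped.items()}

def predict(model, text):
    """Attribute the text to the author whose mean vector is closest."""
    feats = extract_features(text)
    def dist(v):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(v, feats)))
    return min(model, key=lambda author: dist(model[author]))

MCCARTHY = ("What's the bravest thing you ever did? He spat in the road a "
            "bloody phlegm. Getting up this morning, he said.")
HEMINGWAY = ("He no longer dreamed of storms, nor of women, nor of great "
             "occurrences, nor of great fish, nor fights, nor contests of "
             "strength, nor of his wife.")

model = train([("Cormac McCarthy", MCCARTHY), ("Ernest Hemingway", HEMINGWAY)])
test_doc = ("Just remember that the things you put into your head are there "
            "forever, he said. You might want to think about that.")
print(predict(model, test_doc))
```

With one short document per author this is only a toy; the point is that the model can be no better than its features and its labeled training data.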
Underground forums
4 leaked forums
• Antichat (Russian): password cracking, 41k users
• BlackhatWorld (English): blackhat SEO, 8k users
• Carders (German): accounts, 8k users
• L33tCrew (German): accounts, 12k users
Analyzing writing style is challenging!
• Feature extraction is hard:
• Foreign language, slang, l33tsp3ak, bad spelling
1337 down? **Neh, die Lösung!** Ne klappt nit, denke mal eher das sie mal wieder DNS probleme haben
• No ground truth
Doppelgänger Finder: Cluster accounts that belong to a user
Author A, Author B
• P(A wrote B’s doc)
• P(B wrote A’s doc)
Combined score, T = P(A wrote B’s doc) × P(B wrote A’s doc)
A and B are the same author if T > threshold
Doppelgänger Finder: Cluster accounts that belong to a user
The goal is to find the most probable author when the original author is not present.
• Authors: A, B, C, D, E
• Train a model on B, C, D, E, leaving A out
• Ask the model who wrote A’s documents
• P(B wrote A’s doc), P(C wrote A’s doc), P(D wrote A’s doc), P(E wrote A’s doc)
• Repeat for every author
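The pairing step can be sketched as below, assuming the leave-one-out probabilities have already been computed by some probabilistic classifier. The `prob` table and the threshold value are hypothetical numbers for illustration only.

```python
from itertools import combinations

def combined_scores(prob):
    """prob[a][b] = P(b wrote a's documents), from a model trained without a.

    Returns T = P(A wrote B's doc) * P(B wrote A's doc) for every pair.
    """
    return {(a, b): prob[a][b] * prob[b][a]
            for a, b in combinations(sorted(prob), 2)}

def same_author_pairs(prob, threshold):
    """Pairs of accounts whose combined score exceeds the threshold."""
    scores = combined_scores(prob)
    return sorted(pair for pair, t in scores.items() if t > threshold)

# Hypothetical leave-one-out probabilities for three accounts.
prob = {
    "A": {"B": 0.9, "C": 0.1},
    "B": {"A": 0.8, "C": 0.2},
    "C": {"A": 0.5, "B": 0.5},
}
print(same_author_pairs(prob, threshold=0.5))
```

Multiplying the two directions means a pair only scores high when each account looks like the other, which filters out one-sided matches.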
Doppelgänger Finder: Cluster accounts that belong to a user
Data with ground truth
• 100 English blogs
• Written by 50 authors, 2 blogs per author
• Collected by crawling Google+ profiles of the authors (Narayanan et al. 2012)
Probability scores of true pairs > false pairs
No true pair after this
Best threshold: 48 true pairs, 5 false pairs
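On data with ground truth, a candidate threshold can be checked by counting the true and false pairs it accepts. A small sketch, where the scored pairs are made-up values and not the blog data from the talk:

```python
def count_pairs(scored_pairs, threshold):
    """Count (true_pairs, false_pairs) whose combined score exceeds the threshold.

    scored_pairs: iterable of (score, is_true_pair) tuples.
    """
    scored_pairs = list(scored_pairs)
    true_pairs = sum(1 for s, is_true in scored_pairs if s > threshold and is_true)
    false_pairs = sum(1 for s, is_true in scored_pairs if s > threshold and not is_true)
    return true_pairs, false_pairs

# Hypothetical combined scores with ground-truth labels.
scored = [(0.9, True), (0.7, True), (0.6, False), (0.2, True), (0.1, False)]
for threshold in (0.05, 0.5, 0.65):
    print(threshold, count_pairs(scored, threshold))
```

Raising the threshold trades missed true pairs for fewer false pairs; which trade-off counts as best depends on the application.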
Score on Carders
Our chosen threshold: 21 pairs
Manual analysis criteria
• To verify, we looked at:
• Similar or not? Username, ICQ, signature, contact information, account information, topics
• Do they talk with each other?
• Others:
• Do they acknowledge their other accounts?
• Do they have common properties with some other users?
• Were they banned from the forum?
Combined probability score on Carders
• Usernames: Kan**, deb** (same ICQ)
• Usernames: per**, Smi** (acknowledge, same ICQ, sell weed)
• Usernames: Pri**, Lou** (same ICQ, topics)
• Usernames: Mr.**, Fle** (talk with each other)
• Usernames: puT**, pol** (nothing matches)
Use of duplicate accounts
• Sockpuppet:
• Create fake demand for products
“I want to sell x”
“OMG!!! I’ll buy them all”
Use of duplicate accounts
• Accounts for sale:
• Normal accounts: 10 €, 2nd level: 25 €, 3rd level: 50 €
Summary: Doppelgänger Finder
• Code: https://github.com/sheetal57/doppelganger-finder
• Paper: Doppelgänger Finder: Taking Stylometry To The Underground. IEEE S&P 2014
• Future:
• DARPA Memex Challenge
Machine Learning under Attack: How to Evade Stylometry
• Write less
• Write differently
Evading Stylometry: write differently
• We studied two ways to change writing style:
• Imitation: imitate Cormac McCarthy
• Obfuscation: write in a different way
• We collected regular and deceptive documents using Amazon Mechanical Turk
Accuracy in regular documents is high
[Figure: accuracy (0 to 1) vs. number of authors (5 to 40) for 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), and Random]
Accuracy in obfuscated writing
[Figure: accuracy (0 to 1) vs. number of authors (5 to 40) for 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), and Random]
Accuracy in imitated writing
[Figure: accuracy (0 to 1) vs. number of authors (5 to 40) for 9-Feature (NN), Synonym-Based, Writeprints Baseline (SVM), and Random]
Evading Stylometry
• Evading stylometry is possible
• Maintaining a consistent fake writing style is hard
Real World Example: A Gay Girl in Damascus
“Amina Arraf”
I live in Damascus, Syria. It's a repressive police state. Most LGBT people are still deep in the closet or staying as invisible as possible. But I have set up a blog announcing my sexuality, with my name and my photo. Am I crazy? Maybe.
Fake picture (copied from Facebook)
The real “Amina”: Thomas MacMaster, a 40-year-old American male
Can Thomas fool Machine Learning?
Train the model on:
• Thomas (as himself)
• Thomas (as Amina)
• Random
Test on: A Gay Girl in Damascus
Who wrote this?
54% posts
43% posts
Summary: Evading Stylometry
• Machine learning methods perform poorly under attack
• Need to understand the cost to the adversary
Machine Learning in Noisy Environment: Website Fingerprinting Attack
Where is Alice going?
How does it work?
Training: Extract features → Model
Testing: Some page → Model → What is this page?
Why is WF so important?
● Tor as the most advanced anonymity network
● Allows an adversary to discover the browsing history
● Series of successful attacks
● Low cost to the adversary
Number of top conference publications on WF (25)
How practical is this attack?
• Visit sites
• Collect packets
• Train model
• Test model
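The four steps above can be sketched end to end on toy data. This is an illustration only: the trace format (signed packet sizes), the four summary features, the k-nearest-neighbour classifier, and the traces themselves are all assumptions, not the attack evaluated in the talk.

```python
import math
from collections import Counter

def trace_features(trace):
    """Summarise a trace of signed packet sizes (+ outgoing, - incoming)."""
    outgoing = [p for p in trace if p > 0]
    incoming = [-p for p in trace if p < 0]
    return [len(outgoing), len(incoming), sum(outgoing), sum(incoming)]

def knn_predict(training, trace, k=3):
    """training: list of (site, trace). Majority vote among the k nearest traces."""
    target = trace_features(trace)
    nearest = sorted(
        training,
        key=lambda item: math.dist(trace_features(item[1]), target),
    )[:k]
    votes = Counter(site for site, _ in nearest)
    return votes.most_common(1)[0][0]

def accuracy(training, test):
    """Fraction of test traces attributed to the correct site."""
    hits = sum(1 for site, trace in test if knn_predict(training, trace) == site)
    return hits / len(test)

# Hypothetical traces "collected" by visiting two sites.
training = [
    ("site_a", [600, -1500, -1500, 600]),
    ("site_a", [600, -1500, -1500, -1500, 600]),
    ("site_b", [300, -500]),
    ("site_b", [300, -500, 300]),
]
test = [
    ("site_a", [600, -1500, -1500, 580]),
    ("site_b", [300, -480]),
]
print(accuracy(training, test))
```

The test traces here come from the same conditions as the training traces. The practical question is what happens to the accuracy when they do not.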
Unrealistic assumptions
Adversary: e.g., replicability
Website fingerprinting attack with multi-tab browsing
[Figure: bandwidth (BW) over time for Tab 1 and Tab 2]
Accuracy:
• Control: 77.08%
• Test (0.5s): 9.8%
• Test (3s): 7.9%
• Test (5s): 8.23%
Website fingerprinting attack with different Tor versions
● Coexisting Tor Browser Bundle (TBB) versions
● Versions: 2.4.7, 3.5 and 3.5.2.1 (changes in RP, etc.)
[Figure: accuracy for Control (3.5.2.1), Test (2.4.7), and Test (3.5); values shown: 79.58%, 66.75%, 6.51%]
Unrealistic assumptions
Web: e.g., staleness
Website Staleness
[Figure: accuracy (%) vs. time (days)]
Accuracy is less than 50% after 9 days.
Theoretical Accuracy vs. Practical Accuracy
A Critical Evaluation of Website Fingerprinting Attacks. ACM CCS 2014
Importance of the Result
• Need for the right evaluation
• Focus on important problems
Summary
• Application of Machine Learning in Security
• Machine Learning under Attack
• Machine Learning in Noisy Environment
Future Work
• Machine Learning in Security: Adversarially robust features
• Machine Learning for analyzing cybercriminal networks