How to Build Realistic Machine Learning Systems for Security?

Sadia Afroz, ICSI and Avast

Rajarshi Gupta, Avast

Machine Learning is necessary for detecting malware at scale

Evtimov, I., et al. (2017). Robust physical-world attacks on deep learning models. arXiv preprint arXiv:1707.08945.

Goodfellow, I. J., Shlens, J., & Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

…but Machine Learning is unreliable, inexplicable and easily fooled

Is machine learning useful for security?

Malware + Benign ==> Extract features ==> Features ==> Train a model ==> Model

Let’s build a malware detector using machine learning

Malware + Benign ==> Extract features ==> Features ==> Train a model ==> Model

New file ==> Model ==> Malware
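The pipeline above can be sketched in a few lines. This is a toy illustration only: the byte-histogram features, the nearest-centroid model, and the sample byte strings are all assumptions made for the example, not the detector described in the talk.

```python
from collections import Counter


def extract_features(data: bytes, bins: int = 16) -> list:
    """Toy features: normalized histogram of byte values in `bins` buckets."""
    counts = Counter(b * bins // 256 for b in data)
    total = max(len(data), 1)
    return [counts.get(i, 0) / total for i in range(bins)]


def train(samples) -> dict:
    """Toy model: one mean feature vector (centroid) per label."""
    grouped = {}
    for data, label in samples:
        grouped.setdefault(label, []).append(extract_features(data))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(model: dict, data: bytes) -> str:
    """Return the label whose centroid is closest to the file's features."""
    feats = extract_features(data)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(feats, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))


# Invented training data: labels here are simply asserted, which is
# exactly the "quality of the data" problem the next slides discuss.
training_set = [
    (bytes([0x90] * 64), "malware"),
    (bytes([0xCC] * 32 + [0x90] * 32), "malware"),
    (b"hello world, plain text file", "benign"),
    (b"another ordinary document", "benign"),
]
model = train(training_set)
new_file_label = predict(model, bytes([0x90] * 48 + [0xCC] * 16))
```

The model can only be as good as the labels in `training_set`, whatever features or learner are used.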

Quality of the data ==> Quality of the model

[Code sample shown on slide]

Is this malware?

The answer depends on WHO you ask and WHEN you ask

According to VirusTotal…

https://www.virustotal.com/gui/file/3120b563781b5ead9fdebc906818836329f362bf8e3ea7ee3dbfd4ceb0ebd8dd/detection

Sep 2019: ~42% of AVs considered it malware

[Figure: VirusTotal detection results for the sample, Sep 2019 vs. Jan 2020]
How can we protect users from malware when we don’t know what malware is?

What is malware?

Malware ==> Run the file ==> Analyze (static + dynamic)

Run the file on: users' machine / virtual machine / sandbox

Malware: highly suspicious files

Too time consuming!

Solution: Get labels from other sources

We studied 40 papers from 2001-2019 to check where they get their ground truth from

[Figure: bar chart of ground-truth sources (Collection, AV Label, Manual); axis 0 to 50]

What is malware?

• 9 use labels by one AV
• 2 papers: Malware >=4, Benign == 0
• 2 papers: Malware >=5, Benign <=1
• 1 paper: Malware >=10, Benign == 0
• 1 paper: Malware == ALL, Benign == 0
• 1 paper: Malware == Majority, Benign == 0
• 1 paper: Malware == Weighted Majority, Benign == 0



How to compare different approaches?
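To see why approaches built on these labeling rules are hard to compare, the sketch below applies several of them to the same sample. It is an illustration under assumptions: the thresholds are read as counts of AV engines flagging the file, "majority" is taken to mean strictly more than half, samples meeting neither condition are left unlabeled, and the weighted-majority rule is omitted because its weights are not given.

```python
def count_rule(malware_min, benign_max):
    """Label by how many AV engines flag the file."""
    def rule(positives, total):
        if positives >= malware_min:
            return "malware"
        if positives <= benign_max:
            return "benign"
        return None  # neither threshold met: sample stays unlabeled
    return rule


def fraction_rule(is_malware):
    """Label by a condition on (positives, total); benign only at zero."""
    def rule(positives, total):
        if is_malware(positives, total):
            return "malware"
        return "benign" if positives == 0 else None
    return rule


RULES = {
    "Malware >=4, Benign == 0": count_rule(4, 0),
    "Malware >=5, Benign <=1": count_rule(5, 1),
    "Malware >=10, Benign == 0": count_rule(10, 0),
    "Malware == ALL, Benign == 0": fraction_rule(lambda p, t: p == t),
    # Assumption: "majority" means strictly more than half of the engines.
    "Malware == Majority, Benign == 0": fraction_rule(lambda p, t: 2 * p > t),
}


def label_under_each_rule(positives, total):
    """Return the label every rule assigns to one sample."""
    return {name: rule(positives, total) for name, rule in RULES.items()}


# Hypothetical sample flagged by 6 of 60 engines.
labels = label_under_each_rule(positives=6, total=60)
```

For this hypothetical sample, two rules say malware and three leave it unlabeled, so papers using different rules would train and test on different datasets.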

What is malware?

• A number of very large, professional companies share their labels on VirusTotal
• Great correlation in general, especially for top companies
• 96% agreement after 3 days
• 99% agreement after 3 weeks

Source: AISec 2015

Professional Heuristics for Ground Truth

[Figure: Avast results (100k samples in Sep 2019); x-axis: # of days since first occurrence of sample]

Our (professional) rule of thumb for malware ground truth: one-week-delayed results on VT from the top few (<10) companies are good enough
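A minimal sketch of how such a rule of thumb might be applied to a sample's scan history. The vendor names, the `Scan` record, and in particular the majority vote used to combine the top vendors' verdicts are assumptions for illustration; the talk does not say how the verdicts are aggregated.

```python
from dataclasses import dataclass


@dataclass
class Scan:
    vendor: str
    day: int          # days since the sample was first seen
    is_malware: bool


TOP_VENDORS = {"vendor_a", "vendor_b", "vendor_c"}  # hypothetical names


def delayed_votes(scans, top_vendors=TOP_VENDORS, delay_days=7):
    """Per top vendor, keep the earliest verdict made at or after delay_days."""
    votes = {}
    for scan in sorted(scans, key=lambda s: s.day):
        if scan.vendor in top_vendors and scan.day >= delay_days:
            votes.setdefault(scan.vendor, scan.is_malware)
    return votes


def ground_truth(scans, top_vendors=TOP_VENDORS, delay_days=7):
    votes = delayed_votes(scans, top_vendors, delay_days)
    if not votes:
        return None  # no delayed verdicts yet: wait rather than guess
    positives = sum(votes.values())
    # Assumption: simple majority of the top vendors that reported.
    return "malware" if 2 * positives > len(votes) else "benign"


history = [
    Scan("vendor_a", 0, False),   # early verdict, ignored
    Scan("vendor_a", 8, True),
    Scan("vendor_b", 7, True),
    Scan("vendor_c", 9, False),
    Scan("other_vendor", 8, False),  # not a top vendor, ignored
]
label = ground_truth(history)
```

Day-0 verdicts are deliberately ignored, since the agreement figures above suggest labels settle only after some days.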

Does the overall performance of the classifiers matter?


Which of the classifiers is best?

Depends upon where you look!
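One way to read "where you look" is the false positive rate at which classifiers are compared; the slide's figure is not recoverable, so that reading is an assumption. The sketch below uses invented scores for two hypothetical classifiers, A and B, and measures the detection rate (TPR) each achieves within a given false positive budget.

```python
def tpr_at_fpr(benign_scores, malware_scores, max_fpr):
    """Best true positive rate reachable with false positive rate <= max_fpr."""
    best = 0.0
    for threshold in sorted(set(benign_scores) | set(malware_scores)):
        fpr = sum(s >= threshold for s in benign_scores) / len(benign_scores)
        tpr = sum(s >= threshold for s in malware_scores) / len(malware_scores)
        if fpr <= max_fpr:
            best = max(best, tpr)
    return best


# Invented scores (higher = more malicious) for two hypothetical classifiers.
A_BENIGN = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]
A_MALWARE = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
B_BENIGN = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.6, 0.7, 0.99]
B_MALWARE = [0.4, 0.45, 0.5, 0.55, 0.65, 0.75, 0.8, 0.85, 0.9, 0.95]

a_low = tpr_at_fpr(A_BENIGN, A_MALWARE, max_fpr=0.1)   # strict FP budget
b_low = tpr_at_fpr(B_BENIGN, B_MALWARE, max_fpr=0.1)
a_high = tpr_at_fpr(A_BENIGN, A_MALWARE, max_fpr=0.3)  # loose FP budget
b_high = tpr_at_fpr(B_BENIGN, B_MALWARE, max_fpr=0.3)
```

With these scores, A detects more malware under the strict budget and B detects more under the loose one, so a single overall number would hide the difference that matters in deployment.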


Adversarial attacks

More than 1500 papers on adversarial ML

Only 36 (2.4%) papers focus on evading malware detectors

Graph credit: Nicholas Carlini, Google Brain

Can adversarial malware evade malware detectors?

Are adversarial attacks harmful for users?

Adversarial attacks: feature space vs problem space

[Figure: Extract features ==> feature vector (rows: 0 1 1 0 / 1 1 1 1 / 1 0 0 0 / 0 0 0 0 / 1 1 1 1)]

Evading Machine Learning Model

Checking Harm to Users
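A small sketch of what evading a model in feature space can look like. The linear model, its weights, and the greedy bit-flipping strategy are all hypothetical; the point is only that flipping bits in a feature vector says nothing about whether a matching file exists, still runs, or still harms users.

```python
# Hypothetical linear model over four binary features.
WEIGHTS = [2.0, -1.0, 1.5, -0.5]
BIAS = -1.0


def score(x):
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS


def is_malware(x):
    return score(x) > 0


def evade_in_feature_space(x):
    """Greedily flip the bit that lowers the score most until benign."""
    x = list(x)
    flipped = []
    while is_malware(x):
        # Score change from flipping bit i: +w for 0 -> 1, -w for 1 -> 0.
        candidates = [
            (w if xi == 0 else -w, i)
            for i, (w, xi) in enumerate(zip(WEIGHTS, x))
            if i not in flipped
        ]
        if not candidates:
            break
        delta, i = min(candidates)
        if delta >= 0:
            break  # no remaining flip lowers the score
        x[i] = 1 - x[i]
        flipped.append(i)
    return x, flipped


adversarial, flipped = evade_in_feature_space([1, 0, 1, 0])
```

Here the attack succeeds by switching off features 0 and 2. In problem space that could mean deleting functionality the malware needs, which is why checking the modified file itself is a separate step.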


[Figure: adding a new section to the file]

The new section can overwrite an existing section

When adding a new section at the end of the last section, if the sample has overlay data, the new section will overwrite the overlay data.
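The overlay problem can be shown with a toy byte layout. This is not a real PE parser: the file contents, the `last_section_end` offset, and the naive append routine are hypothetical stand-ins for what a modification tool might do.

```python
def append_section_naive(file_bytes: bytes, last_section_end: int,
                         new_section: bytes) -> bytes:
    """Write the new section directly after the last section's data."""
    out = bytearray(file_bytes)
    end = last_section_end + len(new_section)
    if len(out) < end:
        out.extend(b"\x00" * (end - len(out)))  # grow the file if needed
    out[last_section_end:end] = new_section     # overwrites whatever is there
    return bytes(out)


def overlay(file_bytes: bytes, last_section_end: int) -> bytes:
    """Everything stored after the last section counts as overlay data."""
    return file_bytes[last_section_end:]


# Toy file: headers and two sections, followed by overlay data.
sections = b"HEADERS" + b"SECTION1" + b"SECTION2"
original = sections + b"OVERLAY-DATA"
last_section_end = len(sections)

modified = append_section_naive(original, last_section_end, b"NEWSEC")
```

After the append, the first bytes of the overlay are gone, so a sample that reads its payload or configuration from the overlay may no longer work.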



[Figure: new section 4; section header; new section header]

Overwrite existing sections


Are adversarial attacks harmful to users?

• [count not recoverable] papers changed the malware files
• 9/36 papers tried to execute the adversarial samples
• 4/36 papers check if the modified malware is harmful to users
• 0/36 [description not recoverable]

[1] Xu et al., NDSS Talk: Automatically Evading Classifiers (including Gmail’s).


Is evading one classifier enough?

Sample ==> Signature* ==> Malware / Benign / Not matched

Not matched ==> Static ==> Malware / Benign / Maybe benign

Maybe benign ==> Dynamic ==> Malware / Benign / Maybe malware

Maybe malware ==> More analysis

* Hashes and hand-written rules

We are here
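The staged pipeline can be sketched as a chain in which each stage either returns a verdict or passes the sample on. The hash sets, score fields, and thresholds below are hypothetical stand-ins for the real signature, static, and dynamic stages.

```python
KNOWN_BAD = {"bad-hash"}    # hypothetical signature database
KNOWN_GOOD = {"good-hash"}  # hypothetical allow list


def signature(sample):
    if sample["hash"] in KNOWN_BAD:
        return "malware"
    if sample["hash"] in KNOWN_GOOD:
        return "benign"
    return None  # not matched: pass to the next stage


def threshold_stage(key, low=0.1, high=0.9):
    """Build a stage that decides only when the score is clear-cut."""
    def stage(sample):
        value = sample[key]
        if value >= high:
            return "malware"
        if value <= low:
            return "benign"
        return None  # "maybe": pass to the next stage
    return stage


STAGES = [
    ("signature", signature),
    ("static", threshold_stage("static_score")),
    ("dynamic", threshold_stage("dynamic_score")),
]


def classify(sample, stages=STAGES):
    """Return (deciding stage, verdict); undecided samples need more analysis."""
    for name, stage in stages:
        verdict = stage(sample)
        if verdict is not None:
            return name, verdict
    return "more analysis", None


# A sample modified to look less suspicious to the static classifier only.
evasive = {"hash": "new-hash", "static_score": 0.3, "dynamic_score": 0.95}
decision = classify(evasive)
```

In this toy run the sample gets past the static stage but is still caught by the dynamic one, which is the sense in which evading a single classifier may not be enough.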



Who is the adversary?

White box: Adversary has full access

Grey box (between the two extremes), for example:

• Adversary has full access to the features
• Adversary can do unlimited queries
• Adversary has access to the training data
• Adversary can build substitute classifiers

Black box: Adversary has no access

Consistent ground truth

Measurable adversary

Proper evaluation

How to Build Realistic Machine Learning Systems for Security?

Questions?

Research contributors

Rajarshi Gupta, VP, Head of AI, Avast

Deepali Garg, Senior Data Scientist, Avast

Fabrizio Bondi, AI Manager, Avast

Heng Yin, Associate Professor, UC Riverside

Wei Song, PhD Student, UC Riverside

Xuezixiang Li, PhD Student, UC Riverside

Sadia Afroz
