Libor Mořkovský - Recognizing Malware


Upload: machine-learning-prague

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Libor Mořkovský - Recognizing Malware

Recognizing malware
Libor Mořkovský

Page 2: Libor Mořkovský - Recognizing Malware

Computer virus

[Figure: bacterial cell; based on work by Anderson Brito]

Page 3: Libor Mořkovský - Recognizing Malware

Computer virus

executable file

entry point

Page 4: Libor Mořkovský - Recognizing Malware

Computer virus
Inserting code into files is never “good”.

executable file

entry point

Page 5: Libor Mořkovský - Recognizing Malware

Malware
How do you recognize a thief?

[Image courtesy of Looking Glass Studios]

Page 6: Libor Mořkovský - Recognizing Malware

Malware
How do you recognize a thief?

[Images courtesy of Looking Glass Studios, Twentieth Century Fox]

Page 7: Libor Mořkovský - Recognizing Malware

Malware
How do you recognize a thief?

[Images courtesy of Looking Glass Studios, Twentieth Century Fox, Paramount Pictures]

Page 8: Libor Mořkovský - Recognizing Malware

Malware
• completely different behaviors are considered “bad”
• we need a judge to decide who crossed the line

Page 9: Libor Mořkovský - Recognizing Malware

Malware | Many faces
• unlike real thieves, malware can be duplicated
• not only duplicated, but also modified
• all this is done by machines
• too much work to judge each one manually

Page 10: Libor Mořkovský - Recognizing Malware

Finding similar files

[Figure: scatter plot of files in two MDS dimensions (axes MDS1, MDS2), points marked by class: CLEAN, MALWARE, QUERY, UNKNOWN]

Page 11: Libor Mořkovský - Recognizing Malware

Finding similar files
• need a file representation
• need a distance function

Page 12: Libor Mořkovský - Recognizing Malware

Finding similar files | File vector
• each executable file is represented by a feature vector
• the PE format is complex, so we keep exactly one version of the extractor code (C++)
• the vector comprises static and dynamic features; the exact content is proprietary

Database record
• One record = constant vector of over 100 attributes
• the “file fingerprint”
• Each attribute has a data type and semantic

Attribute              Data Type       Semantic
sha256                 32 byte array   CHECKSUM
pe_sect_cnt            uint16_t        VALUE
pe_sect_rawoff_entry   uint32_t        OFFSET

• The complete contents of the vector are kept secret
• static and dynamic features of PE executables

Page 13: Libor Mořkovský - Recognizing Malware

Finding similar files | Distance
• sum of partial distances
• each distance operator assigned manually
• weights assigned manually to equalize contribution

Nearest neighbor query
• Compound distance function
• Data type and semantic determine the partial distance function

Data Type       Semantic   Partial distance function
32 byte array   CHECKSUM   RETURN_ZERO
uint16_t        VALUE      EQUAL_RET32
uint32_t        OFFSET     LOG

• Each partial distance function = one kernel function
• Over 100 kernels for every NN query
• Intermediate results kept in the “Scratchpad”
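
The compound distance described on this slide can be sketched roughly as follows. This is only an illustration: the kernel names RETURN_ZERO, EQUAL_RET32 and LOG come from the slide, but their bodies here are guesses based on the names, and the weights, the `SCHEMA` layout and the `bytes32` type tag are hypothetical.

```python
import math

# Partial distance functions. Names are from the slide; the bodies are
# assumptions about what such kernels might compute, not the real ones.
def return_zero(a, b):
    return 0.0  # a checksum carries no similarity information

def equal_ret32(a, b):
    return 0.0 if a == b else 32.0  # guessed from the name

def log_dist(a, b):
    return abs(math.log2(1 + a) - math.log2(1 + b))  # guessed from the name

# (data type, semantic) selects the partial distance function.
KERNELS = {
    ("bytes32", "CHECKSUM"): return_zero,
    ("uint16_t", "VALUE"): equal_ret32,
    ("uint32_t", "OFFSET"): log_dist,
}

# Hypothetical schema: attribute -> (data type, semantic, weight).
SCHEMA = {
    "sha256": ("bytes32", "CHECKSUM", 1.0),
    "pe_sect_cnt": ("uint16_t", "VALUE", 1.0),
    "pe_sect_rawoff_entry": ("uint32_t", "OFFSET", 1.0),
}

def compound_distance(x, y, schema=SCHEMA):
    """Weighted sum of the partial distances over all attributes."""
    total = 0.0
    for name, (dtype, semantic, weight) in schema.items():
        total += weight * KERNELS[(dtype, semantic)](x[name], y[name])
    return total

a = {"sha256": b"\x00" * 32, "pe_sect_cnt": 4, "pe_sect_rawoff_entry": 1023}
b = {"sha256": b"\x01" * 32, "pe_sect_cnt": 5, "pe_sect_rawoff_entry": 4095}
print(compound_distance(a, b))  # 32 for the section count + 2 for the offset
```

Keeping the kernel choice in a lookup keyed by data type and semantic means a new attribute only needs a schema entry, which matches the slide's point that the type and semantic determine the partial function.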

Page 14: Libor Mořkovský - Recognizing Malware

Finding similar files | Data
• ~60 M data points
• sparse and well separated (in many cases)

Page 15: Libor Mořkovský - Recognizing Malware

Finding similar files | Implementation
• we started with GPUs
• their high memory throughput allows “naive” implementation and rapid prototyping
• column-oriented database
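
A “naive” nearest neighbor scan over a column-oriented store might look like the sketch below. It runs on the CPU in plain Python purely for illustration; the sample columns, the two kernels and the function name `brute_force_nearest` are hypothetical, and the real system presumably runs one GPU kernel per column.

```python
# Column-oriented store: one list per attribute; row i is the i-th file.
columns = {
    "pe_sect_cnt": [4, 5, 4, 9],
    "pe_sect_rawoff_entry": [1024, 1024, 4096, 512],
}

def mismatch(a, b):
    return 0.0 if a == b else 1.0

def abs_diff(a, b):
    return float(abs(a - b))

# Hypothetical assignment of one partial distance per column.
kernels = {"pe_sect_cnt": mismatch, "pe_sect_rawoff_entry": abs_diff}

def brute_force_nearest(query, columns, kernels):
    """Scan every row, one column at a time, and return (row index, distance)."""
    n = len(next(iter(columns.values())))
    scratch = [0.0] * n  # running per-row totals, filled column by column
    for name, column in columns.items():
        kernel = kernels[name]
        q = query[name]
        for i, value in enumerate(column):
            scratch[i] += kernel(q, value)
    best = min(range(n), key=scratch.__getitem__)
    return best, scratch[best]

query = {"pe_sect_cnt": 4, "pe_sect_rawoff_entry": 4000}
print(brute_force_nearest(query, columns, kernels))
```

Processing one column at a time reads memory sequentially, which is the access pattern that benefits from high memory throughput.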

Page 16: Libor Mořkovský - Recognizing Malware

Classification | Requirements
• find easily what is responsible for a mistake – transparency
• fix the problem quickly – tractability

Page 17: Libor Mořkovský - Recognizing Malware

Classification | Algorithm
Instance-based classifier.
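
The slide only says that the classifier is instance based. One common form is a k-nearest-neighbor vote, sketched below under that assumption; `k`, `max_dist`, the majority vote and the UNKNOWN fallback are illustrative choices, not details from the talk.

```python
from collections import Counter

def classify(query, labeled, distance, k=3, max_dist=10.0):
    """Vote among the k nearest labeled instances that lie within max_dist.

    labeled is a list of (vector, label) pairs. Returns the winning label and
    the neighbors that produced it, so a wrong verdict can be traced back to
    specific stored instances.
    """
    scored = sorted(
        ((distance(query, vec), label, vec) for vec, label in labeled),
        key=lambda item: item[0],
    )
    near = [item for item in scored[:k] if item[0] <= max_dist]
    if not near:
        return "UNKNOWN", []
    votes = Counter(label for _, label, _ in near)
    return votes.most_common(1)[0][0], near

# Toy one-dimensional data with an absolute-difference distance.
labeled = [((1,), "CLEAN"), ((2,), "CLEAN"),
           ((10,), "MALWARE"), ((11,), "MALWARE"), ((12,), "MALWARE")]
dist = lambda a, b: abs(a[0] - b[0])

print(classify((10.5,), labeled, dist)[0])
print(classify((100,), labeled, dist)[0])
```

Returning the supporting neighbors alongside the label is one way to get the transparency and tractability asked for earlier: a mistake points at concrete instances, and relabeling or removing them fixes it without retraining.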

Page 18: Libor Mořkovský - Recognizing Malware

Classification | Optimizations
• scaling and HW problems with GPUs
• we invested in algorithmic optimizations:
• VP-tree, distance bounded search
• hand optimized distance function (assembly)
• CPU version is ~100x faster
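
A vantage-point tree with a distance-bounded nearest neighbor search can be sketched as below. This is a generic textbook-style version, not the speaker's code. It assumes the distance is a true metric (the pruning relies on the triangle inequality), and the names `VPNode`, `build` and `nearest` are made up for the example.

```python
import math
import random

class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # vantage point
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree with distance < radius
        self.outside = outside  # subtree with distance >= radius

def build(points, distance, rng=None):
    rng = rng or random.Random(0)
    points = list(points)
    if not points:
        return None
    vp = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vp, 0.0, None, None)
    dists = sorted(((distance(vp, p), p) for p in points), key=lambda t: t[0])
    radius = dists[len(dists) // 2][0]
    inside = [p for d, p in dists if d < radius]
    outside = [p for d, p in dists if d >= radius]
    return VPNode(vp, radius,
                  build(inside, distance, rng), build(outside, distance, rng))

def nearest(root, query, distance, bound=float("inf")):
    """Return (dist, point) for the closest point nearer than bound,
    or (bound, None) if no stored point is that close."""
    best = [bound, None]

    def visit(node):
        if node is None:
            return
        d = distance(query, node.point)
        if d < best[0]:
            best[0], best[1] = d, node.point
        # Descend into the likelier side first; skip a side when the
        # triangle inequality shows it cannot hold anything closer.
        if d < node.radius:
            if d - best[0] < node.radius:
                visit(node.inside)
            if d + best[0] >= node.radius:
                visit(node.outside)
        else:
            if d + best[0] >= node.radius:
                visit(node.outside)
            if d - best[0] < node.radius:
                visit(node.inside)

    visit(root)
    return best[0], best[1]

rng = random.Random(1)
pts = [(rng.random(), rng.random()) for _ in range(200)]
tree = build(pts, math.dist)
print(nearest(tree, (0.5, 0.5), math.dist))
```

Passing a finite `bound` starts the search with a tight pruning radius, so queries far from all stored points return quickly instead of walking most of the tree.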

Page 19: Libor Mořkovský - Recognizing Malware

Classification | Deployment

[Diagram: data flows between Avast users, FileRep, Scavenger and Medusa. Labeled flows: File SHA and user id; File prevalence; File classification; File fingerprint; Generic detections; File classifications and Evo-gen detections; Threats; Set updates.]

Page 22: Libor Mořkovský - Recognizing Malware

Rule generator
• detect more variants in the wild
• (our) rule is a conjunction of several conditions
• known as Win32:Evo-Gen
• completely different optimization problem than classification - still uses the GPU
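
To make “a conjunction of several conditions” concrete, the sketch below treats a rule as a list of attribute/value conditions that must all hold, and counts its hits on malware and on clean files. The equality-only condition form, the sample records and the names `matches` and `coverage` are assumptions for illustration; the slides do not describe how the actual generator searches for rules.

```python
def matches(rule, record):
    """A rule is a list of (attribute, expected value) pairs; all must hold."""
    return all(record.get(attr) == value for attr, value in rule)

def coverage(rule, malware, clean):
    """Return (malware records detected, clean records wrongly flagged)."""
    detected = sum(matches(rule, r) for r in malware)
    false_positives = sum(matches(rule, r) for r in clean)
    return detected, false_positives

# Hypothetical rule and records using attribute names from the earlier slide.
rule = [("pe_sect_cnt", 4), ("pe_sect_rawoff_entry", 1024)]
malware = [
    {"pe_sect_cnt": 4, "pe_sect_rawoff_entry": 1024},
    {"pe_sect_cnt": 4, "pe_sect_rawoff_entry": 1024},
    {"pe_sect_cnt": 5, "pe_sect_rawoff_entry": 1024},
]
clean = [{"pe_sect_cnt": 4, "pe_sect_rawoff_entry": 2048}]

print(coverage(rule, malware, clean))
```

Choosing conditions that maximize detections while keeping false positives at zero is a search over combinations of conditions, which is one way to see why rule generation is a different optimization problem from classification.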

Page 23: Libor Mořkovský - Recognizing Malware
Page 24: Libor Mořkovský - Recognizing Malware

Q&A