Recognizing malware
Libor Mořkovský
Computer virus
[Image: bacterial cell, based on work by Anderson Brito]
Computer virus
[Figure: executable file and its entry point]
Inserting code into files is never “good”.
Malware
How do you recognize a thief?
[Images courtesy of Looking Glass Studios, Twentieth Century Fox, Paramount Pictures]
Malware
• completely different behaviors are considered “bad”
• we need a judge to decide who crossed the line
Malware | Many faces
• unlike real thieves, malware can be duplicated
• not only duplicated, but also modified
• all this is done by machines
• too much work to judge each one manually
Finding similar files
[Figure: scatter plot of files projected onto two MDS axes (MDS1, MDS2); points are marked by class: CLEAN, MALWARE, QUERY, UNKNOWN]
Finding similar files
• need a file representation
• need a distance function
Finding similar files | File vector
• each executable file is represented by a feature vector
• the PE format is complex, so we keep exactly one version of the extractor code (C++)
• the vector comprises static and dynamic features; the exact content is proprietary
Database record
• One record = constant vector of over 100 attributes
  • the “file fingerprint”
• Each attribute has a data type and semantic

Attribute             Data Type      Semantic
sha256                32 byte array  CHECKSUM
pe_sect_cnt           uint16_t       VALUE
pe_sect_rawoff_entry  uint32_t       OFFSET

• The complete contents of the vector are kept secret
  • static and dynamic features of PE executables
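As a rough illustration of a fixed-size record whose attributes each carry a data type and a semantic, the sketch below models only the three attributes named in the table. The names `Attribute`, `SCHEMA`, and `RECORD`, the use of Python's `struct` format codes to stand in for the C data types, and the little-endian layout are all assumptions made for the example; the real vector has over 100 attributes and its contents are not public.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str      # attribute name as shown in the table
    fmt: str       # struct format code standing in for the data type (assumption)
    semantic: str  # semantic label, e.g. CHECKSUM, VALUE, OFFSET


# Only the three attributes shown on the slide; the rest are not public.
SCHEMA = [
    Attribute("sha256", "32s", "CHECKSUM"),            # 32 byte array
    Attribute("pe_sect_cnt", "H", "VALUE"),            # uint16_t
    Attribute("pe_sect_rawoff_entry", "I", "OFFSET"),  # uint32_t
]

# "<" means little-endian with no padding, so every record has the same size.
RECORD = struct.Struct("<" + "".join(a.fmt for a in SCHEMA))


def pack_record(values: dict) -> bytes:
    """Serialize one record in schema order."""
    return RECORD.pack(*(values[a.name] for a in SCHEMA))


def unpack_record(blob: bytes) -> dict:
    """Inverse of pack_record."""
    return dict(zip((a.name for a in SCHEMA), RECORD.unpack(blob)))
```

Keeping the semantic next to the data type is what would later let a distance function be chosen per attribute without inspecting the values themselves.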
Finding similar files | Distance
• sum of partial distances
• each distance operator assigned manually
• weights assigned manually to equalize contribution
Nearest neighbor query
• Compound distance function
• Data type and semantic determine the partial distance function

Data Type      Semantic  Partial distance function
32 byte array  CHECKSUM  RETURN_ZERO
uint16_t       VALUE     EQUAL_RET32
uint32_t       OFFSET    LOG

• Each partial distance function = one kernel function
• Over 100 kernels for every NN query
• Intermediate results kept in the “Scratchpad”
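A minimal sketch of how a compound distance might be assembled from per-attribute partial distances selected by (data type, semantic). Only the kernel names RETURN_ZERO, EQUAL_RET32, and LOG come from the slide. Their bodies here are guesses: a fixed penalty of 32 on mismatch and a log-scaled absolute difference are assumptions, and the weights are invented for the example.

```python
import math


# Partial distance functions. Names follow the slide; bodies are assumptions.
def return_zero(a, b):
    return 0.0                        # a checksum says nothing about similarity


def equal_ret32(a, b):
    return 0.0 if a == b else 32.0    # assumed: fixed penalty of 32 on mismatch


def log_dist(a, b):
    return math.log2(1 + abs(a - b))  # assumed: log-scaled absolute difference


# (data type, semantic) -> partial distance function, as in the table.
KERNELS = {
    ("32 byte array", "CHECKSUM"): return_zero,
    ("uint16_t", "VALUE"): equal_ret32,
    ("uint32_t", "OFFSET"): log_dist,
}

# (name, data type, semantic, weight). The weights are made up.
COLUMNS = [
    ("sha256", "32 byte array", "CHECKSUM", 1.0),
    ("pe_sect_cnt", "uint16_t", "VALUE", 1.0),
    ("pe_sect_rawoff_entry", "uint32_t", "OFFSET", 2.0),
]


def compound_distance(x, y):
    """Weighted sum of per-attribute partial distances."""
    total = 0.0
    for name, dtype, semantic, weight in COLUMNS:
        kernel = KERNELS[(dtype, semantic)]
        total += weight * kernel(x[name], y[name])
    return total
```

For two records that differ in section count and whose entry offsets differ by 7, this sketch gives 32 + 2 · log2(8) = 38.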
Finding similar files | Data
• ~60 M data points
• sparse and well separated (in many cases)
Finding similar files | Implementation
• we started with GPUs
• their high memory throughput allows “naive” implementation and rapid prototyping
• column-oriented database
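The “naive” approach over a column-oriented layout can be pictured as a brute-force scan: one pass per column, each pass adding its partial distance into a per-record accumulator. The sketch below is plain Python rather than GPU code, and the column names, values, and partial distance functions are invented placeholders; the accumulator is called `scratchpad` only by analogy with the term used on the nearest neighbor slide.

```python
# Column-oriented layout: one list per attribute, all of equal length.
# Attribute names and values are invented for illustration.
columns = {
    "sect_cnt": [3, 5, 5, 8],
    "entry_off": [512, 1024, 4096, 1024],
}

# One partial distance per column (placeholders, not the real functions).
partials = {
    "sect_cnt": lambda a, b: 0.0 if a == b else 1.0,
    "entry_off": lambda a, b: abs(a - b) / 1024.0,
}


def nearest(query, k=1):
    """Brute-force k-NN: scan each column once, accumulate per-record sums."""
    n = len(next(iter(columns.values())))
    scratchpad = [0.0] * n                # running distance for every record
    for name, values in columns.items():  # one pass per column
        dist = partials[name]
        q = query[name]
        for i, v in enumerate(values):
            scratchpad[i] += dist(q, v)
    order = sorted(range(n), key=scratchpad.__getitem__)
    return [(i, scratchpad[i]) for i in order[:k]]
```

Each inner loop touches a single contiguous column, which is the access pattern that benefits from high memory throughput.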
Classification | Requirements
• find easily what is responsible for a mistake – transparency
• fix the problem quickly – tractability
Classification | Algorithm
Instance-based classifier.
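One common form of instance-based classification is a k-nearest-neighbor vote, sketched below. The talk does not state the actual decision rule, so the majority vote, the parameter `k`, and the optional `max_dist` cut-off are assumptions. The class labels are borrowed from the legend of the scatter plot.

```python
from collections import Counter


def classify(query, labeled, distance, k=3, max_dist=None):
    """Label `query` by majority vote among its k nearest labeled instances.

    labeled  : list of (vector, label) pairs
    distance : callable(vector, vector) -> float
    max_dist : optional cut-off; neighbors farther away are ignored
    """
    scored = sorted((distance(query, vec), label) for vec, label in labeled)
    neighbors = [(d, lab) for d, lab in scored[:k]
                 if max_dist is None or d <= max_dist]
    if not neighbors:
        return "UNKNOWN", []
    votes = Counter(lab for _, lab in neighbors)
    label, _ = votes.most_common(1)[0]
    # Returning the neighbors shows which stored instances drove the decision.
    return label, neighbors
```

Because the verdict is traceable to specific stored instances, a wrong classification can be inspected and corrected by relabeling or removing the instance responsible, which fits the transparency and tractability requirements.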
Classification | Optimizations
• scaling and HW problems with GPUs
• we invested in algorithmic optimizations:
  • VP-tree, distance-bounded search
  • hand-optimized distance function (assembly)
• CPU version is ~100x faster
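A minimal sketch of a vantage-point tree with a distance-bounded (range) search, to show where the pruning comes from. It is a textbook-style version, not the optimized implementation from the talk: the names `VPNode`, `build`, and `search` are made up, the vantage point is chosen at random, and the pruning is only valid if the distance function obeys the triangle inequality, which the talk does not state for its compound distance.

```python
import random


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point, self.radius = point, radius
        self.inside, self.outside = inside, outside


def build(points, distance, rng=None):
    """Recursively split points by distance to a randomly chosen vantage point."""
    if rng is None:
        rng = random.Random(0)
    points = list(points)
    if not points:
        return None
    vp = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vp, 0.0, None, None)
    scored = [(distance(vp, p), p) for p in points]
    radius = sorted(d for d, _ in scored)[len(scored) // 2]  # median distance
    inside = [p for d, p in scored if d <= radius]
    outside = [p for d, p in scored if d > radius]
    return VPNode(vp, radius,
                  build(inside, distance, rng), build(outside, distance, rng))


def search(node, query, bound, distance, hits=None):
    """Collect (distance, point) for every point within `bound` of `query`."""
    if hits is None:
        hits = []
    if node is None:
        return hits
    d = distance(query, node.point)
    if d <= bound:
        hits.append((d, node.point))
    # Triangle inequality: skip a subtree when none of its points
    # can lie within `bound` of the query.
    if d - bound <= node.radius:
        search(node.inside, query, bound, distance, hits)
    if d + bound > node.radius:
        search(node.outside, query, bound, distance, hits)
    return hits
```

With sparse, well-separated data and a small bound, most queries descend into only one side of each node, so far fewer distance evaluations are needed than in a full scan.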
Classification | Deployment
[Diagram: data flows among Avast users, FileRep, Scavenger, and Medusa. Arrow labels: file SHA and user id; file prevalence; file classification; file fingerprint; generic detections; file classifications and Evo-gen detections; threats; set updates]
Rule generator
• detect more variants in the wild
• (our) rule is a conjunction of several conditions
• known as Win32:Evo-Gen
• completely different optimization problem than classification - still uses the GPU
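To make “a conjunction of several conditions” concrete, the sketch below represents a rule as a list of per-attribute predicates that must all hold. The attribute names are taken from the database record slide, but the conditions, the thresholds, and the helper names `matches` and `coverage` are invented for illustration; the actual Evo-Gen rule format is not described in the talk.

```python
# A rule is a list of (attribute, predicate) conditions; it fires only if
# every condition holds. Conditions and thresholds are invented.
rule = [
    ("pe_sect_cnt", lambda v: v == 5),
    ("pe_sect_rawoff_entry", lambda v: 0x400 <= v < 0x800),
]


def matches(rule, record):
    """Conjunction: True only when all conditions are satisfied."""
    return all(pred(record[attr]) for attr, pred in rule)


def coverage(rule, records):
    """Count how many records in a set the rule detects."""
    return sum(matches(rule, r) for r in records)
```

One plausible objective for a generator, stated here as an assumption, is to search for rules with high `coverage` on known malware and zero `coverage` on clean files. That is a search over sets of conditions rather than a nearest neighbor lookup.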
Q&A