malware detection
-
Efficient Malware Classification Techniques
Arno Pol
September 18, 2014
Introduction
Guilt by association
Malware dataset
Decision tree
Mining substrings
-
About malware
Any software used to disrupt computer operation, gather sensitive information, or gain access to private computer systems.
Persistence is common
Malware can exist in multiple locations, for instance in RAM, on disk, and even in CMOS
Most malware is undetected or easily hidden
Encryption and packers are used to transform malware
-
About malware detection
Millions of malicious binaries
Thousands of new binaries each year
Massive cost to operations
Corporate espionage, online theft, spam
Contrary to popular belief, bugs aren't rare
Have to beat packers and encryption
The usual approach of hashing is not sufficient
-
Malware detection techniques
Signature-based techniques
Behavior-based techniques
Dynamic analysis techniques
Data-mining methods can apply to all of the above
Strong selection for a low false-positive rate is necessary
Deletion of important data and system files can be devastating
-
Signature-based techniques
Binary analysis
Static (source code) analysis
Entry point, Strings, IAT, section table
I/O Analysis (Includes network API)
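One of the static features listed above, embedded strings, can be pulled out of a binary with very little code. This is a minimal sketch, assuming printable ASCII runs of at least four bytes count as a string; the function name `extract_strings` and the sample bytes are hypothetical and not from the talk.

```python
import re

def extract_strings(data: bytes, min_length: int = 4):
    """Return runs of printable ASCII at least `min_length` bytes long."""
    # 0x20-0x7e is the printable ASCII range (space through tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]

# Hypothetical bytes resembling a fragment of an import table.
sample = b"\x00\x01kernel32.dll\x00ab\x00CreateFileA\xff"
found = extract_strings(sample)
print(found)
```

Strings such as imported DLL and API names are cheap to extract, which is one reason they are a common signature feature.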
-
Behavior-based techniques
System log analysis
Network traffic analysis
Power consumption analysis
System call monitoring
Filesystem I/O monitoring
-
Dynamic analysis techniques
Taint analysis
Memory analysis
-
Guilt by association
Mining file-relation graphs
A man is known by the company he keeps
A file is known by the files that appear with it on the machine
Symantec's Polonium
-
Polonium
Problem In Chair Not In Computer
Users who do not follow good security practice often download lots of malware
Create a bipartite graph between machines and files, where each edge is a file existing on a machine
This gives information about a file's badness
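The machine-file bipartite graph described above can be sketched as two adjacency maps. This is an illustration only, assuming the input is a list of (machine, file) pairs; the names `build_bipartite_graph` and `reports` and the toy data are hypothetical, not Polonium's actual data model.

```python
from collections import defaultdict

def build_bipartite_graph(reports):
    """Build both sides of the machine-file graph.

    `reports` is an iterable of (machine_id, file_hash) pairs; each pair
    is one edge, meaning that file exists on that machine.
    """
    files_on_machine = defaultdict(set)
    machines_with_file = defaultdict(set)
    for machine_id, file_hash in reports:
        files_on_machine[machine_id].add(file_hash)
        machines_with_file[file_hash].add(machine_id)
    return files_on_machine, machines_with_file

# Hypothetical toy data.
reports = [
    ("m1", "fileA"), ("m1", "fileB"),
    ("m2", "fileB"), ("m2", "fileC"),
    ("m3", "fileC"),
]
files_on_machine, machines_with_file = build_bipartite_graph(reports)
print(sorted(machines_with_file["fileB"]))
```

Looking up a file gives the machines it lives on, and looking up a machine gives the files it keeps company with, which is the information a reputation score is propagated over.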
-
AESOP
Unlike Polonium, captures file-to-file relations
File identifiers are SHA-256 hashes
Polonium's machine identification is imperfect (serial numbers)
AESOP uses Norton Community Watch data
-
AESOP
-
Co-occurrence strength
Jaccard similarity
Co-occurrence strength between sets M_{f_i} and M_{f_j} (the sets of machines on which files f_i and f_j appear)
Filter out any files with a tiny presence, below a threshold
Expensive to calculate
Locality-sensitive hashing (for instance, Hamming distance)
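The co-occurrence strength above is the Jaccard similarity of two machine sets, |A ∩ B| / |A ∪ B|. A minimal sketch follows; the presence threshold value and the names `jaccard` and `filter_rare` are assumptions for illustration.

```python
def jaccard(machines_i, machines_j):
    """Jaccard similarity |A & B| / |A | B| of two machine sets."""
    union = machines_i | machines_j
    if not union:
        return 0.0
    return len(machines_i & machines_j) / len(union)

def filter_rare(machines_with_file, min_machines):
    """Drop files seen on fewer than `min_machines` machines."""
    return {
        file_id: machines
        for file_id, machines in machines_with_file.items()
        if len(machines) >= min_machines
    }

# Hypothetical toy data: fileC has too small a presence to be kept.
machines_with_file = {
    "fileA": {"m1", "m2", "m3"},
    "fileB": {"m2", "m3", "m4"},
    "fileC": {"m9"},
}
kept = filter_rare(machines_with_file, min_machines=2)
strength = jaccard(kept["fileA"], kept["fileB"])
print(strength)  # 2 shared machines out of 4 in total -> 0.5
```

Computing this for every pair of files is quadratic in the number of files, which is why the exact calculation is expensive at scale.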
-
Locality-Sensitive Hashing (LSH)
Randomly reorder machines into set M
Generate a set of MinHash values, and separate them into bands
Apply a random permutation function during hashing
Order into buckets depending on similarity
If a file appears in a bucket, there is a high chance it co-occurs with at least one file in that bucket
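The MinHash and banding steps above can be sketched as follows. This is a hedged illustration, not AESOP's implementation: a seeded SHA-256 hash stands in for each random permutation of the machines, and the names `minhash_signature` and `lsh_buckets`, the band sizes, and the toy data are assumptions.

```python
import hashlib
from collections import defaultdict

def minhash_signature(machines, num_hashes):
    """One MinHash value per seeded hash function.

    Hashing with seed k stands in for the k-th random permutation of the
    machine list. `machines` must be a non-empty set of machine ids.
    """
    signature = []
    for k in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha256(f"{k}:{m}".encode()).digest()[:8], "big")
            for m in machines
        ))
    return signature

def lsh_buckets(machines_with_file, bands, rows):
    """Split each signature into `bands` bands of `rows` values.

    Files whose values agree on a whole band land in the same bucket.
    """
    buckets = defaultdict(set)
    for file_id, machines in machines_with_file.items():
        signature = minhash_signature(machines, bands * rows)
        for b in range(bands):
            band = tuple(signature[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(file_id)
    return buckets

# Hypothetical toy data: fileA and fileB always co-occur, fileC never does.
toy = {
    "fileA": {"m1", "m2", "m3"},
    "fileB": {"m1", "m2", "m3"},
    "fileC": {"m7", "m8"},
}
buckets = lsh_buckets(toy, bands=4, rows=2)
for key, members in sorted(buckets.items()):
    print(key[0], sorted(members))
```

Only files that share a bucket need an exact similarity check, so most of the quadratic pairwise work is avoided.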
-
Locality-Sensitive Hashing continued
Effect of the number of bands, b, and the number of MinHash values in each band, r, on co-occurrence
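For standard MinHash banding, two files with Jaccard similarity s share at least one bucket with probability 1 - (1 - s^r)^b. The sketch below evaluates that textbook formula; the values of b and r are arbitrary examples, not the settings used in the talk.

```python
def candidate_probability(s, b, r):
    """Probability that two files with Jaccard similarity `s` agree on
    all `r` MinHash values in at least one of `b` bands."""
    return 1.0 - (1.0 - s ** r) ** b

# Arbitrary example settings to show the S-shaped curve.
for similarity in (0.2, 0.5, 0.8):
    print(similarity, round(candidate_probability(similarity, b=10, r=5), 3))
```

More bands raise the chance that a pair becomes a candidate, while more values per band lower it, so b and r together set the similarity level at which the curve rises steeply.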
-
Putting it all together
Create bipartite graph, with edges from files to buckets.
If two files are often connected through the buckets, they are more likely to be strongly co-occurring
Apply belief propagation to the Markov random field we just created.
With buckets labeled 0.5, bad files 0.01, and good files 0.99
We now have an idea about which files are good and bad
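The propagation step can be sketched with a small loopy belief propagation routine over binary states (bad, good). The priors 0.5, 0.01 and 0.99 come from the slide; the edge potential (0.5 + eps for equal labels, 0.5 - eps otherwise), the value of `eps`, the iteration count, and the toy graph are assumptions made for illustration.

```python
def belief_propagation(edges, priors, eps=0.1, iterations=10):
    """Loopy BP over binary states; returns P(good) for every node.

    `priors` maps node -> prior probability of being good.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    # Index 0 is "bad", index 1 is "good".
    phi = {n: [1.0 - priors[n], priors[n]] for n in neighbors}
    # Assumed homophily potential: neighbors tend to share a label.
    psi = [[0.5 + eps, 0.5 - eps], [0.5 - eps, 0.5 + eps]]
    # messages[(i, j)] is the message sent from node i to node j.
    messages = {(i, j): [1.0, 1.0] for i in neighbors for j in neighbors[i]}
    for _ in range(iterations):
        updated = {}
        for (i, j) in messages:
            out = [0.0, 0.0]
            for xj in (0, 1):
                for xi in (0, 1):
                    incoming = 1.0
                    for k in neighbors[i]:
                        if k != j:
                            incoming *= messages[(k, i)][xi]
                    out[xj] += phi[i][xi] * psi[xi][xj] * incoming
            total = out[0] + out[1]
            updated[(i, j)] = [out[0] / total, out[1] / total]
        messages = updated
    beliefs = {}
    for n in neighbors:
        belief = list(phi[n])
        for k in neighbors[n]:
            belief[0] *= messages[(k, n)][0]
            belief[1] *= messages[(k, n)][1]
        beliefs[n] = belief[1] / (belief[0] + belief[1])
    return beliefs

# Hypothetical file-bucket graph: each unknown file shares one bucket
# with a labeled file.
edges = [
    ("known_bad", "bucket_1"), ("unknown_1", "bucket_1"),
    ("known_good", "bucket_2"), ("unknown_2", "bucket_2"),
]
priors = {
    "known_bad": 0.01, "known_good": 0.99,
    "unknown_1": 0.5, "unknown_2": 0.5,
    "bucket_1": 0.5, "bucket_2": 0.5,
}
beliefs = belief_propagation(edges, priors)
print(beliefs["unknown_1"] < 0.5 < beliefs["unknown_2"])
```

In this toy graph the unknown file that shares a bucket with a known-bad file ends below 0.5, and the one that shares a bucket with a known-good file ends above it.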
-
Results
Labeling 1.6 million unlabeled files
-
Results Continued
Labeling unknown malware
-
Data dump
2552 benign and 1202 malware samples
Heap commit default value 4096
Stack reserve median appears to be set by the MSVC compiler (40000h)
-
Data dump
Number of sections can be reduced in malware (packers?)
Malware tends to have more imports
The datestamp corresponds to dataset (2011)
-
Data dump
Entropy very similar
Entry points tend to be closer (packers/encryption/optimisation?)
Weird section alignment
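The entropy feature mentioned above is usually computed as the Shannon entropy of the byte distribution, in bits per byte. The talk does not say how its entropy values were computed, so this is a sketch under that assumption, and the function name is hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, between 0 and 8 bits."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in counts.values()
    )

print(shannon_entropy(b"\x00" * 100))      # a single repeated byte
print(shannon_entropy(bytes(range(256))))  # every byte value once
```

Packed or encrypted sections tend toward the upper end of this range, which is why entropy is often inspected even when, as here, it separates the two classes poorly.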
-
Data dump
Weird file alignment more common
Lower image versions
Sometimes odd heap reserves and smaller stack usage
-
Preliminary analysis
The dataset is from 2011, so the timestamp is of course a strong classifier, but a useless one
-
Modified tree
Much more sensible
Stack reserve and image base strong classifiers
-
Modified tree
15% misclassification rate
-
Code substrings
Malware uses bugs, and sometimes encryption and packing
Identify pieces of code that do things the malware needs
and that pieces of malware have in common
Naive comparison is O(n^2) on large datasets
Subdivide the code sections into segments
Hash those segments, and compare.
Use bloom filter to filter matches.
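The segment, hash and Bloom filter steps above can be sketched as follows. The segment length, step, filter size, and the names `BloomFilter` and `segments` are assumptions for illustration; the talk does not give its parameters at this point.

```python
import hashlib

class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits  # assumed to be a multiple of 8
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # Derive several bit positions by salting one hash function.
        for k in range(self.num_hashes):
            digest = hashlib.sha256(bytes([k]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

def segments(code: bytes, length=16, step=4):
    """Yield fixed-length segments of a code section at regular offsets."""
    for start in range(0, len(code) - length + 1, step):
        yield code[start:start + length]

# Hypothetical code sections that share one 16-byte run.
sample_a = bytes(range(64))
sample_b = b"\xff" * 8 + bytes(range(32)) + b"\xee" * 8

seen = BloomFilter()
for segment in segments(sample_a):
    seen.add(segment)

# Segments of sample_b that may also occur in sample_a.
candidates = [segment for segment in segments(sample_b) if segment in seen]
print(len(candidates))
```

Because a Bloom filter can report false positives, the candidates still need an exact comparison; the filter only removes the bulk of segments that cannot match.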
-
Algorithm outline
-
Building classifier
Remove subsequences that are common in non-malware
Count the predictive value of subsequences on the training set
To classify, walk through a binary, comparing each n-length subsequence to the database
Using a max length of 10 and a step of 4 bytes, running time is O(10 N log(N))
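The classifier steps above can be sketched as below. Several details are assumptions: lengths are taken to be 1 to 10 units of 4 bytes, the predictive value is simplified to the fraction of malware training samples containing a subsequence, and a dict lookup replaces whatever ordered structure gives the log(N) factor. All names and sample bytes are hypothetical.

```python
def subsequences(binary: bytes, max_units=10, step=4):
    """Yield subsequences of 1..max_units units of `step` bytes,
    starting at every `step`-byte offset."""
    for start in range(0, len(binary), step):
        for units in range(1, max_units + 1):
            end = start + units * step
            if end > len(binary):
                break
            yield binary[start:end]

def build_database(malware_samples, benign_samples):
    """Map subsequence -> fraction of malware samples containing it,
    after removing anything that also occurs in a benign sample."""
    benign = set()
    for sample in benign_samples:
        benign.update(subsequences(sample))
    counts = {}
    for sample in malware_samples:
        for seq in set(subsequences(sample)) - benign:
            counts[seq] = counts.get(seq, 0) + 1
    return {seq: n / len(malware_samples) for seq, n in counts.items()}

def score(binary: bytes, database):
    """Walk through a binary and sum the values of known subsequences."""
    return sum(database.get(seq, 0.0) for seq in subsequences(binary))

# Hypothetical training data.
malware = [b"EVILCODE" + b"\x00" * 8, b"EVILCODE" + b"\x01" * 8]
benign = [b"\x00" * 16]
database = build_database(malware, benign)

print(score(b"EVILCODEabcd", database))
print(score(b"\x00" * 16, database))
```

A threshold on the score then decides the label, and moving that threshold trades detections against false positives.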
-
ROC curve
-
ROC Explanation
False positives increase past the 70% mark
Undesirable, may damage the system
Ways to circumvent this: hashing, as applied in modern virus scanners
Strong classifier of unknown malware
Even with a small dataset
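For reference, the points of a ROC curve come from sweeping a threshold over classifier scores and recording the false-positive and true-positive rates at each step. A minimal sketch with made-up scores and labels follows; it does not reproduce the talk's results.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) pairs, one per
    distinct threshold, from the strictest threshold to the loosest.

    `labels` uses 1 for malware and 0 for benign; both classes must occur.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= threshold]
        true_pos = sum(flagged)
        false_pos = len(flagged) - true_pos
        points.append((false_pos / negatives, true_pos / positives))
    return points

# Made-up scores: three malware samples and one benign sample.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 1, 0, 1]
points = roc_points(scores, labels)
print(points)
```

Reading the curve from left to right shows how many detections each additional false positive buys, which matters here because a false positive can mean deleting a legitimate file.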
-
Precision-recall graph
-
Questions
Hopefully you enjoyed the presentation.
If there are any questions about the presentation, now is the time to ask.