malware detection

Upload: alicia-schneider

Post on 10-Oct-2015


  • Efficient Malware Classification Techniques

    Arno Pol

    September 18, 2014

    Introduction

    Guilt by association

    Malware dataset

    Decision tree

    Mining substrings

  • About malware

    Any software used to disrupt computer operation, gather sensitive information, or gain access to private computer systems.

    Persistence common

    Malware can exist in multiple locations, for instance in RAM, on disk, and even in CMOS

    Most malware is undetected or easily hidden

    Encryption and packers are used to transform malware

  • About malware detection

    Millions of malicious binaries

    Thousands of new binaries each year

    Massive cost to operations

    Corporate espionage, online theft, spam

    Contrary to popular belief, bugs aren't rare

    Have to beat packers and encryption

    The usual approach of hashing is not sufficient
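
    A minimal sketch of why exact hashing alone falls short: changing a single bit of a binary gives a completely different digest, so a hash blocklist misses even trivially modified variants. The byte strings below are hypothetical stand-ins, not real samples, and `sha256_hex` is a helper name chosen for this illustration.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as a hex string."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical stand-in for a known malicious binary (not a real sample).
original = b"MZ" + bytes(range(64))
# Flip one bit in the last byte, as a trivial repack or patch might.
variant = original[:-1] + bytes([original[-1] ^ 0x01])

# Blocklist containing only the digest of the original.
known_bad = {sha256_hex(original)}

print(sha256_hex(original) in known_bad)  # the exact copy is caught
print(sha256_hex(variant) in known_bad)   # the one-bit variant is missed
```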

  • Malware detection techniques

    Signature-based techniques

    Behavior-based techniques

    Dynamic analysis techniques

    Data-mining methods can apply to all of the above

    Strong selection for a low false-positive rate is necessary

    Deletion of important data and system files can be devastating

  • Signature-based techniques

    Binary analysis

    Static (source code) analysis

    Entry point, Strings, IAT, section table

    I/O Analysis (Includes network API)

  • Behavior-based techniques

    System log analysis

    Network traffic analysis

    Power consumption analysis

    System call monitoring

    Filesystem I/O monitoring

  • Dynamic analysis techniques

    Taint analysis

    Memory analysis

  • Guilt by association

    Mining file-relation graphs

    A man is known by the company he keeps

    A file is known by the files that appear with it on the machine

    Symantec's Polonium

  • Polonium

    Problem In Chair Not In Computer

    Users who do not follow good security practice often download lots of malware

    Create a bipartite graph between machines and files, with each edge representing a file existing on a machine

    This gives information about a file's badness
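
    The machine-file graph described above can be sketched as two adjacency maps. This is only an illustration under assumed inputs: the `reports` list, the identifiers in it, and the `build_bipartite` helper are hypothetical and not taken from Polonium itself.

```python
from collections import defaultdict

# Hypothetical reports: one (machine_id, file_id) pair per file seen on a machine.
reports = [
    ("m1", "fileA"), ("m1", "fileB"),
    ("m2", "fileB"), ("m2", "fileC"),
    ("m3", "fileC"),
]

def build_bipartite(edges):
    """Build both sides of the bipartite graph as adjacency sets."""
    machines_of = defaultdict(set)  # file    -> machines it appears on
    files_on = defaultdict(set)     # machine -> files present on it
    for machine, file_id in edges:
        machines_of[file_id].add(machine)
        files_on[machine].add(file_id)
    return machines_of, files_on

machines_of, files_on = build_bipartite(reports)
print(sorted(machines_of["fileB"]))  # machines that report fileB
print(sorted(files_on["m2"]))        # files reported by machine m2
```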

  • AESOP

    Unlike Polonium, AESOP captures file-to-file relations

    File identifiers are SHA-256 hashes

    Polonium's machine identification (serial numbers) is imperfect

    AESOP uses Norton Community Watch data

  • AESOP

  • Co-occurrence strength

    Jaccard similarity

    Co-occurrence strength between the machine sets $M_{f_i}$ and $M_{f_j}$ of files $f_i$ and $f_j$

    Filter out any files with a tiny presence, below a threshold

    Expensive to calculate

    Locality-sensitive hashing (for instance, Hamming distance)
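
    A small sketch of the co-occurrence measure above: the Jaccard similarity of the machine sets of two files, after dropping files seen on too few machines. The machine sets and the `min_machines` cutoff are made-up values for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def filter_rare(machines_of: dict, min_machines: int) -> dict:
    """Drop files that appear on fewer than min_machines machines."""
    return {f: m for f, m in machines_of.items() if len(m) >= min_machines}

# Hypothetical file -> machine-set mapping.
machines_of = {
    "f1": {"m1", "m2", "m3", "m4"},
    "f2": {"m2", "m3", "m4", "m5"},
    "f3": {"m9"},  # tiny presence, filtered out below
}

kept = filter_rare(machines_of, min_machines=2)
print(jaccard(kept["f1"], kept["f2"]))  # 3 shared machines out of 5 in total
```

    Computing this for every pair of files is quadratic in the number of files, which is the cost the hashing step is meant to avoid.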

  • Locality-Sensitive Hashing (LSH)

    Randomly reorder machines into set M

    Generate a set of minhash values, and separate into bands

    Apply random permutation function during hashing

    Order into buckets depending on similarity

    If a file appears in a bucket, there is a high chance it co-occurs with at least one file in that bucket
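
    The steps above can be sketched with MinHash signatures split into bands. This is a sketch under assumptions, not the exact procedure from the talk: machines are represented by integer IDs, each random permutation is simulated by a hash of the form (a*x + b) mod p, and the function names and parameter values are chosen for this example.

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # large prime; (a*x + b) % PRIME permutes IDs below PRIME

def make_hash_params(n, seed=0):
    """Draw n (a, b) pairs, each simulating one random permutation of machines."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]

def minhash_signature(machine_ids, params):
    """One MinHash value per permutation: the smallest hashed machine ID."""
    return [min((a * m + b) % PRIME for m in machine_ids) for a, b in params]

def lsh_buckets(machines_of, bands, rows, seed=0):
    """Split each signature into bands; files with an identical band share a bucket."""
    params = make_hash_params(bands * rows, seed)
    buckets = defaultdict(set)
    for file_id, machine_ids in machines_of.items():
        sig = minhash_signature(machine_ids, params)
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(file_id)
    return buckets

# Hypothetical data: f1 and f2 live on the same machines, f3 on disjoint ones.
machines_of = {
    "f1": set(range(0, 100)),
    "f2": set(range(0, 100)),
    "f3": set(range(1000, 1100)),
}
buckets = lsh_buckets(machines_of, bands=10, rows=2)
shared = [sorted(files) for files in buckets.values() if len(files) > 1]
print(shared)  # buckets that hold more than one file
```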

  • Locality-Sensitive Hashing continued

    Effect of the number of bands, b, and the number of MinHash values in each band, r, on co-occurrence
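
    For MinHash banding in general, two files with Jaccard similarity s agree on one band of r values with probability s^r, so they share at least one of b bands with probability 1 - (1 - s^r)^b. The sketch below evaluates that standard formula; the values of b and r are arbitrary examples, not the ones used in the talk.

```python
def candidate_probability(s: float, b: int, r: int) -> float:
    """Probability that two files with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** r) ** b

# More rows per band makes a match stricter; more bands makes it more likely.
for s in (0.2, 0.5, 0.8):
    print(s, round(candidate_probability(s, b=20, r=5), 3))
```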

  • Putting it all together

    Create bipartite graph, with edges from files to buckets.

    If two files are often connected through the buckets, they are more likely to be strongly co-occurring

    Apply belief propagation to the Markov random field we just created.

    With buckets labeled 0.5, bad files 0.01, and good files 0.99

    We now have an idea about which files are good and bad
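
    A simplified sketch of belief propagation on the file-bucket graph. It assumes the slide's values are prior probabilities of a node being good, gives unlabeled files a prior of 0.5, and uses a hypothetical edge potential that favors neighbors sharing a label (`eps` is an invented parameter). It illustrates the general sum-product technique rather than the exact setup from the talk.

```python
def belief_propagation(edges, prior_good, eps=0.1, iterations=10):
    """Loopy sum-product BP with binary states (0 = bad, 1 = good).

    Returns the estimated probability of being good for every node.
    """
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    # Node potentials (P(bad), P(good)); unlabeled nodes default to 0.5.
    phi = {n: (1 - prior_good.get(n, 0.5), prior_good.get(n, 0.5)) for n in nbrs}
    # Edge potential: neighbors are slightly more likely to share a state.
    psi = ((0.5 + eps, 0.5 - eps), (0.5 - eps, 0.5 + eps))
    msg = {(u, v): (0.5, 0.5) for u in nbrs for v in nbrs[u]}
    for _ in range(iterations):
        new = {}
        for (u, v) in msg:
            out = []
            for xv in (0, 1):
                total = 0.0
                for xu in (0, 1):
                    prod = phi[u][xu] * psi[xu][xv]
                    for w in nbrs[u]:
                        if w != v:  # exclude the message that came from v
                            prod *= msg[(w, u)][xu]
                    total += prod
                out.append(total)
            z = out[0] + out[1]
            new[(u, v)] = (out[0] / z, out[1] / z)
        msg = new
    beliefs = {}
    for n in nbrs:
        bad, good = phi[n]
        for w in nbrs[n]:
            bad *= msg[(w, n)][0]
            good *= msg[(w, n)][1]
        beliefs[n] = good / (bad + good)
    return beliefs

# Hypothetical graph: each unknown file shares a bucket with one labeled file.
edges = [
    ("bad_file", "bucket1"), ("unknown1", "bucket1"),
    ("good_file", "bucket2"), ("unknown2", "bucket2"),
]
priors = {"bad_file": 0.01, "good_file": 0.99, "bucket1": 0.5, "bucket2": 0.5}
beliefs = belief_propagation(edges, priors)
print(beliefs["unknown1"], beliefs["unknown2"])
```

    In this toy graph the unknown file that shares a bucket with the bad file ends up below 0.5, and the one that shares a bucket with the good file ends up above it.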

  • Results

    Labeling 1.6 million unlabeled files

  • Results Continued

    Labeling unknown malware

  • Data dump

    2552 benign and 1202 malware samples

    Heap commit default value 4096

    Stack reserve median appears to be set by the MSVC compiler (40000h)

  • Data dump

    Number of sections can be reduced in malware (packers?)

    Malware tends to have more imports

    The datestamp corresponds to dataset (2011)

  • Data dump

    Entropy very similar

    Entry points tend to be closer (packers/encryption/optimisation?)

    Weird section alignment
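
    The entropy feature mentioned above is commonly computed as the Shannon entropy of the byte distribution. A minimal sketch, assuming entropy is measured in bits per byte over the whole input; the talk does not state how its entropy values were computed.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, between 0.0 and 8.0."""
    if not data:
        return 0.0
    n = len(data)
    # Sum p * log2(1/p) over the observed byte values.
    return sum((c / n) * math.log2(n / c) for c in Counter(data).values())

print(shannon_entropy(b"\x00" * 1024))         # constant data: minimal entropy
print(shannon_entropy(bytes(range(256)) * 4))  # uniform byte values: maximal entropy
```

    Packed or encrypted sections typically score close to 8 bits per byte, which is why entropy is often tried as a feature.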

  • Data dump

    Weird file alignment more common

    Lower image versions

    Sometimes odd heap reserves and smaller stack usage

  • Preliminary analysis

    The dataset is from 2011, so the timestamp is of course a strong classifier, but a useless one

  • Modified tree

    Much more sensible

    Stack reserve and image base strong classifiers

  • Modified tree

    15% misclassification rate
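
    A decision tree picks, at each node, the feature threshold that best separates the classes. The sketch below shows one such split chosen by Gini impurity on a single numeric header feature. The feature values and labels are made-up toy data, not values from the dataset, and the talk does not say which split criterion its tree used.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best split 'value <= threshold'."""
    best = (None, float("inf"))
    n = len(values)
    for t in sorted(set(values))[:-1]:  # the largest value would leave one side empty
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

# Made-up toy data: a header field such as stack reserve, with 1 = malware.
feature_values = [0x40000, 0x40000, 0x100000, 0x100000, 0x10000, 0x20000]
is_malware = [0, 0, 0, 0, 1, 1]

threshold, impurity = best_threshold(feature_values, is_malware)
print(hex(threshold), impurity)
```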

  • Code substrings

    Malware uses bugs, sometimes encryption and packing

    Identify pieces of code that do things the malware needs

    and that pieces of malware have in common

    Comparing all pairs is O(n^2) on large datasets

    Subdivide the code sections into segments

    Hash those segments, and compare.

    Use a Bloom filter to filter matches.
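
    A sketch of the segment-and-filter idea above: cut code into fixed-length segments, record the segments of one binary in a Bloom filter, and use the filter to cheaply reject segments of another binary that cannot match. The `BloomFilter` class, the segment length, the step, and the sample bytes are all assumptions for illustration; the exact set stands in for whatever confirmation step follows the filter.

```python
import hashlib

class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions by salting SHA-256 with an index byte.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def segments(code: bytes, length=16, step=4):
    """Yield fixed-length segments starting every `step` bytes."""
    for i in range(0, len(code) - length + 1, step):
        yield code[i:i + length]

def shared_segments(code_a: bytes, code_b: bytes, length=16, step=4):
    """Segments of code_b that also occur in code_a."""
    seen, exact = BloomFilter(), set()
    for seg in segments(code_a, length, step):
        seen.add(seg)
        exact.add(seg)
    # The Bloom filter rejects most non-matches; the exact set confirms the hits.
    return {seg for seg in segments(code_b, length, step)
            if seg in seen and seg in exact}

# Hypothetical code sections that share a 32-byte routine at the same alignment.
common = bytes(range(32))
code_a = b"\xaa" * 8 + common + b"\xbb" * 8
code_b = b"\xcc" * 8 + common + b"\xdd" * 8
print(len(shared_segments(code_a, code_b)))  # number of shared 16-byte segments
```

    Because segments start only every `step` bytes, shared code at a different alignment in the two binaries would be missed by this simple version.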

  • Algorithm outline

  • Building classifier

    Remove subsequences that are common in non-malware

    Count the predictive value of subsequences on the training set

    To classify, walk through a binary, comparing each n-length subsequence to the database

    Using a max length of 10 and a step of 4 bytes, running time is O(10 N log(N))
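
    The classifier steps above can be sketched as follows. Two simplifications are assumptions of this example, not the talk's method: any subsequence that appears in a benign sample is removed (rather than only common ones), and the predictive value of a subsequence is taken to be the fraction of malware training samples that contain it. All byte strings are invented.

```python
def subsequences(code: bytes, n=10, step=4):
    """Yield n-byte subsequences starting every `step` bytes."""
    for i in range(0, len(code) - n + 1, step):
        yield code[i:i + n]

def build_database(malware, benign, n=10, step=4):
    """Map each malware-only subsequence to the fraction of malware samples containing it."""
    benign_seen = {s for sample in benign for s in subsequences(sample, n, step)}
    counts = {}
    for sample in malware:
        for s in set(subsequences(sample, n, step)) - benign_seen:
            counts[s] = counts.get(s, 0) + 1
    return {s: c / len(malware) for s, c in counts.items()}

def score(code: bytes, database, n=10, step=4):
    """Walk the binary and sum the weights of known subsequences."""
    return sum(database.get(s, 0.0) for s in set(subsequences(code, n, step)))

# Hypothetical training data: a shared header plus a malicious payload.
header = b"HDR_HDR_HDR_"
payload = b"0123456789ab"
malware = [header + payload + b"YYYY", header + payload + b"WWWW"]
benign = [header + b"benignbenign" + b"YYYY"]

database = build_database(malware, benign)
unknown = header + payload + b"RRRR"
print(score(unknown, database), score(benign[0], database))
```

    A higher score means more malware-specific subsequences were found; where to place the decision threshold is the trade-off an ROC curve visualizes.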

  • ROC curve

  • ROC Explanation

    False positives increase past the 70% mark

    Undesirable, may damage system

    Ways to circumvent this, such as hashing, are applied in modern virus scanners

    Strong classifier of unknown malware

    Even with a small dataset
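
    For reference, an ROC curve plots the true-positive rate against the false-positive rate as the decision threshold is lowered. A minimal sketch of computing those points; the scores and labels are invented toy values, not results from the talk.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; a sample is flagged when score >= threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Invented classifier scores, with 1 = malware and 0 = benign.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
for fpr, tpr in roc_points(scores, labels):
    print(round(fpr, 2), round(tpr, 2))
```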

  • Precision-recall graph

  • Questions

    Hopefully you enjoyed the presentation.

    If there are any questions about the presentation, now is the time to ask.
