CISC 879 - Machine Learning for Solving Systems Problems. Presented by: Ashwani Rao, Dept of Computer & Information Sciences, University of Delaware. Learning to Detect and Identify Malicious Executables in the Wild, by J. Zico Kolter and Marcus A. Maloof

Post on 05-Jan-2016

CISC 879 - Machine Learning for Solving Systems Problems

Presented by: Ashwani Rao

Dept of Computer & Information Sciences

University of Delaware

Learning to Detect and Identify Malicious Executables in the Wild

J. Zico Kolter

Marcus A. Maloof

Introduction

• Machine learning and data mining to identify malicious code

• What is malicious code?

• Why not antivirus suites?

• Training set: 1971 benign and 1651 malicious executables

• Features extracted: byte-code n-grams; malicious executables are also classified by the function of their payload

• Learning algorithms: naïve Bayes, SVMs, decision trees, and boosting

Goals of the Research Paper

• Show how established methods can be used to detect and classify malicious executables

• Present empirical results from an extensive study of inductive methods for detection and classification

• Show that the methods achieve high detection rates on new, previously unseen executables

Related Work

• Lo et al., 1995; Kephart et al., 1995; Tesauro et al., 1996; Schultz et al., 2001

• Lo et al., 1995: analysis of several programs

• Schultz et al., 2001: used data mining to detect malicious executables, with three kinds of features:

• Binary profiling (Ripper learning)

• String sequences (naïve Bayes)

• Hex dumps (six naïve Bayes classifiers)

Data Collection and Classification Methods

• 1971 benign and 1651 malicious executables in Windows PE format

• N-grams: combine each four-byte sequence into a single term. For example, for the byte sequence ff 00 ab 3e 12 b3, the corresponding n-grams are ff00ab3e, 00ab3e12, ab3e12b3, etc.

• Each n-gram is treated as an attribute

• The most relevant attributes (n-grams) are selected using information gain, also called average mutual information. The 500 most relevant n-grams were kept.
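
The n-gram extraction and information-gain ranking described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (byte_ngrams, information_gain, top_ngrams) are hypothetical, each n-gram is assumed to be a boolean attribute (present or absent in a file), and information gain is computed as the class entropy minus the class entropy conditioned on the attribute, which equals the mutual information between the two.

```python
import math
from collections import Counter


def byte_ngrams(data, n=4):
    """Return the distinct overlapping n-byte sequences in data as hex strings."""
    return {data[i:i + n].hex() for i in range(len(data) - n + 1)}


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(present, labels):
    """Mutual information between a boolean attribute and the class label."""
    n = len(labels)
    h_class = entropy(Counter(labels).values())
    h_conditional = 0.0
    for value in (True, False):
        subset = [label for p, label in zip(present, labels) if p == value]
        if subset:
            h_conditional += len(subset) / n * entropy(Counter(subset).values())
    return h_class - h_conditional


def top_ngrams(files, labels, n=4, k=500):
    """Rank every n-gram seen in files by information gain and keep the top k."""
    grams_per_file = [byte_ngrams(f, n) for f in files]
    vocabulary = set().union(*grams_per_file)
    scored = {
        g: information_gain([g in grams for grams in grams_per_file], labels)
        for g in vocabulary
    }
    # Sort by descending gain; break ties alphabetically for reproducibility.
    return sorted(scored, key=lambda g: (-scored[g], g))[:k]


# The example from the slide: six bytes yield three overlapping 4-grams.
print(sorted(byte_ngrams(bytes.fromhex("ff00ab3e12b3"))))
```

Holding every distinct n-gram in memory, as this sketch does, would not scale to the hundreds of millions of n-grams reported for the large collection; it is meant only to make the two steps concrete.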

Classification methods

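• Instance-based learner: the concept description is the collection of training examples itself

• Naive Bayes: probabilistic model based on the prior probability of each class, P(Ci), and the conditional probability of each attribute value given the class, P(Vj | Ci)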
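
As a rough illustration of the naive Bayes model above, the sketch below estimates P(Ci) and P(Vj | Ci) from boolean attribute vectors and classifies by the largest log-posterior. The function names and the add-one (Laplace) smoothing are assumptions made for this example and are not taken from the slides.

```python
import math
from collections import Counter


def train_naive_bayes(examples, labels):
    """Estimate P(C) and P(attribute j is True | C) from boolean vectors."""
    class_counts = Counter(labels)
    n_attrs = len(examples[0])
    # true_counts[c][j] = number of class-c examples where attribute j is True
    true_counts = {c: [0] * n_attrs for c in class_counts}
    for x, c in zip(examples, labels):
        for j, value in enumerate(x):
            if value:
                true_counts[c][j] += 1
    priors = {c: class_counts[c] / len(labels) for c in class_counts}
    # Add-one smoothing (an assumption here) keeps probabilities away from 0 and 1.
    cond = {
        c: [(true_counts[c][j] + 1) / (class_counts[c] + 2) for j in range(n_attrs)]
        for c in class_counts
    }
    return priors, cond


def predict_naive_bayes(model, x):
    """Return the class with the highest log-posterior for boolean vector x."""
    priors, cond = model
    scores = {}
    for c in priors:
        log_p = math.log(priors[c])
        for j, value in enumerate(x):
            p_true = cond[c][j]
            log_p += math.log(p_true if value else 1.0 - p_true)
        scores[c] = log_p
    return max(scores, key=scores.get)


# Toy data: attribute 0 is present only in the malicious examples.
X = [(True, False), (True, True), (False, True), (False, False)]
y = ["malicious", "malicious", "benign", "benign"]
model = train_naive_bayes(X, y)
print(predict_naive_bayes(model, (True, False)))
```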

Classification methods

• Support vector machines: a vector of weights w and a threshold b. A kernel function maps the training data into a higher-dimensional space so that the problem becomes linearly separable.

• Decision trees: internal nodes correspond to attributes and leaf nodes correspond to class labels.

• Boosted classifiers: a method for combining multiple classifiers. Boosting produces a set of weighted models by iteratively learning a model from a weighted data set, evaluating it, and reweighting the data set based on the model's performance.
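
The learn, evaluate, reweight loop described for boosting can be sketched with a small discrete AdaBoost over one-attribute decision stumps. The slides do not say which boosting variant or base learner configuration was used, so the variant, the stump learner, the +1/-1 label encoding, and all function names here are assumptions chosen for illustration.

```python
import math


def stump_output(x, j, polarity):
    """A stump predicts `polarity` when attribute j is True, else the opposite."""
    return polarity if x[j] else -polarity


def best_stump(X, y, w):
    """Return (weighted error, attribute, polarity) of the lowest-error stump."""
    best = None
    for j in range(len(X[0])):
        for polarity in (1, -1):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if stump_output(xi, j, polarity) != yi)
            if best is None or err < best[0]:
                best = (err, j, polarity)
    return best


def adaboost(X, y, rounds=10):
    """Learn a weighted set of stumps; labels in y must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n                      # start from a uniformly weighted data set
    ensemble = []
    for _ in range(rounds):
        err, j, polarity = best_stump(X, y, w)
        if err >= 0.5:                     # no stump beats chance on these weights
            break
        err = max(err, 1e-10)              # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1.0 - err) / err)   # model weight from its performance
        ensemble.append((alpha, j, polarity))
        # Reweight: misclassified examples gain weight, correct ones lose weight.
        w = [wi * math.exp(-alpha * yi * stump_output(xi, j, polarity))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict_boosted(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(alpha * stump_output(x, j, polarity) for alpha, j, polarity in ensemble)
    return 1 if score >= 0 else -1
```

On a toy target such as "at least two of three attributes are present", no single stump is correct on every example, but a few boosting rounds over reweighted data can combine stumps into a vote that is.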

Detecting malicious code using n-grams

• Used ten-fold cross-validation

• Pilot study: determined the size of the n-grams and the number of relevant n-grams to use. Used n-grams with n = 4 and chose the number of n-grams by information gain; the 500 most relevant n-grams produced the best results.

• Experiment with small collection: a small collection of executables with a total of 68,744,909 n-grams

• Experiment with large collection: 255 million distinct n-grams of size 4
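
Ten-fold cross-validation, as used in these experiments, can be sketched as a generator of train/test index splits. The function name k_fold_indices, the fixed random seed, and the round-robin assignment of shuffled indices to folds are assumptions for this example; the slides do not describe how the partitions were drawn.

```python
import random


def k_fold_indices(n_examples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k disjoint folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Each example is used for testing exactly once and for training k - 1 times.
for train, test in k_fold_indices(25, k=10):
    print(len(train), len(test))
```

A classifier would be trained on the examples at train and evaluated on those at test in each round, and the per-fold results then averaged or pooled.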

Results of Small Collection

• ROC curve for detecting malicious executables in the small collection
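
For readers unfamiliar with ROC curves, the sketch below shows one common way to compute the points of such a curve from classifier scores: sweep a threshold from high to low and record the false-positive and true-positive rates. The function names are hypothetical, the input is assumed to contain at least one malicious and one benign example, and the area under the curve is included only as a commonly used summary, not as a result from the slides.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points.

    scores: higher means "more likely malicious"; labels: True for malicious.
    Assumes at least one positive and one negative example.
    """
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, is_positive) in enumerate(ranked):
        if is_positive:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all examples tied at this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


pts = roc_points([0.9, 0.8, 0.7, 0.6], [True, False, True, False])
print(pts, area_under_curve(pts))
```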

Results of Large Collection

• ROC curve for detecting malicious executables in the large collection

Classifying Executables by Payload Function

• Extent to which classification methods could determine whether a given malicious executable opened a backdoor, mass-mailed, or was an executable virus

• Identify and enumerate the functions of payloads

• Many executables fell into multiple categories

• Experimental design similar to the previous one, but the data set for each function is built from malicious executables only

• Used ten-fold cross-validation

Experimental Results

• ROC curve for mass mailing capabilities

Experimental Results

• ROC Curve for backdoor entries

Evaluating Real-World Online Performance

• Applied the method to 291 real-world malicious executables discovered after the original data were gathered

• Classifiers were built from the original data, covering both benign and malicious code

• Boosted decision trees detected 98% of the new malicious code

Conclusion and Future Work

• Machine learning and data mining are useful and appropriate tools for detecting malware

• Boosted classifiers and support vector machines performed exceptionally well

• Boosting reduces bias and variance, and boosted classifiers outperformed the other classifiers in the study

• This approach is scalable

• 20-25% of the executables were obfuscated using compression or encryption

• For the payload-function experiments: remove the obfuscation and rerun the experiments with a larger data set

Conclusion and Future Work

• Study the similarity of malicious code and how such executables change over time; clustering can provide good insight into this

• Combining this approach with a search for known signatures and with executing and analyzing code in a virtual machine will provide better computer security

Q&A ?