applied machine learning in software security · applied machine learning in software security ......

32
1 Applied Machine Learning in Software Security August 10, 2017 © 2016 Carnegie Mellon University Approved for Public Release; Distribution is Unlimited Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 © 2016 Carnegie Mellon University Approved for Public Release; Distribution is Unlimited Applied Machine Learning in Software Security Eliezer Kanal

Upload: lyhuong

Post on 17-Jun-2018

233 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

1Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Software Engineering InstituteCarnegie Mellon UniversityPittsburgh, PA 15213

© 2016 Carnegie Mellon UniversityApproved for Public Release; Distribution is Unlimited

Applied Machine Learning in Software SecurityEliezer Kanal

Page 2: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

2Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Copyright 2017 Carnegie Mellon University

This material is based upon work funded and supported by the Department of Defense under Contract No. FA8721-05-C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center.

NO WARRANTY. THIS CARNEGIE MELLON UNIVERSITY AND SOFTWARE ENGINEERING INSTITUTE MATERIAL IS FURNISHED ON AN “AS-IS” BASIS. CARNEGIE MELLON UNIVERSITY MAKES NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED, AS TO ANY MATTER INCLUDING, BUT NOT LIMITED TO, WARRANTY OF FITNESS FOR PURPOSE OR MERCHANTABILITY, EXCLUSIVITY, OR RESULTS OBTAINED FROM USE OF THE MATERIAL. CARNEGIE MELLON UNIVERSITY DOES NOT MAKE ANY WARRANTY OF ANY KIND WITH RESPECT TO FREEDOM FROM PATENT, TRADEMARK, OR COPYRIGHT INFRINGEMENT.

[Distribution Statement A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

This material may be reproduced in its entirety, without modification, and freely distributed in written or electronic form without requesting formal permission. Permission is required for any other use. Requests for permission should be directed to the Software Engineering Institute at [email protected].

Carnegie Mellon® and CERT® are registered marks of Carnegie Mellon University.

DM-0004563

Page 3: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

3Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Tom Mitchell, former CMU Machine Learning department chair:

The field of Machine Learning asks the question, “How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?”

Machine Learning seeks to automate data analysis and inference.

What is Machine Learning?

Page 4: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

4Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

What is Machine Learning?

If your problem can be stated as either of the following:

…you would likely benefit from machine learning.

Iwouldliketouse_____datatopredict_____.Iwouldliketouse____data

toguesswhat____is.

Page 5: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

5Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

What is Machine Learning?

Sample Techniques:

• Regression

• K-Means Clustering

Clustering image: Weston.pace, https://commons.wikimedia.org/wiki/File:K_Means_Example_Step_4.svg

Page 6: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

6Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

What is Machine Learning?

Feature Engineering:Using existing data to create more informative data

DataTypes

Image StaticVideo

Timeseries FinancialdataEvent counts

Structuredtext WebformsStructureddata

(JSON,XML)Sourcecode

Freetext NewsTweetsEmail

manymore…

Page 7: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

7Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

What is Machine Learning?

Examples:

o I would like to use incident ticket data to predictcustomer needs .

o I would like to use publicly available code to predict what code I will write .

o I would like to use bug report data to guess the location of undetected bugs in my code .

Page 8: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

8Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Page 9: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

9Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

What is Machine Learning?

Examples:

o I would like to use incident ticket data to predictcustomer needs .

o I would like to use publicly available code to predict what code I will write .

o I would like to use bug report data to guess the location of undetected bugs in my code .

Page 10: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

10Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Applied ML:Vulnerability Detection

Page 11: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

11Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Applied ML: Vulnerability Detection

Analyzer

Analyzer

Analyzer

Codebases

Alerts Today3,147

11,772

48,690

0

10,000

20,000

30,000

40,000

50,000

60,000

TP FP Susp

Manyalertsleftunaudited!

Page 12: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

12Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Applied ML: Vulnerability Detection

Analyzer

Analyzer

Analyzer

Codebases

Alerts Today

OurGoal

3,147

11,772

48,690

0

10,000

20,000

30,000

40,000

50,000

60,000

TP FP Susp

66effortdays

12,076

45,172

6,361

0

5,000

10,000

15,000

20,000

25,000

30,000

35,000

40,000

45,000

50,000

e-TP e-FP I

AutomatedStatisticalClassifier

• ExpectedTruePositive(e-TP)• ExpectedFalsePositive(e-FP)• Indeterminate(I)

Page 13: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

13Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Applied ML: Vulnerability Detection

ClassifiersLassoLogisticRegressionCARTRandomForestExtremeGradientBoosting(XGBoost)

Some ofthefeatures usedAnalysistoolsused Tokensinfunc/methodSignificantLOC Alertsinfunc/methodComplexity AlertsinfileCoupling MethodsinfileCohesion SLOCinfileSEIcodingrule Avg TokensFunction/methodlength Avg SLOCSLOCinfunc/method Depthincoderepository#parametersinfunc/meth.

Cyclomatic complexity(func/meth)

Page 14: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

14Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Applied ML: Vulnerability Detection

Significant improvement!

• 91% Classifier accuracy overall

• Specific rule accuracy at right

• 10x developer time saved!

RuleID LassoLRRandomForest CART XGBoost

INT31-C 98% 97% 98% 97%EXP01-J 74% 74% 81% 74%OBJ03-J 73% 86% 86% 83%FIO04-J* 80% 80% 90% 80%EXP33-C* 83% 87% 83% 83%EXP34-C* 67% 72% 79% 72%DCL36-C* 100% 100% 100% 100%ERR08-J* 99% 100% 100% 100%IDS00-J* 96% 96% 96% 96%ERR01-J* 100% 100% 100% 100%ERR09-J* 100% 88% 88% 88%

* Small quantity of data

RuleID LassoLRRandomForest CART XGBoost

INT31-C 98% 97% 98% 97%EXP01-J 74% 74% 81% 74%OBJ03-J 73% 86% 86% 83%FIO04-J* 80% 80% 90% 80%EXP33-C* 83% 87% 83% 83%EXP34-C* 67% 72% 79% 72%DCL36-C* 100% 100% 100% 100%ERR08-J* 99% 100% 100% 100%IDS00-J* 96% 96% 96% 96%ERR01-J* 100% 100% 100% 100%ERR09-J* 100% 88% 88% 88%

Page 15: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

15Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Applied ML:Malware family classification

Page 16: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

16Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Applied ML: Malware family classification

ReverseEngineering

Discovery

Refinement

Reflection

File

NewFamily

ArtifactCatalog

Signature 1Signature 2Signature 3

Files 1a 1b 1c 1d …Files 2a 2b 2c 2d …Files 3a 3b 3c 3d …

Whichfileshang

together?

Page 17: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

17Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Applied ML: Malware family classification

SignalFlowgraphhighlightsbehaviorrelatingdifferent

malwarefamilies

Programinstructionanalysisshowssimilarityanddiversionofbehavior

StaticAnalysisidentifiesprogramswithsimilarsourcecode

Page 18: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

18Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Applied ML: Malware family classification

Simplify visualization of extremely complex data through the use of dimensionality reduction and associated visualization techniques

Chernoff faceexperiment

t-SNE

Page 19: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

19Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Applied ML: Malware family classification

• Ground Truth: SVM trained with expert ground truth labels.

• Turkers Avg: Classifier trained with layperson labels.

Performance surprisingly similar!

Page 20: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

20Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Applied ML:Software cost estimation

RobertFerguson,DennisGoldenson,JamesMcCurley,RobertW.Stoddard,DavidZubrow,DebraAnderson.“QuantifyingUncertaintyinEarlyLifecycleCostEstimation(QUELCE)”.Dec2011.http://resources.sei.cmu.edu/library/asset-view.cfm?assetid=10039

Page 21: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

21Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Applied ML: Software cost estimation

GeneralAccountingOffice.DefenseAcquisitions:AKnowledge-BasedFundingApproachCouldImproveMajorWeaponSystemProgramOutcomes.ReporttotheCommitteeonArmedServices,U.S.Senate,July2008,GAO-08-619.

Page 22: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

22Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Applied ML: Software cost estimation

Page 23: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

23Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Applied ML: Software cost estimation

Page 24: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

24Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Applied ML: Software cost estimation

Page 25: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

25Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Applied ML:Incident report mapping

Page 26: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

26Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Applied ML: Incident report mapping

International partners

Agency infrastructure

Threat actors

Phishing campaign

Agency response

teams US-CERT infrastructure

Malware class

Compromised website

Common reference websites

Observation

Landscape Tickets

Inference

Page 27: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

27Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Applied ML: Incident report mapping

Page 28: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

28Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Indicators across ticketsIndicatorsoccurwithdiversepatternsacrosstickets,reportersandtime.Timeonxaxis,countonyaxis,colorcodedbyreporter.

MaliciousIP

AgencyIP

US-CERTdomain

Page 29: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

29Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Similarity of indicatorsBeginningwithareferenceindicator,wefindindicatorssimilartoit.Example:amaliciousIP

• Coloredcirclesaretickets• Greycirclesareindicators• Largeindicatorsnear

centerofcirclehavesimilaroccurrencepatternstothereferenceindicator.

Page 30: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

30Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Indicator communitiesButwhatifwearen’tstartingwithareferenceindicator?Weassumethatindicatorsgeneratedbyacoherentrealworldprocesswillbemorelikelytoco-occurinticketsthanarbitrarypairsofindicators.Findgroupsofhighlysimilarindicatorsincompleteindicator-ticketgraph.

Page 31: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

31Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Indicator-ticket graph

Asubsetoftheticket-indicatorgraph(forasmallsetofselectedindicators)• Ticketsaregreytriangles• Indicatorsareblackcircles• Edgesconnectticketstotheindicatorsthey

contain

Page 32: Applied Machine Learning in Software Security · Applied Machine Learning in Software Security ... “How can we build computer ... you would likely benefit from machine learning

32Applied Machine Learning in Software SecurityAugust 10, 2017© 2016 Carnegie Mellon University

Approved for Public Release; Distribution is Unlimited

Contact Information

Eliezer KanalTechnical ManagerTelephone: +1 412.268.5204Email: [email protected]