Approximating Attack Surfaces with Stack Traces [ICSE '15]
TRANSCRIPT
Christopher Theisen†, Kim Herzig‡, Patrick Morrison†, Brendan Murphy‡, Laurie Williams†
†North Carolina State University, ‡Microsoft Research, Cambridge UK
Approximating Attack Surfaces with Stack Traces
Introduction | Methodology | Results and Discussion | Future Work | Conclusion 1/17
Before we start…
What is the “Attack Surface” of a system?
Example of an early attack surface approximation – Manadhata et al. [2]:
Only covers API entry points
…easy to say, hard to define (practically).
The (OWASP) Attack Surface of an application is: [1]
1. …paths into and out of the application
2. the code that protects these paths
3. all valuable data used in the application
4. the code that protects data
[1] OWASP, "Attack Surface Analysis Cheat Sheet." https://www.owasp.org/index.php?title=Attack_Surface_Analysis_Cheat_Sheet&oldid=156006
[2] P. Manadhata, J. Wing, M. Flynn, and M. McQueen, "Measuring the attack surfaces of two FTP daemons," in Proceedings of the 2nd ACM Workshop on Quality of Protection, 2006, pp. 3-10.
Our goal is to aid software engineers in prioritizing security efforts by approximating the attack surface of a system via stack trace analysis.
Proposed Solution
Stack traces represent user activity that puts the system under stress
There’s a defect of some sort; does it have security implications?
Stack traces may localize security flaws
Crashes caused by user activity: bad input that was handled improperly, et cetera
Crashes are a DoS attack by definition; you brought the service or system down!
Hardware crashes are excluded
Research Questions
RQ1: How effectively can stack traces be used to approximate the attack surface of a system?
RQ2: Can the performance of vulnerability prediction be improved by limiting the prediction space to the approximated attack surface?
Overview
Catalog all code that appears on stack traces
Data Sources
[4] "Description of the Dr. Watson for Windows," Microsoft Corporation, [Online]. Available: http://support.microsoft.com/kb/308538/en-us.
Attack Surface Construction (RQ1)
Crashes provide: data source, crash ID, binary [4,000+], filename [100,000+], function [10,000,000+]
foo!foobarDeviceQueueRequest+0x68
foo!fooDeviceSetup+0x72
foo!fooAllDone+0xA8
bar!barDeviceQueueRequest+0xB6
bar!barDeviceSetup+0x08
bar!barAllDone+0xFF
center!processAction+0x1034
center!dontDoAnything+0x1030
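A crash stack like the one above can be cataloged with a short script. A minimal sketch, assuming the Windows-style `module!function+0xoffset` frame format shown; this is an illustration, not the tooling used in the study:

```python
import re

# Windows-style stack frame: module!function+0xoffset
FRAME_RE = re.compile(r"^(?P<binary>[^!]+)!(?P<function>[^+]+)\+0x[0-9A-Fa-f]+$")

def attack_surface(stack_traces):
    """Collect every (binary, function) pair seen on any stack trace."""
    surface = set()
    for trace in stack_traces:
        for frame in trace:
            match = FRAME_RE.match(frame.strip())
            if match:
                surface.add((match.group("binary"), match.group("function")))
    return surface

trace = [
    "foo!foobarDeviceQueueRequest+0x68",
    "foo!fooDeviceSetup+0x72",
    "bar!barAllDone+0xFF",
]
surface = attack_surface([trace])
```

Discarding the instruction offset and keeping the (binary, function) pair is what lets many distinct crashes map onto one catalog of code entities.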
Results (RQ1)
                    Fuzzing   User-Induced Crashes
% binaries            0.9%         48.4%
% vulnerabilities    14.9%         94.6%
Microsoft targets fuzzing towards high-risk modules
We are covering the majority of vulnerabilities seen!
Targeting different crashes gets different results
Prediction Models (RQ2)
Zimmermann et al. [3]: "We believe that the key for [improving prediction] is by: (1) developing new prediction techniques that deal with the 'needle in the haystack' problem and (2) finding new metrics that deal with the unique characteristics of vulnerabilities and attacks."
Stack traces point to where flawed code lives!
[3] T. Zimmermann, N. Nagappan and L. Williams, "Searching for a Needle in a Haystack: Predicting Security Vulnerabilities for Windows Vista," in Software Testing, Verification and Validation (ICST), 2010 Third International Conference on, 2010
Prediction Model Construction (RQ2)
Replicated the VPM from the Windows Vista study [3]
Run the VPM with all files considered as possibly vulnerable
Repeat, but remove code not found on stack traces
Vulnerability Prediction Model (VPM)
29 metrics in 6 categories: Churn, Dependency, Legacy, Size, Defects, Pre-release vulnerabilities
CODEMINE data [5]
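The two-pass setup above – run the VPM over all files, then repeat with the prediction space restricted to stack-trace files – can be sketched as follows. The file names, metric values, and `on_stack_trace` flag are hypothetical, not the paper's CODEMINE schema:

```python
# Each file maps to its metrics plus whether it appeared on any stack trace.
# All names and values here are illustrative.
files = {
    "ntfs.sys": {"churn": 120, "size": 9000, "on_stack_trace": True},
    "foo.dll":  {"churn": 3,   "size": 400,  "on_stack_trace": False},
    "bar.dll":  {"churn": 55,  "size": 2100, "on_stack_trace": True},
}

def prediction_space(files, attack_surface_only):
    """Pass 1: every file is a candidate; pass 2: only stack-trace files."""
    if not attack_surface_only:
        return dict(files)
    return {name: m for name, m in files.items() if m["on_stack_trace"]}

all_files = prediction_space(files, attack_surface_only=False)
surface_files = prediction_space(files, attack_surface_only=True)
```

Shrinking the candidate set this way attacks the class imbalance directly: the classifier sees fewer never-vulnerable files diluting its training and prediction space.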
[5] J. Czerwonka, N. Nagappan, W. Schulte and B. Murphy, "CODEMINE: Building a Software Development Data Analytics Platform at Microsoft," Software, IEEE, vol. 30, no. 4, pp. 64--71, 2013.
Results (RQ2)
Comparing the VPM run on all files vs. just attack surface files…
Precision improved from 0.5 to 0.69
Recall improved from 0.02 to 0.05
A statistically significant improvement, but not a practical one.
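Both numbers fall straight out of the confusion counts. A minimal sketch with made-up counts chosen so the ratios land near the reported attack-surface run (0.69 precision, 0.05 recall); these are not the study's actual counts:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical counts: a predictor that flags few files, mostly correctly,
# but misses most vulnerable files -> high precision, very low recall.
p, r = precision_recall(tp=9, fp=4, fn=171)
```

A recall of 0.05 means roughly 19 of every 20 vulnerable files go unflagged, which is why the improvement is statistically real but practically unsatisfying.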
Problems with Precision [6]
Are low precision predictors unsatisfactory? Recall and precision like to compete, especially on highly imbalanced datasets.
No. Low precision is fine in several situations:
When the cost of missing the target is prohibitively expensive.
When only a small fraction [of] the data is returned.
When there is little or no cost in checking false alarms.
This seems appropriate for security flaws!
[6] Tim Menzies, Alex Dekhtyar, Justin Distefano, and Jeremy Greenwald. 2007. Problems with Precision: A Response to "Comments on 'Data Mining Static Code Attributes to Learn Defect Predictors'". IEEE Trans. Softw. Eng. 33, 9 (September 2007)
Lessons Learned - Visualizations
[Figure: visualization of "source" files vs. destination files]
Limitations
Stack traces are a good metric for Windows 8…
Different levels of granularity? (File/Function)
Smaller projects? Open source?
Not operating systems?
Results don’t necessarily generalize
Other learners?
Oversampling and Undersampling?
What else can we do with VPMs?
Future Work
What else can we do with stack traces?
Frequency of appearance
Dependencies, not the entities themselves
How many stack traces are required?
Sliding window; how does the approximation change over time?
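The first idea above, frequency of appearance, is easy to prototype: count how often each code entity shows up across crashes. A hedged sketch, not the authors' implementation; frame strings follow the illustrative `module!function+0xoffset` format:

```python
from collections import Counter

def frame_frequencies(stack_traces):
    """Count appearances of each binary!function frame across all traces,
    ignoring the instruction offset."""
    counts = Counter()
    for trace in stack_traces:
        counts.update(frame.split("+", 1)[0] for frame in trace)
    return counts

traces = [
    ["foo!fooDeviceSetup+0x72", "bar!barAllDone+0xFF"],
    ["foo!fooDeviceSetup+0x10", "center!processAction+0x1034"],
]
freq = frame_frequencies(traces)
```

Functions that recur across many independent crashes would be natural candidates for heavier weighting when prioritizing review effort.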
Additional Metrics
Tool Development
Visualization plugin for IDEs …does it actually help?
foo!foobarDeviceQueueRequest+0x68
foo!fooDeviceSetup+0x72
foo!fooAllDone+0xA8
bar!barDeviceQueueRequest+0xB6
bar!barDeviceSetup+0x08
bar!barAllDone+0xFF
center!processAction+0x1034
center!dontDoAnything+0x1030
Conclusion