Approximating Attack Surfaces with Stack Traces [ICSE '15]
TRANSCRIPT
Christopher Theisen†, Kim Herzig‡, Patrick Morrison†, Brendan Murphy‡, Laurie Williams†
†North Carolina State University, ‡Microsoft Research, Cambridge UK
Approximating Attack Surfaces with Stack Traces
Introduction | Methodology | Results and Discussion | Future Work | Conclusion 1/17
Before we start…
What is the “Attack Surface” of a system?
Example of an early attack surface approximation – Manadhata et al. [2]:
Only covers API entry points
…easy to say, hard to define (practically).
The (OWASP) Attack Surface of an application is: [1]
1. …paths into and out of the application
2. the code that protects these paths
3. all valuable data used in the application
4. the code that protects data
[1] OWASP, "Attack Surface Analysis Cheat Sheet." https://www.owasp.org/index.php?title=Attack_Surface_Analysis_Cheat_Sheet&oldid=156006
[2] P. Manadhata, J. Wing, M. Flynn, and M. McQueen, "Measuring the attack surfaces of two FTP daemons," in Proceedings of the 2nd ACM Workshop on Quality of Protection, 2006, pp. 3-10.
Our goal is to aid software engineers in prioritizing security efforts by approximating the attack surface of a system via stack trace analysis.
Proposed Solution
Stack traces represent user activity that puts the system under stress
There’s a defect of some sort; does it have security implications?
Stack traces may localize security flaws
Crashes caused by user activity: bad input that was handled improperly, et cetera
Crashes are a DoS attack by definition; you brought the service or system down!
Hardware crashes are excluded
Research Questions
RQ1: How effectively can stack traces be used to approximate the attack surface of a system?
RQ2: Can the performance of vulnerability prediction be improved by limiting the prediction space to the approximated attack surface?
Overview
Catalog all code that appears on stack traces
Data Sources
[4] "Description of the Dr. Watson for Windows," Microsoft Corporation, [Online]. Available: http://support.microsoft.com/kb/308538/en-us.
Attack Surface Construction (RQ1)
Crashes provide: data source, crash ID, binary [4,000+], filename [100,000+], function [10,000,000+]
foo!foobarDeviceQueueRequest+0x68
foo!fooDeviceSetup+0x72
foo!fooAllDone+0xA8
bar!barDeviceQueueRequest+0xB6
bar!barDeviceSetup+0x08
bar!barAllDone+0xFF
center!processAction+0x1034
center!dontDoAnything+0x1030
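A crash stack like the one above can be cataloged with a short script. A minimal sketch, assuming the Windows-style `module!function+0xoffset` frame format shown; this is an illustration, not the tooling used in the study:

```python
import re

# Windows-style stack frame: module!function+0xoffset
FRAME_RE = re.compile(r"^(?P<binary>[^!]+)!(?P<function>[^+]+)\+0x[0-9A-Fa-f]+$")

def attack_surface(stack_traces):
    """Collect every (binary, function) pair seen on any stack trace."""
    surface = set()
    for trace in stack_traces:
        for frame in trace:
            match = FRAME_RE.match(frame.strip())
            if match:
                surface.add((match.group("binary"), match.group("function")))
    return surface

trace = [
    "foo!foobarDeviceQueueRequest+0x68",
    "foo!fooDeviceSetup+0x72",
    "bar!barAllDone+0xFF",
]
surface = attack_surface([trace])
```

Discarding the instruction offset and keeping the (binary, function) pair is what lets many distinct crashes map onto one catalog of code entities.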
Results (RQ1)
                    Fuzzing   User-Induced Crashes
% binaries            0.9%         48.4%
% vulnerabilities    14.9%         94.6%
Microsoft targets fuzzing towards high-risk modules
We are covering the majority of vulnerabilities seen!
Targeting different crashes gets different results
Prediction Models (RQ2)
Zimmermann et al. [3]: "We believe that the key for [improving prediction] is by: (1) developing new prediction techniques that deal with the 'needle in the haystack' problem and (2) finding new metrics that deal with the unique characteristics of vulnerabilities and attacks."
Stack traces point to where flawed code lives!
[3] T. Zimmermann, N. Nagappan and L. Williams, "Searching for a Needle in a Haystack: Predicting Security Vulnerabilities for Windows Vista," in Software Testing, Verification and Validation (ICST), 2010 Third International Conference on, 2010
Prediction Model Construction (RQ2)
Replicated the VPM from the Windows Vista study [3]
Run the VPM with all files considered as possibly vulnerable
Repeat, but remove code not found on stack traces
Vulnerability Prediction Model (VPM)
29 metrics in 6 categories: Churn, Dependency, Legacy, Size, Defects, Pre-release vulnerabilities
CODEMINE data [5]
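The two-pass setup above – run the VPM over all files, then repeat with the prediction space restricted to stack-trace files – can be sketched as follows. The file names, metric values, and `on_stack_trace` flag are hypothetical, not the paper's CODEMINE schema:

```python
# Each file maps to its metrics plus whether it appeared on any stack trace.
# All names and values here are illustrative.
files = {
    "ntfs.sys": {"churn": 120, "size": 9000, "on_stack_trace": True},
    "foo.dll":  {"churn": 3,   "size": 400,  "on_stack_trace": False},
    "bar.dll":  {"churn": 55,  "size": 2100, "on_stack_trace": True},
}

def prediction_space(files, attack_surface_only):
    """Pass 1: every file is a candidate; pass 2: only stack-trace files."""
    if not attack_surface_only:
        return dict(files)
    return {name: m for name, m in files.items() if m["on_stack_trace"]}

all_files = prediction_space(files, attack_surface_only=False)
surface_files = prediction_space(files, attack_surface_only=True)
```

Shrinking the candidate set this way attacks the class imbalance directly: the classifier sees fewer never-vulnerable files diluting its training and prediction space.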
[5] J. Czerwonka, N. Nagappan, W. Schulte and B. Murphy, "CODEMINE: Building a Software Development Data Analytics Platform at Microsoft," Software, IEEE, vol. 30, no. 4, pp. 64--71, 2013.
Results (RQ2)
Comparing the VPM run on all files vs. just attack surface files…
Precision improved from 0.5 to 0.69
Recall improved from 0.02 to 0.05
A statistically significant improvement, but not a practical one.
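Both numbers fall straight out of the confusion counts. A minimal sketch with made-up counts chosen so the ratios land near the reported attack-surface run (0.69 precision, 0.05 recall); these are not the study's actual counts:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical counts: a predictor that flags few files, mostly correctly,
# but misses most vulnerable files -> high precision, very low recall.
p, r = precision_recall(tp=9, fp=4, fn=171)
```

A recall of 0.05 means roughly 19 of every 20 vulnerable files go unflagged, which is why the improvement is statistically real but practically unsatisfying.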
Problems with Precision [6]
Are low precision predictors unsatisfactory? Recall and precision like to compete, especially on highly imbalanced datasets.
No. Low precision is fine in several situations:
When the cost of missing the target is prohibitively expensive.
When only a small fraction [of] the data is returned.
When there is little or no cost in checking false alarms.
This seems appropriate for security flaws!
[6] Tim Menzies, Alex Dekhtyar, Justin Distefano, and Jeremy Greenwald. 2007. Problems with Precision: A Response to "Comments on 'Data Mining Static Code Attributes to Learn Defect Predictors'". IEEE Trans. Softw. Eng. 33, 9 (September 2007)
Lessons Learned - Visualizations
[Figure: visualization of "source" files vs. destination files]
Limitations
Stack traces are a good metric for Windows 8…
Different levels of granularity? (File/Function)
Smaller projects? Open source?
Not operating systems?
Results don’t necessarily generalize
Other learners?
Oversampling and Undersampling?
What else can we do with VPMs?
Future Work
What else can we do with stack traces?
Frequency of appearance
Dependencies, not the entities themselves
How many stack traces are required?
Sliding window; how does the approximation change over time?
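The first idea above, frequency of appearance, is easy to prototype: count how often each code entity shows up across crashes. A hedged sketch, not the authors' implementation; frame strings follow the illustrative `module!function+0xoffset` format:

```python
from collections import Counter

def frame_frequencies(stack_traces):
    """Count appearances of each binary!function frame across all traces,
    ignoring the instruction offset."""
    counts = Counter()
    for trace in stack_traces:
        counts.update(frame.split("+", 1)[0] for frame in trace)
    return counts

traces = [
    ["foo!fooDeviceSetup+0x72", "bar!barAllDone+0xFF"],
    ["foo!fooDeviceSetup+0x10", "center!processAction+0x1034"],
]
freq = frame_frequencies(traces)
```

Functions that recur across many independent crashes would be natural candidates for heavier weighting when prioritizing review effort.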
Additional Metrics
Tool Development
Visualization plugin for IDEs …does it actually help?
foo!foobarDeviceQueueRequest+0x68
foo!fooDeviceSetup+0x72
foo!fooAllDone+0xA8
bar!barDeviceQueueRequest+0xB6
bar!barDeviceSetup+0x08
bar!barAllDone+0xFF
center!processAction+0x1034
center!dontDoAnything+0x1030
Conclusion