TRANSCRIPT

Evaluating Static Analysis Tools
Dr. Paul E. Black, [email protected]
http://samate.nist.gov/
Static and Dynamic Analysis Complement Each Other

Static Analysis: examine code
• Handles unfinished code
• Can find backdoors, e.g., full access for user name “JoshuaCaleb” (see the sketch below)
• Potentially complete

Dynamic Analysis: run code
• Source code not needed, e.g., embedded systems
• Has few(er) assumptions
• Covers end-to-end or system tests
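To make the backdoor point concrete, here is a minimal hypothetical sketch (not from the talk; check_password is an assumed helper) of the kind of backdoor a static analyzer can flag even when the code path is never exercised:

    /* Hypothetical backdoor: a hard-coded user name grants full access. */
    #include <string.h>

    extern int check_password(const char *user, const char *pass);  /* assumed helper */

    int authenticate(const char *user, const char *pass) {
        if (strcmp(user, "JoshuaCaleb") == 0)
            return 1;                       /* backdoor: full access, no password check */
        return check_password(user, pass);  /* normal path */
    }

A static analyzer examining the source sees the string literal; a dynamic test suite finds it only if it happens to try that user name.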
Different Static Analyzers Are Used For Different Purposes

• To check intellectual property violation
• By developers to decide if anything needs to be fixed (and to learn better practices)
• By auditors or reviewers to decide if it is good enough for use
Dimensions of Static Analysis

[Figure: a cube with three axes. Properties: general (implicit) vs. application (explicit). Code: source, byte code, binary. Level of Rigor: syntactic, heuristic, analytic, formal.]

• Analysis can look for general or application-specific properties.
• Analysis can be on source code, byte code, or binary.
• The level of rigor can vary from syntactic to fully formal.
SATE 2008 Overview

Static Analysis Tool Exposition (SATE) goals:
– Enable empirical research based on large test sets
– Encourage improvement of tools
– Speed adoption of tools by objectively demonstrating their use on real software
NOT to choose the “best” tool

Co-funded by NIST and DHS, Nat’l Cyber Security Division

Participants:
• Aspect Security ASC
• Checkmarx CxSuite
• Flawfinder
• Fortify SCA
• Grammatech CodeSonar
• HP DevInspect
• SofCheck Inspector for Java
• UMD FindBugs
• Veracode SecurityReview
SATE 2008 Events

• Telecons, etc., to come up with procedures and goals
• We chose 6 C & Java programs with security implications and gave them to tool makers (15 Feb)
• Tool makers ran their tools and returned reports (29 Feb)
• We analyzed the reports and (tried to) find “ground truth” (15 Apr)
  – We expected a few thousand warnings; we got over 48,000.
• Critique and update rounds with some tool makers (13 May)
• Everyone shared observations at a workshop (12 June)
• We released our final report and all data on 30 June 2009
http://samate.nist.gov/index.php/SATE.html
SATE 2008: There’s No Such Thing as “One Weakness”

Only 1/8 to 1/3 of weaknesses are simple. The notion breaks down when
– weakness classes are related and
– data or control flows are intermingled.
Even “location” is nebulous. Weaknesses form hierarchies, chains (e.g., lang = %2e./%2e./%2e/etc/passwd%00; see the sketch below), and composites.

from “Chains and Composites”, Steve Christey, MITRE
http://cwe.mitre.org/data/reports/chains_and_composites.html
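Here is a minimal hypothetical sketch (my reconstruction, not the SATE test code) of the chain pattern behind the lang example: validating before canonicalizing (CWE-180) lets URL decoding re-create “../” after the check, enabling relative path traversal (CWE-23).

    /* Chain sketch: validate-before-canonicalize (CWE-180) enabling
       relative path traversal (CWE-23). */
    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    static int hexval(int c) {
        return isdigit((unsigned char)c) ? c - '0' : tolower((unsigned char)c) - 'a' + 10;
    }

    /* decode %XX escapes in place */
    static void url_decode(char *s) {
        char *out = s;
        while (*s) {
            if (s[0] == '%' && isxdigit((unsigned char)s[1])
                            && isxdigit((unsigned char)s[2])) {
                *out++ = (char)(hexval(s[1]) * 16 + hexval(s[2]));
                s += 3;
            } else {
                *out++ = *s++;
            }
        }
        *out = '\0';
    }

    FILE *open_language_file(const char *lang) {
        char path[256];

        if (strstr(lang, "../"))        /* CWE-180: validate first... */
            return NULL;

        snprintf(path, sizeof path, "pages/%s.html", lang);
        url_decode(path);               /* ...then canonicalize: "%2e." decodes
                                           to "..", so the check never fired */
        return fopen(path, "r");        /* CWE-23: traversal escapes pages/ */
    }

No single line is “the” weakness: the flaw is the ordering of the check at one line and the decode at another, which is why counting or locating “one weakness” breaks down.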
How Weakness Classes Relate

[Diagram relating weakness classes: Improper Input Validation (CWE-20), Cross-Site Scripting (CWE-79), Command Injection (CWE-77), Validate-Before-Canonicalize (CWE-180), Relative Path Traversal (CWE-23), Container Errors (CWE-216), Race Conditions (CWE-362), Predictability (CWE-340), Permissions (CWE-275), Symlink Following (CWE-61)]
Intermingled Flow: 2 Sources, 2 Sinks, 4 Paths. How Many Weakness Sites?

[Figure: use-after-free flows with two use sites (lines 819 and 808) and two free sites (lines 1503 and 2644), intermingled into four paths]
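A minimal hypothetical sketch (not the SATE code) of the same pattern: two free sites and two use sites, aliased so each free can reach each use, giving 2 sources x 2 sinks = 4 use-after-free paths with no single line that is “the” weakness site.

    #include <stdlib.h>

    void process(int a, int b) {
        char *p = malloc(16);
        char *q = malloc(16);
        if (!p || !q) { free(p); free(q); return; }

        free(a ? p : q);          /* source 1: frees p or q, depending on a */
        if (b)
            free(a ? q : p);      /* source 2: frees the other buffer */

        p[0] = 'x';               /* sink 1: freed via source 1 (a) or source 2 (!a && b) */
        q[0] = 'y';               /* sink 2: freed via source 1 (!a) or source 2 (a && b) */
    }

Depending on a and b, each of the two frees can flow to each of the two uses, so a tool (or an analyst) must decide whether to report one, two, or four weaknesses.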
Other Observations

• Tools can’t catch everything: cleartext transmission, unimplemented features, improper access control, …
• Tools catch real problems: XSS, buffer overflow, cross-site request forgery - 13 of the SANS Top 25 (21 with related CWEs)
• Tools reported some 200 different kinds of weaknesses
  – Buffer errors are still very frequent in C (example below)
  – Many XSS errors in Java
• “Raw” report rates vary by 3x depending on the code
• Tools are even more helpful when “tuned”
• Coding without security in mind leaves MANY weaknesses
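For instance, the frequent C buffer errors are typically this shape (a hypothetical minimal example, not SATE code):

    /* CWE-120: a fixed-size buffer written without a bounds check. */
    #include <string.h>

    void greet(const char *name) {
        char buf[16];
        strcpy(buf, name);   /* overflows buf whenever name needs 16+ bytes */
    }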
Current Source Code Security Analyzers Have Little Overlap

[Figure: warnings broken down by how many tools reported them: 2 tools, 3 tools, 4 tools, all 5 tools]

• Non-overlap: hits reported by one tool and no others (84%)
• Overlap: hits reported by more than one tool (16%)

from MITRE
Precision & Recall Scoring

[Figure: tools plotted on two 0-100 axes: one runs from “No True Positives” to “All True Positives” (finds more flaws), the other from “Reports Everything” to “Misses Everything” (finds mostly flaws). “Better” points toward The Perfect Tool, which finds all flaws and finds only flaws.]

from DoD
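For reference, the standard definitions behind this kind of scoring (my gloss; the slide does not spell them out), where TP is real flaws reported, FP is spurious warnings, and FN is real flaws missed:

    \mathrm{precision} = \frac{TP}{TP + FP},
    \qquad
    \mathrm{recall} = \frac{TP}{TP + FN}

“The Perfect Tool” sits at precision = recall = 100%: it finds all flaws (recall) and finds only flaws (precision).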
Tool A

[Figure: Tool A’s precision and recall plotted per flaw type: uninitialized variable use, null pointer dereference, improper return value use, use after free, TOCTOU, memory leak, buffer overflow, tainted data/unvalidated user input, plus an aggregate point for all flaw types]

from DoD
Tool B

[Figure: Tool B’s precision and recall plotted per flaw type: uninitialized variable use, null pointer dereference, improper return value use, use after free, TOCTOU, memory leak, buffer overflow, tainted data/unvalidated user input, command injection, format string vulnerability, plus an aggregate point for all flaw types]

from DoD
Best Tool

[Figure: the best precision and recall per flaw type: uninitialized variable use, null pointer dereference, improper return value use, use after free, TOCTOU, memory leak, buffer overflow, tainted data/unvalidated user input, command injection, format string vulnerability]

from DoD
Tools Useful in Quality “Plains”

• Tools alone are not enough to achieve the highest “peaks” of quality.
• In the “plains” of typical quality, tools can help.
• If code is adrift in a “sea” of chaos, train developers.

[Photo: Tararua mountains and the Horowhenua region, New Zealand. Swazi Apparel Limited, www.swazi.co.nz; used with permission]
Tips on Tool Evaluation

• Start with many examples covering code complexities and weaknesses
  – SAMATE Reference Dataset (SRD): http://samate.nist.gov/SRD
  – Many cases from MIT: Lippmann, Zitser, Leek, Kratkiewicz
• Add some of your typical code.
• Look for
  – Weakness types (CWEs) reported
  – Code complexities handled
  – Traces, explanations, and other analyst support
  – Integration and machine-readable reports
  – Ability to write rules and ignore “known good” code
• The false alarm ratio (fp/tp) is a poor measure; report density (r/kLoC) is probably better. (See the worked example below.)
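A worked comparison with hypothetical numbers shows why: the false alarm ratio depends on how buggy the code under test is, while density does not.

    \text{buggy 10-kLoC program: } tp = 10,\ fp = 40,\ r = 50
        \;\Rightarrow\; fp/tp = 4.0,\quad r/\mathrm{kLoC} = 5.0
    \text{clean 10-kLoC program: } tp = 0,\ fp = 5,\ r = 5
        \;\Rightarrow\; fp/tp \text{ undefined},\quad r/\mathrm{kLoC} = 0.5

On the clean program the ratio blows up even though the tool’s noise per thousand lines is unchanged; report density stays comparable across code bases.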