TRANSCRIPT

Evaluating Static Analysis Tools
Dr. Paul E. Black, [email protected]
http://samate.nist.gov/
Static and Dynamic Analysis Complement Each Other

Static Analysis: examine code
• Handles unfinished code
• Can find backdoors, e.g., full access for user name “JoshuaCaleb” (see the sketch below)
• Potentially complete

Dynamic Analysis: run code
• Source code not needed, e.g., embedded systems
• Has few(er) assumptions
• Covers end-to-end or system tests
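To make the backdoor point concrete, here is a minimal hypothetical sketch (not from the talk; check_password is an assumed helper) of the kind of backdoor a static analyzer can flag even when the code path is never exercised:

    /* Hypothetical backdoor: a hard-coded user name grants full access. */
    #include <string.h>

    extern int check_password(const char *user, const char *pass);  /* assumed helper */

    int authenticate(const char *user, const char *pass) {
        if (strcmp(user, "JoshuaCaleb") == 0)
            return 1;                       /* backdoor: full access, no password check */
        return check_password(user, pass);  /* normal path */
    }

A static analyzer examining the source sees the string literal; a dynamic test suite finds it only if it happens to try that user name.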
Different Static Analyzers Are Used For Different Purposes

• To check intellectual property violation
• By developers to decide if anything needs to be fixed (and to learn better practices)
• By auditors or reviewers to decide if it is good enough for use
Dimensions of Static Analysis

[Figure: a cube with three axes. Properties: general (implicit) vs. application (explicit). Code: source, byte code, binary. Level of Rigor: syntactic, heuristic, analytic, formal.]

• Analysis can look for general or application-specific properties.
• Analysis can be on source code, byte code, or binary.
• The level of rigor can vary from syntactic to fully formal.
SATE 2008 Overview

Static Analysis Tool Exposition (SATE) goals:
– Enable empirical research based on large test sets
– Encourage improvement of tools
– Speed adoption of tools by objectively demonstrating their use on real software
NOT to choose the “best” tool

Co-funded by NIST and DHS, Nat’l Cyber Security Division

Participants:
• Aspect Security ASC
• Checkmarx CxSuite
• Flawfinder
• Fortify SCA
• Grammatech CodeSonar
• HP DevInspect
• SofCheck Inspector for Java
• UMD FindBugs
• Veracode SecurityReview
SATE 2008 Events

• Telecons, etc., to come up with procedures and goals
• We chose 6 C & Java programs with security implications and gave them to tool makers (15 Feb)
• Tool makers ran their tools and returned reports (29 Feb)
• We analyzed the reports and (tried to) find “ground truth” (15 Apr)
  – We expected a few thousand warnings; we got over 48,000.
• Critique and update rounds with some tool makers (13 May)
• Everyone shared observations at a workshop (12 June)
• We released our final report and all data on 30 June 2009
http://samate.nist.gov/index.php/SATE.html
SATE 2008: There’s No Such Thing as “One Weakness”

Only 1/8 to 1/3 of weaknesses are simple. The notion breaks down when
– weakness classes are related and
– data or control flows are intermingled.
Even “location” is nebulous. Weaknesses form hierarchies, chains (e.g., lang = %2e./%2e./%2e/etc/passwd%00; see the sketch below), and composites.

from “Chains and Composites”, Steve Christey, MITRE
http://cwe.mitre.org/data/reports/chains_and_composites.html
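Here is a minimal hypothetical sketch (my reconstruction, not the SATE test code) of the chain pattern behind the lang example: validating before canonicalizing (CWE-180) lets URL decoding re-create “../” after the check, enabling relative path traversal (CWE-23).

    /* Chain sketch: validate-before-canonicalize (CWE-180) enabling
       relative path traversal (CWE-23). */
    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    static int hexval(int c) {
        return isdigit((unsigned char)c) ? c - '0' : tolower((unsigned char)c) - 'a' + 10;
    }

    /* decode %XX escapes in place */
    static void url_decode(char *s) {
        char *out = s;
        while (*s) {
            if (s[0] == '%' && isxdigit((unsigned char)s[1])
                            && isxdigit((unsigned char)s[2])) {
                *out++ = (char)(hexval(s[1]) * 16 + hexval(s[2]));
                s += 3;
            } else {
                *out++ = *s++;
            }
        }
        *out = '\0';
    }

    FILE *open_language_file(const char *lang) {
        char path[256];

        if (strstr(lang, "../"))        /* CWE-180: validate first... */
            return NULL;

        snprintf(path, sizeof path, "pages/%s.html", lang);
        url_decode(path);               /* ...then canonicalize: "%2e." decodes
                                           to "..", so the check never fired */
        return fopen(path, "r");        /* CWE-23: traversal escapes pages/ */
    }

No single line is “the” weakness: the flaw is the ordering of the check at one line and the decode at another, which is why counting or locating “one weakness” breaks down.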
How Weakness Classes Relate

[Diagram relating weakness classes: Improper Input Validation (CWE-20), Cross-Site Scripting (CWE-79), Command Injection (CWE-77), Validate-Before-Canonicalize (CWE-180), Relative Path Traversal (CWE-23), Container Errors (CWE-216), Race Conditions (CWE-362), Predictability (CWE-340), Permissions (CWE-275), Symlink Following (CWE-61)]
Intermingled Flow: 2 Sources, 2 Sinks, 4 Paths. How Many Weakness Sites?

[Figure: use-after-free flows with two use sites (lines 819 and 808) and two free sites (lines 1503 and 2644), intermingled into four paths]
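A minimal hypothetical sketch (not the SATE code) of the same pattern: two free sites and two use sites, aliased so each free can reach each use, giving 2 sources x 2 sinks = 4 use-after-free paths with no single line that is “the” weakness site.

    #include <stdlib.h>

    void process(int a, int b) {
        char *p = malloc(16);
        char *q = malloc(16);
        if (!p || !q) { free(p); free(q); return; }

        free(a ? p : q);          /* source 1: frees p or q, depending on a */
        if (b)
            free(a ? q : p);      /* source 2: frees the other buffer */

        p[0] = 'x';               /* sink 1: freed via source 1 (a) or source 2 (!a && b) */
        q[0] = 'y';               /* sink 2: freed via source 1 (!a) or source 2 (a && b) */
    }

Depending on a and b, each of the two frees can flow to each of the two uses, so a tool (or an analyst) must decide whether to report one, two, or four weaknesses.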
Other Observations

• Tools can’t catch everything: cleartext transmission, unimplemented features, improper access control, …
• Tools catch real problems: XSS, buffer overflow, cross-site request forgery - 13 of the SANS Top 25 (21 with related CWEs)
• Tools reported some 200 different kinds of weaknesses
  – Buffer errors are still very frequent in C (example below)
  – Many XSS errors in Java
• “Raw” report rates vary by 3x depending on the code
• Tools are even more helpful when “tuned”
• Coding without security in mind leaves MANY weaknesses
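For instance, the frequent C buffer errors are typically this shape (a hypothetical minimal example, not SATE code):

    /* CWE-120: a fixed-size buffer written without a bounds check. */
    #include <string.h>

    void greet(const char *name) {
        char buf[16];
        strcpy(buf, name);   /* overflows buf whenever name needs 16+ bytes */
    }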
Current Source Code Security Analyzers Have Little Overlap

[Figure: warnings broken down by how many tools reported them: 2 tools, 3 tools, 4 tools, all 5 tools]

• Non-overlap: hits reported by one tool and no others (84%)
• Overlap: hits reported by more than one tool (16%)

from MITRE
Precision & Recall Scoring

[Figure: tools plotted on two 0-100 axes: one runs from “No True Positives” to “All True Positives” (finds more flaws), the other from “Reports Everything” to “Misses Everything” (finds mostly flaws). “Better” points toward The Perfect Tool, which finds all flaws and finds only flaws.]

from DoD
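For reference, the standard definitions behind this kind of scoring (my gloss; the slide does not spell them out), where TP is real flaws reported, FP is spurious warnings, and FN is real flaws missed:

    \mathrm{precision} = \frac{TP}{TP + FP},
    \qquad
    \mathrm{recall} = \frac{TP}{TP + FN}

“The Perfect Tool” sits at precision = recall = 100%: it finds all flaws (recall) and finds only flaws (precision).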
Tool A

[Figure: Tool A’s precision and recall plotted per flaw type: uninitialized variable use, null pointer dereference, improper return value use, use after free, TOCTOU, memory leak, buffer overflow, tainted data/unvalidated user input, plus an aggregate point for all flaw types]

from DoD
Tool B

[Figure: Tool B’s precision and recall plotted per flaw type: uninitialized variable use, null pointer dereference, improper return value use, use after free, TOCTOU, memory leak, buffer overflow, tainted data/unvalidated user input, command injection, format string vulnerability, plus an aggregate point for all flaw types]

from DoD
Best Tool

[Figure: the best precision and recall per flaw type: uninitialized variable use, null pointer dereference, improper return value use, use after free, TOCTOU, memory leak, buffer overflow, tainted data/unvalidated user input, command injection, format string vulnerability]

from DoD
Tools Useful in Quality “Plains”

• Tools alone are not enough to achieve the highest “peaks” of quality.
• In the “plains” of typical quality, tools can help.
• If code is adrift in a “sea” of chaos, train developers.

[Photo: Tararua mountains and the Horowhenua region, New Zealand. Swazi Apparel Limited, www.swazi.co.nz; used with permission]
Tips on Tool Evaluation

• Start with many examples covering code complexities and weaknesses
  – SAMATE Reference Dataset (SRD): http://samate.nist.gov/SRD
  – Many cases from MIT: Lippmann, Zitser, Leek, Kratkiewicz
• Add some of your typical code.
• Look for
  – Weakness types (CWEs) reported
  – Code complexities handled
  – Traces, explanations, and other analyst support
  – Integration and machine-readable reports
  – Ability to write rules and ignore “known good” code
• The false alarm ratio (fp/tp) is a poor measure; report density (r/kLoC) is probably better. (See the worked example below.)
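A worked comparison with hypothetical numbers shows why: the false alarm ratio depends on how buggy the code under test is, while density does not.

    \text{buggy 10-kLoC program: } tp = 10,\ fp = 40,\ r = 50
        \;\Rightarrow\; fp/tp = 4.0,\quad r/\mathrm{kLoC} = 5.0
    \text{clean 10-kLoC program: } tp = 0,\ fp = 5,\ r = 5
        \;\Rightarrow\; fp/tp \text{ undefined},\quad r/\mathrm{kLoC} = 0.5

On the clean program the ratio blows up even though the tool’s noise per thousand lines is unchanged; report density stays comparable across code bases.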