Software Bertillonage: Finding the Provenance of Software Entities

Mike Godfrey, Software Architecture Group (SWAG), University of Waterloo

Post on 15-Feb-2016


Page 1: Software Bertillonage Finding the Provenance of Software Entities

Software Bertillonage

Finding the Provenance of Software Entities

Mike Godfrey
Software Architecture Group (SWAG)

University of Waterloo

Page 2: Software Bertillonage Finding the Provenance of Software Entities

Work with …

• Julius Davies**
• Daniel German
• Abram Hindle
• Wei Wang
• Ian Davis
• Cory Kapser***
• Lijie Zou
• Qiang Tu

** Did most of the hard work this time
*** Did most of the hard work last time

Page 3: Software Bertillonage Finding the Provenance of Software Entities
Page 4: Software Bertillonage Finding the Provenance of Software Entities

“Provenance”

• Originally used for art / antiques, but now used in science and IT:
  – Data provenance / audit trails
  – Component provenance for security
  – … but what about source code artifacts?

A set of documentary evidence pertaining to the origin, history, or ownership of an artifact.

[From “provenir”, French for “to come from”]

Page 5: Software Bertillonage Finding the Provenance of Software Entities

Consider this code …

    const char *err = ap_check_cmd_context(cmd, GLOBAL_ONLY);
    if (err != NULL) {
        return err;
    }
    ap_threads_per_child = atoi(arg);
    if (ap_threads_per_child > thread_limit) {
        ap_log_error(APLOG_MARK, APLOG_STARTUP, 0, NULL,
                     "WARNING: ThreadsPerChild of %d exceeds ThreadLimit "
                     "value of %d", ap_threads_per_child, thread_limit);
        ….
        ap_threads_per_child = thread_limit;
    }
    else if (ap_threads_per_child < 1) {
        ap_log_error(APLOG_MARK, APLOG_STARTUP, 0, NULL,
                     "WARNING: Require ThreadsPerChild > 0, setting to 1");
        ap_threads_per_child = 1;
    }
    return NULL;

Page 6: Software Bertillonage Finding the Provenance of Software Entities

… and this code

    const char *err = ap_check_cmd_context(cmd, GLOBAL_ONLY);
    if (err != NULL) {
        return err;
    }
    ap_threads_per_child = atoi(arg);
    if (ap_threads_per_child > thread_limit) {
        ap_log_error(APLOG_MARK, APLOG_STARTUP, 0, NULL,
                     "WARNING: ThreadsPerChild of %d exceeds ThreadLimit "
                     "value of %d threads,", ap_threads_per_child, thread_limit);
        ….
        ap_threads_per_child = thread_limit;
    }
    else if (ap_threads_per_child < 1) {
        ap_log_error(APLOG_MARK, APLOG_STARTUP, 0, NULL,
                     "WARNING: Require ThreadsPerChild > 0, setting to 1");
        ap_threads_per_child = 1;
    }
    return NULL;

Page 7: Software Bertillonage Finding the Provenance of Software Entities

… or these two functions

    gnumeric_oct2bin (FunctionEvalInfo *ei, GnmValue const * const *argv)
    {
        return val_to_base (ei, argv[0], argv[1], 8, 2, 0,
                            GNM_const(7777777777.0),
                            V2B_STRINGS_MAXLEN | V2B_STRINGS_BLANK_ZERO);
    }

    gnumeric_hex2bin (FunctionEvalInfo *ei, GnmValue const * const *argv)
    {
        return val_to_base (ei, argv[0], argv[1], 16, 2, 0,
                            GNM_const(9999999999.0),
                            V2B_STRINGS_MAXLEN | V2B_STRINGS_BLANK_ZERO);
    }

Page 8: Software Bertillonage Finding the Provenance of Software Entities

Or this …

    static PyObject *
    py_new_RangeRef_object (const GnmRangeRef *range_ref)
    {
        py_RangeRef_object *self;

        self = PyObject_NEW (py_RangeRef_object, &py_RangeRef_object_type);
        if (self == NULL) {
            return NULL;
        }
        self->range_ref = *range_ref;
        return (PyObject *) self;
    }

Page 9: Software Bertillonage Finding the Provenance of Software Entities

… and this

    static PyObject *
    py_new_Range_object (GnmRange const *range)
    {
        py_Range_object *self;

        self = PyObject_NEW (py_Range_object, &py_Range_object_type);
        if (self == NULL) {
            return NULL;
        }
        self->range = *range;
        return (PyObject *) self;
    }

Page 10: Software Bertillonage Finding the Provenance of Software Entities

“Code clone” detection methods

• Strings
• Tokens
• ASTs
• PDGs

Time and complexity / prog lang dependence
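To make the difference between the first two levels concrete, here is a minimal sketch (not the tooling used in this work) that compares the two `ap_log_error` calls shown earlier, first as raw strings and then as token sequences. The tokenizer regex and the helper names `tokens`, `normalize`, and `similarity` are assumptions made for illustration only.

```python
import re
from difflib import SequenceMatcher

# Hypothetical tokenizer for C-like text: string literals first, then
# identifiers, integers, and finally any other single non-space character.
TOKEN_RE = re.compile(r'"(?:\\.|[^"\\])*"|[A-Za-z_]\w*|\d+|\S')


def tokens(code):
    """Split source text into a flat list of tokens."""
    return TOKEN_RE.findall(code)


def normalize(toks):
    """Replace every string literal with a placeholder token."""
    return ['"STR"' if t.startswith('"') else t for t in toks]


def similarity(a, b):
    """Fraction of matching tokens between two token sequences (0.0 to 1.0)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


old = ('ap_log_error(APLOG_MARK, APLOG_STARTUP, 0, NULL, '
       '"value of %d", ap_threads_per_child, thread_limit);')
new = ('ap_log_error(APLOG_MARK, APLOG_STARTUP, 0, NULL, '
       '"value of %d threads,", ap_threads_per_child, thread_limit);')

print("identical strings:", old == new)
print("token similarity:", similarity(tokens(old), tokens(new)))
print("normalized token similarity:",
      similarity(normalize(tokens(old)), normalize(tokens(new))))
```

An exact string comparison reports the two fragments as different, while the token view shows that only one token (the string literal) changed. AST and PDG comparison would need a real parser for the language, which is where the extra time and language dependence come from.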

Page 11: Software Bertillonage Finding the Provenance of Software Entities

Provenance: Related ideas

• Software clone detection
  – Why?
    • Just “understand” where/why duplication has occurred
    • Possible refactoring to reduce inconsistent maintenance, binary footprint size, to improve design, …
    • Tracking software licensing compatibilities, esp. included libraries and cross-product entity “adoption”
  – Many techniques for this

Page 12: Software Bertillonage Finding the Provenance of Software Entities

Provenance: Related ideas

• “Origin analysis” + software genealogy
  – Why?
    • Program comprehension
    • Name / location change of sw entity within a system can break longitudinal studies
  – Use entity and relationship analysis to look for likely suspects [TSE-05]

[Figure: entity g with neighbours y, x, z in V_old; entity f with neighbours y, x, z in V_new; “???” between the two versions]

Page 13: Software Bertillonage Finding the Provenance of Software Entities

Provenance: Related ideas

• Software dev. “recommender systems”, mining software repositories
  – Why?
    • Given info about similar situations, what might be helpful / informative in this situation?
  – Many techniques (AI, LSI, LDA, data mining, plus ad hoc specializations + combinations)
• … and so on …

Page 14: Software Bertillonage Finding the Provenance of Software Entities

Who are you?

Alphonse Bertillon (1853-1914)

Page 15: Software Bertillonage Finding the Provenance of Software Entities


Page 16: Software Bertillonage Finding the Provenance of Software Entities

Bertillonage metrics

1. Height
2. Stretch: Length of body from left shoulder to right middle finger when arm is raised
3. Bust: Length of torso from head to seat, taken when seated
4. Length of head: Crown to forehead
5. Width of head: Temple to temple
6. Length of right ear
7. Length of left foot
8. Length of left middle finger
9. Length of left cubit: Elbow to tip of middle finger
10. Width of cheeks

Page 17: Software Bertillonage Finding the Provenance of Software Entities

Forensic Bertillonage

• Some problems …
  – Equipment was cumbersome, expensive, required training
  – Measurement error, consistency
  – The metrics were not independent!
  – Adoption (and later abandonment)
• … but overall it was a big success!
  – Quick and dirty, and a huge leap forward
  – Some training and tools required but could be performed with technology of late 1800s
  – If done accurately, could quickly narrow down a very large pool of mugshots to only a handful

Page 18: Software Bertillonage Finding the Provenance of Software Entities

Software Bertillonage

• We want quick & dirty ways of investigating the provenance of a function (file, library, binary, etc.)
  – Who are you, really?
    • Entity and relationship analysis
  – Where did you come from?
    • Evolutionary history
  – Does your mother know you’re here?
    • Licensing

Page 19: Software Bertillonage Finding the Provenance of Software Entities

Bertillonage desiderata

• A good Bertillonage metric should:
  – be computationally inexpensive
  – be applicable to the desired level of granularity and programming language
  – catch most of the bad guys (recall)
  – significantly reduce the search space (precision)
• Why not “fingerprinting” or “DNA analysis”?
  – Often there just is not enough info (or too much noise) to make conclusive identification
  – So we hope to reduce the candidate set so that manual examination is feasible

Page 20: Software Bertillonage Finding the Provenance of Software Entities

Bertillonage meta-techniques

1. Count based: size, LOC, fan-in, McCabe
2. Set based: contained string literals, method names
3. Relationship based: call sets, throws sets, libraries included / used
4. Sequence based: methods in order, token-based clone detection
5. Graph based: AST and PDG clone detection
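As a rough illustration of the first three families, the sketch below builds a cheap "measurement card" for a code fragment and uses it to decide whether two fragments deserve a closer look. The regexes, the thresholds, and the names `profile`, `jaccard`, and `plausible_match` are all hypothetical; the fragments are the bodies of the two gnumeric functions shown earlier.

```python
import re


def profile(code):
    """Collect a few cheap measurements of one code fragment."""
    lines = [ln for ln in code.splitlines() if ln.strip()]
    return {
        "loc": len(lines),                                         # count based
        "identifiers": set(re.findall(r"[A-Za-z_]\w*", code)),     # set based
        "calls": set(re.findall(r"\b([A-Za-z_]\w*)\s*\(", code)),  # relationship based
    }


def jaccard(a, b):
    """Overlap of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def plausible_match(p, q, loc_tolerance=2, min_overlap=0.8):
    """Cheap filter: keep a pair only if every measurement roughly agrees."""
    return (abs(p["loc"] - q["loc"]) <= loc_tolerance
            and jaccard(p["identifiers"], q["identifiers"]) >= min_overlap
            and jaccard(p["calls"], q["calls"]) >= min_overlap)


oct_body = ("return val_to_base (ei, argv[0], argv[1], 8, 2, 0, "
            "GNM_const(7777777777.0), "
            "V2B_STRINGS_MAXLEN | V2B_STRINGS_BLANK_ZERO);")
hex_body = ("return val_to_base (ei, argv[0], argv[1], 16, 2, 0, "
            "GNM_const(9999999999.0), "
            "V2B_STRINGS_MAXLEN | V2B_STRINGS_BLANK_ZERO);")
unrelated = "x = y + 1;"

p_oct, p_hex, p_other = profile(oct_body), profile(hex_body), profile(unrelated)
print(plausible_match(p_oct, p_hex))    # the two gnumeric bodies
print(plausible_match(p_oct, p_other))  # an unrelated statement
```

The filter keeps the oct/hex pair and rejects the unrelated statement. Sequence-based and graph-based comparison would then be run only on the pairs that survive a filter of this kind.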

Page 21: Software Bertillonage Finding the Provenance of Software Entities

Software Bertillonage: A motivating example

Page 22: Software Bertillonage Finding the Provenance of Software Entities

A real-world problem

• Software packages often bundle in third-party libraries to avoid “DLL-hell” [Di Penta-10]
  – In Java world, jars may include library source code or just byte code
  – Included libs may include other libs too!
• Payment Card Industry Data Security Std (PCI-DSS), Req #6:
  – “All critical systems must have the most recently released, appropriate software patches to protect against exploitation and compromise of cardholder data.”

What if a financial software package doesn’t explicitly list the version IDs of its included libraries?

Page 23: Software Bertillonage Finding the Provenance of Software Entities

Identifying included libraries

• The version ID may be embedded in the name of the component! e.g., commons-codec-1.1.jar
  – … but often the version info is simply not there!
• Use fully qualified name (FQN) of each class plus a code search engine [Di Penta-10]
  – Won’t work if we don’t have library source code
• Compare against all known compiled binaries
  – But compilers, build-time compilation options may differ
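The first option can be sketched in a few lines. The pattern below assumes a `<name>-<version>.jar` naming convention in which the version starts with a digit; that convention, and the helper name `split_jar_name`, are assumptions for illustration. The function returns `None` for the version when the file name carries none, which is the case the rest of the talk addresses.

```python
import re

# Assumed convention: <name>-<version>.jar, version beginning with a digit.
JAR_NAME_RE = re.compile(
    r"^(?P<name>.+?)-(?P<version>\d[\w.]*(?:-[\w.]+)*)\.jar$"
)


def split_jar_name(filename):
    """Return (name, version); version is None if the name does not embed one."""
    match = JAR_NAME_RE.match(filename)
    if match is None:
        name = filename[:-4] if filename.endswith(".jar") else filename
        return name, None
    return match.group("name"), match.group("version")


for jar in ["commons-codec-1.1.jar", "commons-codec.jar"]:
    print(jar, "->", split_jar_name(jar))
```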

Page 24: Software Bertillonage Finding the Provenance of Software Entities

Anchored class signatures

• Idea: Compile / acquire all known lib versions but extract only the signatures, then compare against target binary
  – Shouldn’t vary by compiler/build settings
• For a class C with methods M_1, …, M_n, we define its anchored class signature as:

  θ(C) = ⟨σ(C), ⟨σ(M_1), ..., σ(M_n)⟩⟩

• For an archive A composed of classes C_1, …, C_k, we define its anchored class signature as:

  θ(A) = {θ(C_1), ..., θ(C_k)}

Page 25: Software Bertillonage Finding the Provenance of Software Entities

    // This is **decompiled** source!!
    package a.b;

    public class C extends java.lang.Object implements g.h.I {
        public C() {
            // default constructor is inserted by javac
        }

        synchronized static int a (java.lang.String s) throws a.b.E {
            // decompiled byte code omitted
        }
    }

σ(C) = public class a.b.C extends Object implements I
σ(M_1) = public C()
σ(M_2) = default synchronized static int a(String) throws E

θ(C) = ⟨σ(C), ⟨σ(M_1), σ(M_2)⟩⟩
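The two definitions map directly onto nested tuples and sets. The following is a minimal sketch, assuming the signature strings σ have already been extracted; the function names are hypothetical, and sorting the method signatures is an assumption made here so that declaration order does not affect equality.

```python
def class_signature(class_decl, method_decls):
    """theta(C): the class declaration paired with its method signatures.

    Sorting the method signatures is an assumption of this sketch.
    """
    return (class_decl, tuple(sorted(method_decls)))


def archive_signature(classes):
    """theta(A): the set of class signatures for all classes in an archive."""
    return frozenset(class_signature(decl, methods) for decl, methods in classes)


# The decompiled class from the example above.
theta_C = class_signature(
    "public class a.b.C extends Object implements I",
    ["public C()", "default synchronized static int a(String) throws E"],
)

# An archive containing just that class, listed twice with methods reordered.
theta_A = archive_signature([
    ("public class a.b.C extends Object implements I",
     ["public C()", "default synchronized static int a(String) throws E"]),
    ("public class a.b.C extends Object implements I",
     ["default synchronized static int a(String) throws E", "public C()"]),
])

print(theta_C)
print(len(theta_A))
```

Because θ(A) is a set, the duplicated class collapses to a single element, and the archive signature does not depend on the order in which classes or methods are listed.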

Page 26: Software Bertillonage Finding the Provenance of Software Entities

Archive similarity

• We define the similarity index of two archives as their Jaccard coefficient:

  simIndex(A, B) = |θ(A) ∩ θ(B)| / |θ(A) ∪ θ(B)|

• We define the inclusion index of two archives as:
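Both indices are simple set computations over archive signatures. In the sketch below, the similarity index is the standard Jaccard coefficient. The formula for the inclusion index did not survive in this transcript, so the form used here (the fraction of A's class signatures that also occur in B) is an assumption and should be checked against the authors' paper. The class labels `c1` to `c5` are placeholders.

```python
def similarity_index(theta_a, theta_b):
    """Jaccard coefficient of two archive signatures (sets of class signatures)."""
    union = theta_a | theta_b
    return len(theta_a & theta_b) / len(union) if union else 0.0


def inclusion_index(theta_a, theta_b):
    """ASSUMED form: share of A's class signatures that also appear in B."""
    return len(theta_a & theta_b) / len(theta_a) if theta_a else 0.0


# Placeholder class signatures; real ones would be theta(C) values.
A = {"c1", "c2", "c3"}
B = {"c2", "c3", "c4", "c5"}

print(similarity_index(A, B))  # 2 shared out of 5 distinct
print(inclusion_index(A, B))   # 2 of A's 3 classes are in B
```

Under this reading, the similarity index is symmetric, while the inclusion index is not: it would stay high when a small archive has been bundled inside a much larger one.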

Page 27: Software Bertillonage Finding the Provenance of Software Entities

Implementation

• Created byte code (bcel5) and source code signature extractors
• Used SHA1 hash for class signatures to improve performance
  – We don’t care about near misses at the method or class level!
• Built corpus from Maven2 jar repository
  – Maven is unversioned + volatile!
  – 150 GB of jars, zips, tarballs, etc.
  – 130,000 binary jars (75,000 unique)
  – 26M .class files, 4M .java source files (incl. duplicates)
  – Archives contain archives: 75,000 classes are nested 4 levels deep!
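Hashing each class signature turns set comparison into cheap string equality and allows an inverted index over the corpus. The sketch below shows one way this could look; the text encoding of a signature, the index layout, and the archive names are assumptions for illustration, not the authors' implementation.

```python
import hashlib
from collections import defaultdict


def signature_hash(class_decl, method_decls):
    """SHA1 of a class signature, encoded here as newline-joined text (assumed)."""
    text = "\n".join([class_decl, *sorted(method_decls)])
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def build_index(corpus):
    """Map each class-signature hash to the corpus archives that contain it."""
    index = defaultdict(set)
    for archive_name, hashes in corpus.items():
        for h in hashes:
            index[h].add(archive_name)
    return index


def shared_counts(target_hashes, index):
    """For each corpus archive, count the hashes it shares with the target."""
    counts = defaultdict(int)
    for h in target_hashes:
        for archive_name in index.get(h, ()):
            counts[archive_name] += 1
    return dict(counts)


h1 = signature_hash("public class a.b.C extends Object implements I",
                    ["public C()"])
h2 = signature_hash("public class a.b.D extends Object", ["public D()"])
h3 = signature_hash("public class a.b.D extends Object",
                    ["public D()", "public void run()"])

# Hypothetical corpus: two versions of the same library.
corpus = {"lib-1.0.jar": {h1, h2}, "lib-1.1.jar": {h1, h3}}
index = build_index(corpus)
print(shared_counts({h1, h2}, index))
```

Because a hash either matches or does not, a class whose signature changed at all counts as a different class. That matches the note above that near misses at the method or class level are deliberately ignored.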

Page 28: Software Bertillonage Finding the Provenance of Software Entities

Investigation

Target system: An industrial e-commerce app containing 84 jars

RQ1: How useful is the signature similarity index in finding the original binary archive for a given binary archive?

RQ2: How useful is the signature similarity index at finding the original sources for a given binary archive?


Page 29: Software Bertillonage Finding the Provenance of Software Entities

Investigation

RQ1: How useful is the signature similarity index at finding the original binary archive for a given binary archive?

• For 51 / 84 binary jars (60.7%), we found a single candidate from the corpus with similarity index of 1.0.
  – 48 exact, 3 correct product (target version not in Maven)
• For 20 / 84, we found multiple matches with simIndex = 1.0
  – 19 exact, 1 correct product
• For 12 / 84, we found matches with 0 < simIndex < 1.0
  – 1 exact, 9 correct product, 2 incorrect product
• For 1 / 84, we found no match (product not in Maven as a binary)

More data here: http://juliusdavies.ca/uvic/jarchive/
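The RQ1 breakdown can be tallied to show how the four rows cover all 84 jars and how many of them count as an exact match or the correct product. This is only a worked check of the numbers listed above; the dictionary layout and key names are made up for the sketch.

```python
TOTAL_JARS = 84

# Row counts copied from the RQ1 breakdown above.
rq1 = {
    "single candidate, simIndex = 1.0": {"exact": 48, "correct product": 3},
    "multiple matches, simIndex = 1.0": {"exact": 19, "correct product": 1},
    "matches with 0 < simIndex < 1.0": {"exact": 1, "correct product": 9,
                                        "incorrect product": 2},
    "no match": {"not found": 1},
}

covered = sum(sum(row.values()) for row in rq1.values())
hits = sum(row.get("exact", 0) + row.get("correct product", 0)
           for row in rq1.values())

print(f"jars covered: {covered} / {TOTAL_JARS}")
print(f"exact or correct product: {hits} / {TOTAL_JARS} "
      f"({100 * hits / TOTAL_JARS:.1f}%)")
```

The tally gives 81 of 84 jars, or 96.4%, identified as an exact match or the correct product.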

Page 30: Software Bertillonage Finding the Provenance of Software Entities

Investigation

RQ2: How useful is the signature similarity index at finding the original sources for a given binary archive?

• For 22 / 84 binary jars (26.2%), we found a single candidate from the corpus with similarity index of 1.0.
  – 13 exact, 2 correct product
• For 7 / 84, we found multiple matches with simIndex = 1.0
  – 6 exact, 1 correct product
• For 46 / 84, we found matches with 0 < simIndex < 1.0
  – 25 exact, 20 correct product, 1 incorrect product
• For 16 / 84, we found no match (product not in Maven as source)

More data here: http://juliusdavies.ca/uvic/jarchive/

Page 31: Software Bertillonage Finding the Provenance of Software Entities

Investigation

RQ1: How useful is the signature similarity index in finding the original binary archive for a given binary archive?

• Found exact match or correct product 81 / 84 times (96.4%)

RQ2: How useful is the signature similarity index at finding the original sources for a given binary archive?

• Found exact match or correct product 57 / 84 times (67.9%)


Page 32: Software Bertillonage Finding the Provenance of Software Entities

Further uses

1. Used the version info extracted from the e-commerce app to perform audits for licensing and security
  – One jar changed open source licenses (GNU Affero, LGPL)
  – One jar version was found to have known security bugs
2. When did Google Android developers copy-paste httpclient.jar classes into android.jar?
  – And how much work would it be to include a newer version?
  – We narrowed it down to two likely candidates, one of which turned out to be correct.

Page 33: Software Bertillonage Finding the Provenance of Software Entities

Summary: Anchored signature matching

• If the master repository is rich enough, anchored signature matching can be highly accurate for both source and binary
  – Though we might have to examine multiple candidates with perfect scores (median: 4, max: 30)
  – Also works well if product present, but target version missing
• The approach is fast and simple, once the repository has been built

Page 34: Software Bertillonage Finding the Provenance of Software Entities

Summary: Software Provenance / Bertillonage

• Who are you?
  – Determining the provenance of software entities is a growing and important problem
• Software Bertillonage:
  – Quick and dirty techniques applied widely, then expensive techniques applied narrowly
• Identifying version IDs of included Java libraries is an example of the software provenance problem
  – And our solution of anchored signature matching is an example of software Bertillonage

Page 35: Software Bertillonage Finding the Provenance of Software Entities

References

• “Software Bertillonage: Finding the provenance of an entity”, by Julius Davies, Daniel M. German, Michael W. Godfrey, and Abram J. Hindle. Under review.
• “Copy-Paste as a Principled Engineering Tool”, by Michael W. Godfrey and Cory J. Kapser. Chapter 28 in the book Making Software: What Really Works and Why We Believe It, by Greg Wilson and Andy Oram (eds.), O'Reilly and Associates, October 2010.
• “Using Origin Analysis to Detect Merging and Splitting of Source Code Entities”, by Michael W. Godfrey and Lijie Zou, IEEE Transactions on Software Engineering, 31(2), Feb 2005.

Page 36: Software Bertillonage Finding the Provenance of Software Entities

Chapter 28 is awesome!!

All author proceeds to Amnesty International!!

Page 37: Software Bertillonage Finding the Provenance of Software Entities

Non-CS References

• Fingerprints: The Origins of Crime Detection and the Murder Case that Launched Forensic Science, Colin Beavan, Hyperion Publishing, 2001.
• http://en.wikipedia.org/wiki/Alphonse_Bertillon

Page 38: Software Bertillonage Finding the Provenance of Software Entities
