Where Did This Code Come From?

Post on 24-Feb-2016


Paradyn Project

Paradyn / Dyninst Week

Madison, Wisconsin

May 2-4, 2011

Where Did This Code Come From?

Nathan Rosenblum

Recovering the Provenance of Program Binaries

Who Wrote This Code?

[Figure: program binary shown as a bit string]

Toolchain provenance: GCC 4.2.x, unoptimized, C++

Toolchain provenance: ICC 10.x, optimized, C

[mixed]

I C++

Why provenance?

Debugging remote deployments: compiler bug? subtle incompatibility?

Shared libraries (figure): linux-vdso.so.1, libpthread.so.0, libasound.so.2, libdl.so.2, libstdc++.so.6, libm.so.6, libgcc_s.so.1, libc.so.6, /lib64/ld-linux-x86-64.so.2, librt.so.1

Forensics: reverse engineering; tools, obfuscations? decompiling

Outline

PROVENANCE STUDIES

SYSTEM DESIGN

MODELING PROGRAM PROVENANCE

BINARY CODE ABSTRACTIONS

System overview

TARGET BINARY (program)

BINARY ANALYSIS TOOL

TRAINING DATA: ICC, MSVS

LEARNING FRAMEWORK: representation + feature selection

discovering evidence of provenance

Binary code model

program binary: … 55 89 e5 83 ec 2c 57 56 53 8b 45 0c 8b 00 a3 90 a3 05 08 85 c0 74 2b 83 c4 …

code: ⟨ mov [imm], rax ; sub [imm], rax ⟩ ⟨ push ebp ; * ; mov esp, ebp ⟩

Control Flow Graph: layout, block contents

Call Graph (fprintf)

External Libraries

Graphlets

nodes: code elements (e.g. basic blocks)

typed edges (branch, call, etc.)

node colors

Ex: instruction summary graphlets

14 instruction categories (e.g. arithmetic, privileged instruction)

Color bit field: 2^14 possible colors, sparse in practice
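The color bit field can be illustrated with a minimal sketch. It assumes one bit per instruction category, which is what 14 categories and 2^14 colors suggest. Only "arithmetic" and "privileged instruction" come from the slide; the other twelve category names, and the helper `node_color`, are hypothetical placeholders.

```python
# Hypothetical category list: only "arithmetic" and "privileged" are named
# on the slide; the remaining twelve names are placeholders.
CATEGORIES = [
    "arithmetic", "privileged", "branch", "call", "return", "load", "store",
    "stack", "logic", "compare", "floating_point", "string", "nop", "other",
]
BIT = {name: 1 << i for i, name in enumerate(CATEGORIES)}


def node_color(categories_in_block):
    """OR together one bit for each instruction category present in a block."""
    color = 0
    for name in categories_in_block:
        color |= BIT[name]
    return color


# Two blocks with the same set of categories get the same color,
# regardless of instruction order or repetition.
block_a = ["stack", "arithmetic", "arithmetic", "branch"]
block_b = ["branch", "stack", "arithmetic"]
print(node_color(block_a) == node_color(block_b))  # True
print(1 << len(CATEGORIES))  # 16384 possible colors
```

Because the color records only which categories occur, most of the 2^14 values never show up, which is consistent with the "sparse in practice" remark.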

Modeling approach

some amount of code → feature vector

“decompiles to <push ebp,...”

“contains 27 occurrences of [graphlet figure]”
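A minimal sketch of turning "some amount of code" into a feature vector, assuming features are simple occurrence counts over a fixed vocabulary. The feature names and the function `feature_vector` are hypothetical and only illustrate the idea.

```python
from collections import Counter


def feature_vector(observed, vocabulary):
    """Count how often each vocabulary feature occurs in one unit of code."""
    counts = Counter(observed)
    return [counts[name] for name in vocabulary]


# Hypothetical feature names and observations, for illustration only.
vocabulary = ["idiom:push ebp ; * ; mov esp, ebp", "ngram:8d45f8", "graphlet:g17"]
observed = ["ngram:8d45f8"] * 3 + ["idiom:push ebp ; * ; mov esp, ebp", "ngram:ffffff"]
print(feature_vector(observed, vocabulary))  # [1, 3, 0]
```

The same function applies whether the unit of code is a basic block, a function, or a whole program; only the list of observed features changes.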

Granularity: basic block, function, whole program

Compiler toolchain

source language: C++, C, F77

optimized, not optimized

versions: 3.4, 4.2, 4.4; 2003, 2005, 2008; 10, 11

Toolchain details [ISSTA 2011]

compiler family: GNU, Intel, Microsoft

source language: C, C++, Fortran

version: [several]

optimization level: low, high

Model as Conditional Random Field

labels on functions: language, family, optimization, version

Instruction sequence features

Summary graphlet features

Evaluation

Language, Compiler, Optimization, Version

Functions Individually (SVM) vs. Linear CRF

.987

.971

.616

.910

.998

.993

.910

.999

same label likely: statistical dependencies

MSVC code generation changes little between versions

Program authorship

for(int i=0; i<sz;++i){// etc

std::vector<int>::iterator it = foo.begin();

while(it != foo.end()) {// etc

I C++

Long-range control flow

Summary graphlets

basic blocks

supergraphlets

merged instruction summaries


Interprocedural graphlets

Unique “color” for external functions: FPRINTF, FOPEN

Anonymous internal functions: [local]
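The coloring rule for interprocedural graphlets can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the call graph is given as a list of caller/callee name pairs, the set of external functions is known, and `color_call_graph` is a hypothetical helper.

```python
def color_call_graph(call_edges, external_functions):
    """Relabel call-graph nodes: external functions keep a unique color
    (their name), internal functions all become the anonymous '[local]'."""
    def color(name):
        return name.upper() if name in external_functions else "[local]"
    return [(color(src), color(dst)) for src, dst in call_edges]


# Hypothetical call graph: main -> parse_args -> fprintf, main -> fopen.
edges = [("main", "parse_args"), ("main", "fopen"), ("parse_args", "fprintf")]
print(color_call_graph(edges, {"fopen", "fprintf"}))
# [('[local]', '[local]'), ('[local]', 'FOPEN'), ('[local]', 'FPRINTF')]
```

Anonymizing internal functions keeps the feature independent of symbol names, which stripped binaries do not have, while library calls remain identifiable through imports.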

Program-author dataset

Ideal: 1. Author labels 2. Parallel corpus 3. Linguistic homogeneity

Contest programs (CJ): several contest years; 8-16 programs per contestant; C and C++ programs

Course programs (CS 537): C programs; some provided/template code

Author attribution

391,056 N-grams; 54,705 idioms; 37,358 graphlets; 117,997 supergraphlets; 8,062 call graphlets; 152 library calls

1,900 features

         CJ 2009   CJ 2010   CS 537
Exact    77.8%     76.8%     38.4%
Top-5    94.7%     93.7%     84.3%

20 programmers

1. CS 537 has much less data

2. Template code + instructor guidance confound results

Students have less distinctive styles?
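The "Exact" and "Top-5" rows can be read as top-k accuracy with k = 1 and k = 5. A minimal sketch of that metric follows, assuming the classifier returns a ranked list of candidate authors per program; the function name and the toy data are hypothetical and unrelated to the reported numbers.

```python
def top_k_accuracy(ranked_predictions, true_authors, k):
    """Fraction of programs whose true author is among the top k candidates."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_authors)
        if truth in ranked[:k]
    )
    return hits / len(true_authors)


# Toy rankings for three programs, all written by "alice".
ranked = [
    ["alice", "bob", "carol"],
    ["bob", "alice", "carol"],
    ["carol", "bob", "alice"],
]
truth = ["alice", "alice", "alice"]
print(top_k_accuracy(ranked, truth, 1))  # exact match: 1 of 3
print(top_k_accuracy(ranked, truth, 2))  # top-2: 2 of 3
```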

Summary

Questions

Backup slides follow

Program provenance

System: glibc static code, library imports

Link & post-link: whole-program optimization, rewriting tools, obfuscation tools

Compiler: family, version, optimization level, source language

Authorship

Instruction-level features

IDIOMS

‹mov [imm], rax ; sub [imm], rax›

‹push ebp ; * ; mov esp, ebp›

single-instruction wildcard, opcode class abstraction, hidden immediates

N-GRAMS

4-grams: <4889c2be>, <018b45f8>

3-grams: <8d45f8>
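Both feature types can be sketched in a few lines. This is an illustration only: it assumes instructions have already been disassembled and abstracted into strings (with immediates hidden as `[imm]`), and the helpers `byte_ngrams` and `idiom_matches` are hypothetical names, not part of any tool mentioned in the talk.

```python
def byte_ngrams(code: bytes, n: int):
    """All overlapping n-byte substrings of the code, as hex strings."""
    return [code[i:i + n].hex() for i in range(len(code) - n + 1)]


def idiom_matches(instructions, idiom):
    """True if the idiom occurs as a contiguous run of instructions.
    The pattern element '*' matches exactly one arbitrary instruction."""
    m = len(idiom)
    for start in range(len(instructions) - m + 1):
        window = instructions[start:start + m]
        if all(p == "*" or p == ins for p, ins in zip(idiom, window)):
            return True
    return False


code = bytes.fromhex("5589e583ec2c575653")
print(byte_ngrams(code, 3)[:2])  # ['5589e5', '89e583']

# Hypothetical abstracted instruction stream.
instructions = ["push ebp", "push edi", "mov esp, ebp", "sub [imm], esp"]
print(idiom_matches(instructions, ["push ebp", "*", "mov esp, ebp"]))  # True
print(idiom_matches(["push ebp", "mov esp, ebp"], ["push ebp", "*", "mov esp, ebp"]))  # False
```

N-grams need no disassembly at all, while idioms depend on a correct instruction decoding but tolerate differing immediates and one intervening instruction.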

Digression: finding code

program binary: … 55 89 e5 83 ec 2c 57 56 53 8b 45 0c 8b 00 a3 90 a3 05 08 85 c0 74 2b 83 c4 …

code, decoded from one offset: push ebp ; mov esp, ebp ; sub 0x2c, esp ...

code, decoded from another offset: in 0x83,eax ; in [dx],al ; sub 0x57,al ...

[AAAI 2008]

Model compiler-specific “function entry points”

Compute max-likelihood labels

F1 from .86 - .99 depending on compiler

Byte-sequence model [PASTE 2010]

program binary regions: GCC, GCC, ICC, ICC

Compiler labels modeled as CRF...

sequence labels $y_{i-1}, y_i, \ldots, y_j, y_{j+1} \in \{\text{icc}, \text{gcc}, \ldots, \text{data}\}$

… 55 89 e5 83 ec 2c 57 56 53 8b 45 0c 8b 00 a3 90 a3 05 08 85 c0 74 2b 83 c4 …

Digression: Conditional Random Fields

weights (learned), labels, evidence, feature functions
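The four annotations above label parts of a formula that did not survive extraction. For reference, the standard form of a linear-chain conditional random field is given below; the symbols $\lambda_k$ (weights), $\mathbf{y}$ (labels), $\mathbf{x}$ (evidence) and $f_k$ (feature functions) are the usual textbook notation and are an assumption about what the slide showed.

```latex
P(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\Big( \sum_{i} \sum_{k} \lambda_k \, f_k(y_{i-1}, y_i, \mathbf{x}, i) \Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\Big( \sum_{i} \sum_{k} \lambda_k \, f_k(y'_{i-1}, y'_i, \mathbf{x}, i) \Big)
```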

Linear chain CRF

exact inference tractable

Idiom feature function

$$f_u(x_i) = \begin{cases} 1 & \text{if } x_i \text{ decompiles to idiom } u \\ 0 & \text{otherwise} \end{cases}$$
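Exact inference in a linear chain is tractable because the best label sequence can be found by dynamic programming (the Viterbi algorithm). The sketch below is a generic Viterbi decoder over additive scores, not the authors' implementation; the two scoring functions and the evidence strings are hypothetical stand-ins for learned weights times feature functions.

```python
def viterbi(observations, labels, emission_score, transition_score):
    """Return the highest-scoring label sequence for a linear-chain model."""
    best = [{y: emission_score(y, observations[0]) for y in labels}]
    back = []
    for x in observations[1:]:
        scores, pointers = {}, {}
        for y in labels:
            # Best previous label for ending in y at this position.
            prev = max(labels, key=lambda p: best[-1][p] + transition_score(p, y))
            scores[y] = best[-1][prev] + transition_score(prev, y) + emission_score(y, x)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores: evidence that names a compiler favors that label,
# and adjacent positions prefer to share a label.
def emission_score(label, evidence):
    return 2.0 if evidence == label + "-like" else 0.0


def transition_score(prev_label, label):
    return 1.0 if prev_label == label else 0.0


evidence = ["gcc-like", "unclear", "gcc-like", "icc-like", "icc-like"]
print(viterbi(evidence, ["gcc", "icc"], emission_score, transition_score))
# ['gcc', 'gcc', 'gcc', 'icc', 'icc']
```

The ambiguous second position is labeled gcc because of its neighbors, which is the kind of statistical dependency between adjacent labels that the chain structure is meant to capture.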

Byte-sequence model [PASTE 2010]

94% accuracy labeling mixed-compiler sequences

+18% accuracy increase in function finding

Feature selection

(condor)

training programs


Distance metrics

distance metric

Mahalanobis distance

Equivalently:

How do we get A?
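The formulas on this slide were images and were not recovered. For reference, the standard definition of a Mahalanobis distance parameterized by a positive semi-definite matrix $A$ is shown below, together with its equivalent form as a Euclidean distance after a linear transformation. The factor $L$ with $A = L^\top L$ is conventional notation and an assumption here; only $A$ is named on the slide.

```latex
d_A(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^\top A \, (\mathbf{x} - \mathbf{y})},
\qquad A \succeq 0

\text{Equivalently, with } A = L^\top L: \qquad
d_A(\mathbf{x}, \mathbf{y}) = \lVert L\mathbf{x} - L\mathbf{y} \rVert_2
```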

Style clustering

Programs, no training data

[Figure: a collection of unlabeled program binaries]

Conclude: X

Transfer learning

[Figure: groups of program binaries labeled Alice, Bob, ?, ?]

Large-margin Nearest Neighbors (LMNN), Weinberger, Saul 2009

semi-definite program ☹, one-time cost ☺
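A minimal sketch of how a learned metric is used once it exists. It does not implement LMNN training (the semi-definite program); it only assumes some linear transformation `L` has been learned elsewhere, and shows nearest-neighbor labeling under the induced distance. All function names, points and the matrix are hypothetical toy values.

```python
import math


def transform(L, x):
    """Apply the linear map L (a list of rows) to the vector x."""
    return [sum(l_ij * x_j for l_ij, x_j in zip(row, x)) for row in L]


def metric_distance(L, x, y):
    """Euclidean distance after mapping both points through L."""
    tx, ty = transform(L, x), transform(L, y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(tx, ty)))


def nearest_label(L, query, labeled_points):
    """Label of the training point closest to the query under the metric."""
    return min(labeled_points, key=lambda p: metric_distance(L, query, p[0]))[1]


labeled = [([0.0, 10.0], "Alice"), ([3.0, 0.0], "Bob")]
query = [0.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
learned = [[1.0, 0.0], [0.0, 0.1]]  # toy metric that down-weights feature 2

print(nearest_label(identity, query, labeled))  # Bob
print(nearest_label(learned, query, labeled))   # Alice
```

The expensive step is learning the metric once; afterwards, labeling a new program costs only distance computations.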

Component models

semi-open world provenance

component sharing (e.g. command and control)

programmer movement between groups

mixture of styles

style vs. functionality?

infinite mixture models

interpreting style clusters

Social code networks

program binaries
