Unconventional Code Constructs

© 2006 Nathan Rosenblum

March 2006

The New Dyninst Code Parser: Binary Code Isn't as Simple as it Used to Be

Nathan Rosenblum
University of Wisconsin

[email protected]

Binary Analysis

Processing of the binary code to extract syntactic and symbolic information from many sources:
• Symbol tables (if present)
• Decode (disassemble) instructions
• Control-flow information: basic blocks, loops, functions
• Data-flow information: from basic register information to highly sophisticated (and expensive) analyses.

Products of Binary Analysis

High-level organization and characteristics
• Function entry/exit points
• Intra-procedural call graph
• Inter-procedural control-flow graph
• Exception handlers
• Jump tables
• Virtual function tables

Abstract assembly representation

Data-flow characteristics
• Register liveness (for instrumentation, modification)

Uses of Binary Analysis

• Debugging
• Testing
• Performance profiling
• Performance modeling
• Behavior modeling
• Dynamic modification
• Binary rewriting
• Reverse engineering

Binary Analysis Tool Goals

Safe: Eliminate false positives to make instrumentation safe
Accurate: Minimize false negatives for a complete view of the binary
Opportunistic: Use all available information and techniques to maximum effect
Resilient: Tools are robust to unexpected and unusual applications
Automated: Analysis does not depend on human interaction
Complementary: Produce products compatible with source-level analysis tools

Why is Binary Analysis Hard?

Source code:

Func foo()
{
    switch(a) {
    }
}

The compiler translates this into the binary:

push %ebp
mov %esp, %ebp
mov [0x1d], %eax
jmp *%eax

Current Approaches

Linear disassembly of binaries is insufficient
• Symbol tables often lie, or are absent
• Functions are not address ranges, may be non-contiguous

Parsing based on program control flow
• Commonly used approach: UQBT, LEEL, RAD, IDA-Pro, Dyninst
• Must contend with gaps in known code regions after parsing

Dyninst Control Flow Parsing

Opportunistic parsing:
• Utilizes symbol table and other information when available (and sensible)

Provides a more accurate view of the binary than linear disassembly

Addresses the problem of gaps in the binary through speculative parsing
• Heuristics to identify function preambles

Control Flow Traversal Illustrated

<func foo>:
00: mov [a8], r1
04: mov [ac], r2
08: add r1, r2, r3
0c: cmp r3, 0
10: bne 24
14: call <bar>
18: add r3, 8, r3
1c: call <baz>
20: jmp 28
24: mul r2, 2, r3
28: sub r1, r3, r1
. . .

[Figure: control-flow graph of foo with blocks starting at 00, 14, 24, and 28]

• Parsing follows control flow
• Control transfers are edges in the CFG
• Target blocks can be parsed in any order
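
The traversal on this slide can be sketched in a few lines of Python. This is only an illustration over the toy listing, not the Dyninst implementation: the `INSNS` table, the fixed 4-byte instruction size, and the function name `find_block_starts` are assumptions made for the example, and calls are treated as falling through.

```python
# Toy instruction stream from the slide: address -> (kind, branch target).
# Only control-flow-relevant details are kept; everything else is "other".
INSNS = {
    0x00: ("other", None), 0x04: ("other", None), 0x08: ("other", None),
    0x0c: ("other", None), 0x10: ("bne", 0x24),
    0x14: ("call", None), 0x18: ("other", None), 0x1c: ("call", None),
    0x20: ("jmp", 0x28), 0x24: ("other", None), 0x28: ("other", None),
}
INSN_SIZE = 4  # assumed fixed-width encoding, as in the toy listing


def find_block_starts(entry):
    """Follow control flow from `entry`; return sorted basic-block start addresses."""
    starts = {entry}
    seen = set()
    worklist = [entry]
    while worklist:
        addr = worklist.pop()  # any order works
        while addr in INSNS and addr not in seen:
            seen.add(addr)
            kind, target = INSNS[addr]
            if kind == "bne":
                # Conditional branch: both the target and the fall-through start blocks.
                starts.update({target, addr + INSN_SIZE})
                worklist.append(target)
            elif kind == "jmp":
                # Unconditional branch: only the target is reachable.
                starts.add(target)
                worklist.append(target)
                break
            addr += INSN_SIZE
    return sorted(starts)


print([format(a, "02x") for a in find_block_starts(0x00)])
```

The block starts it finds are the same four addresses shown in the slide's figure.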

Control Flow Traversal Illustrated

<func foo>:
00: mov [a8], r1
04: mov [ac], r2
08: add r1, r2, r3
0c: cmp r3, 0
10: bne 24
14: call <bar>
18: add r3, 8, r3
1c: call <baz>
20: jmp 28
24: mul r2, 2, r3
28: sub r1, r3, r1
. . .

• Call sites determine location of functions
• Targets of calls are added to the function parsing work list

Known functions: foo, quux, quuux, bar, baz
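
A minimal sketch of the function work list follows. The `CALL_TARGETS` table is hypothetical data standing in for what the parser would find at call sites (only foo's calls to bar and baz come from the slide), and `discover_functions` is an illustrative name, not a Dyninst API.

```python
from collections import deque

# Hypothetical call-site data: function -> call targets seen while parsing it.
# Only foo's calls to bar and baz come from the slide; the rest is made up.
CALL_TARGETS = {
    "foo": ["bar", "baz"],
    "bar": [],
    "baz": ["bar"],
    "quux": [],
    "quuux": [],
}


def discover_functions(seeds):
    """Parse functions from a work list, adding every new call target to it."""
    known = list(seeds)      # discovery order, seeds first
    pending = deque(seeds)   # functions still waiting to be parsed
    while pending:
        func = pending.popleft()
        for callee in CALL_TARGETS.get(func, []):
            if callee not in known:
                known.append(callee)
                pending.append(callee)
    return known


# Seeds might come from the symbol table; bar and baz are found via call sites.
print(discover_functions(["foo", "quux", "quuux"]))
```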

Binary Parsing Challenges

• Pointer-based control transfer
• Non-returning calls
• Non-contiguous code sections
• Tail calls
• Gaps in the binary
• Exception handlers
• Shared code and multiple entry representation

Non-returning Call Sites

Some functions will not return
• Examples: abort, exit

Code following call site may not be valid

Even if names are available, calls may be hard to detect:
dfaerror fatal exit

Detecting Non-Returning Functions

Goal: detect non-returning functions from first principles

Identify distinguishing features of non-returning functions
• Wide variety of behavior in non-returning functions makes this difficult

Example: operations in abort

abort() ->
    sigaction()
    IO_flush_all()
    raise(SIGABRT) ->
        kill(getpid(),sig)
    hlt [privileged instruction]
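
One simple way to approach this from first principles is a fixed-point computation: a function is treated as non-returning if no path from its entry reaches a return without first calling a function already known not to return. The sketch below is an assumption-laden illustration, not the method the talk proposes. The `CFGS` data is hand-written (including a pretend `ret` after the call to abort), and callees without a CFG are assumed to return.

```python
# Hand-written, hypothetical CFGs: function -> block -> calls / ret flag / successors.
CFGS = {
    "abort": {
        "entry": {"calls": ["raise"], "ret": False, "succs": ["halt"]},
        "halt": {"calls": [], "ret": False, "succs": []},  # ends in hlt, no ret
    },
    "raise": {
        "entry": {"calls": ["kill"], "ret": True, "succs": []},
    },
    "__assert_fail": {
        "entry": {"calls": ["__libc_write"], "ret": False, "succs": ["after_write"]},
        # Pretend the bytes after the call to abort decode to a ret.
        "after_write": {"calls": ["abort"], "ret": True, "succs": []},
    },
}


def can_return(func, cfgs, noreturn):
    """True if some path from the entry block reaches a return."""
    seen, stack = set(), ["entry"]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        block = cfgs[func][name]
        if any(callee in noreturn for callee in block["calls"]):
            continue  # control never gets past this call
        if block["ret"]:
            return True
        stack.extend(block["succs"])
    return False


def find_noreturn(cfgs, seeds=()):
    """Grow the non-returning set until nothing changes."""
    noreturn = set(seeds)
    changed = True
    while changed:
        changed = False
        for func in cfgs:
            if func not in noreturn and not can_return(func, cfgs, noreturn):
                noreturn.add(func)
                changed = True
    return noreturn


print(sorted(find_noreturn(CFGS)))
```

A sketch like this cannot classify functions that never return for reasons outside the CFG, such as raising a fatal signal, which is part of why the slide calls the problem difficult.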

Non-returning Call Sites

000214d0 <__assert_fail>:
. . .
2160f: e8 cc db 0a 00 call cf1e0 <__libc_write>
21614: e8 07 7f 00 00 call 29520 <abort>
21619: 90 nop
2161a: 90 nop
2161b: 90 nop
2161c: 90 nop
2161d: 90 nop
2161e: 90 nop
2161f: 90 nop
00021620 <__assert_perror_fail>:
21620: 55 push %ebp
21621: 89 e5 mov %esp,%ebp
. . .

Example: GNU libc library routines
• Call to abort does not return
• Parser will naively follow control into the following region
• Bytes following call site may not be code (e.g., jump tables, other functions, string data)
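
As a hedged illustration of the listing above, the snippet below decodes the two `call rel32` instructions and decides whether a parser should continue past each call site. The `SYMBOLS` map copies names from the listing. The `NON_RETURNING` set is an assumed hand-maintained list, which is the kind of special case the talk hopes to minimize.

```python
import struct


def call_target(addr, raw):
    """Decode an x86 `call rel32` (opcode 0xE8) located at address `addr`."""
    if len(raw) != 5 or raw[0] != 0xE8:
        raise ValueError("not a call rel32 instruction")
    (rel,) = struct.unpack("<i", raw[1:])   # signed little-endian displacement
    return (addr + 5 + rel) & 0xFFFFFFFF    # relative to the next instruction


# Names taken from the listing; the non-returning set is an assumption.
SYMBOLS = {0xCF1E0: "__libc_write", 0x29520: "abort"}
NON_RETURNING = {"abort", "exit"}


def falls_through(addr, raw):
    """Should parsing continue at the instruction after this call?"""
    return SYMBOLS.get(call_target(addr, raw)) not in NON_RETURNING


write_call = bytes.fromhex("e8ccdb0a00")   # at 0x2160f
abort_call = bytes.fromhex("e8077f0000")   # at 0x21614
print(hex(call_target(0x2160F, write_call)), falls_through(0x2160F, write_call))
print(hex(call_target(0x21614, abort_call)), falls_through(0x21614, abort_call))
```

With this check the parser stops at 21614 instead of walking through the nop padding into `__assert_perror_fail`.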

Non-contiguous Code

[Figure: Func Foo]

• Functions are not address ranges
• Symbol table representation fails
• Many sources of non-contiguous layout:
    • Jump tables
    • Data (strings, etc)
    • Unparsed code
    • Exception handlers
    • Padding or junk bytes

Non-contiguous Code

. . .
77e7b1cb: 83 41 04 04 addl $0x4,0x4(%ecx)
77e7b1cf: 5d pop %ebp
77e7b1d0: c2 0c 00 ret $0xc
77e7b1d3: 68 f5 06 00 00 push $0x6f5
77e7b1d8: eb 05 jmp 0x77e7b1df
77e7b1da: 68 e6 06 00 00 push $0x6e6
77e7b1df: e8 bb 86 02 00 call 0x77ea389f
77e7b1e4: 4c ba e7 77
77e7b1e8: 34 b2 e7 77
77e7b1ec: b5 b1 e7 77
77e7b1f0: 0c 9f e8 77
77e7b1f4: 96 37 e8 77
77e7b1f8: cf b1 e7 77
77e7b1fc: 00 00 00 00 01 01 01 02 02 02 03 03 04 02 05
77e7b20c: 3c 10 cmp $0x10,%al
77e7b20e: 0f 85 a6 3b 02 00 jne 0x77e9edba
. . .

Example: Microsoft Word
• Jump table separates valid instruction sequences
• Control following call site is invalid
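
To see why the bytes at 77e7b1e4 through 77e7b1fb are data rather than code, it helps to read them as a table of 32-bit little-endian addresses. That reading is an assumption based on the slide's "jump table" label. The sketch below decodes the six entries, and `decode_jump_table` is an illustrative helper name.

```python
import struct

# The six 4-byte entries at 0x77e7b1e4..0x77e7b1fb, copied from the listing.
TABLE = bytes.fromhex(
    "4cbae777" "34b2e777" "b5b1e777" "0c9fe877" "9637e877" "cfb1e777"
)


def decode_jump_table(raw):
    """Interpret `raw` as consecutive 32-bit little-endian code addresses."""
    count = len(raw) // 4
    return list(struct.unpack("<%dI" % count, raw[: count * 4]))


for target in decode_jump_table(TABLE):
    print(hex(target))
```

The last entry decodes to 0x77e7b1cf, the address of the `pop %ebp` instruction a few lines earlier in the same listing, which supports reading these bytes as branch targets.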

Named Non-contiguous Sections

00021060 <__duplocale>:
. . .
210f0: lock cmpxchg %ecx,0x2968(%ebx)
210f8: jne 2118e
210fe: xor %esi,%esi
21100: cmp $0x6,%esi
. . .
0002118e <_L_mutex_lock_78>:
2118e: lea 0x2968(%ebx),%ecx
21194: call ea0f0
21199: jmp 210fe

Example: GNU libc library routines
• Looks like shared code
• Fragment is not a real function

Named Non-contiguous Sections

Recognizing function fragments
• Have a symbol table entry
• Reached by branches from one function
• Branch back to one function

Use a combination of CFG and symbol table clues
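
The three clues above can be combined into a simple predicate. This is a sketch under assumptions: the symbol set and branch maps are hand-built from the `__duplocale` example, the helper name is made up, and requiring the source and target to be the same function is an interpretation of "one function" on the slide.

```python
def is_function_fragment(region, symtab, branches_in, branches_out):
    """Heuristic: a named region entered from, and returning to, one function."""
    has_symbol = region in symtab
    sources = branches_in.get(region, set())    # functions that branch into it
    targets = branches_out.get(region, set())   # functions it branches back to
    return has_symbol and len(sources) == 1 and sources == targets


# Hand-built facts from the __duplocale listing.
SYMTAB = {"__duplocale", "_L_mutex_lock_78"}
BRANCHES_IN = {"_L_mutex_lock_78": {"__duplocale"}}    # jne 2118e
BRANCHES_OUT = {"_L_mutex_lock_78": {"__duplocale"}}   # jmp 210fe

print(is_function_fragment("_L_mutex_lock_78", SYMTAB, BRANCHES_IN, BRANCHES_OUT))
```

A region that passes the test can be folded into its parent function instead of being reported as a separate function.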

Tail Calls

Func Foo
. . .
call <bar>

Func Bar
. . .
jmp <quux>

Func Quux
. . .
ret

• Compiler has joined two functions into one
• Looks like non-contiguous shared code

Gap Parsing

[Figure: Func Foo, an unidentified section of code, Func Bar]

• Gaps between known code regions may contain undiscovered functions
• Targets of indirect calls

Speculative parsing: pattern-based heuristics to recognize function prologues in gaps
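
A pattern-based prologue scan can be sketched as a byte search over each gap. The pattern `55 89 e5` is the `push %ebp; mov %esp,%ebp` encoding shown in the `__assert_perror_fail` listing. The single-pattern heuristic, the `find_prologues` name, and the gap representation are assumptions for illustration only.

```python
PROLOGUE = bytes.fromhex("5589e5")  # push %ebp; mov %esp,%ebp


def find_prologues(image, base, gaps):
    """Return candidate function entries found inside unparsed gaps.

    `image` holds the raw bytes loaded at address `base`; each gap is a
    half-open (start, end) address range not covered by any parsed code.
    """
    hits = []
    for start, end in gaps:
        off = image.find(PROLOGUE, start - base, end - base)
        while off != -1:
            hits.append(base + off)
            off = image.find(PROLOGUE, off + 1, end - base)
    return hits


# Bytes after the call to abort in __assert_fail: seven nops, then a prologue.
image = bytes.fromhex("90" * 7 + "5589e5")
print([hex(a) for a in find_prologues(image, 0x21619, [(0x21619, 0x21623)])])
```

Each hit is only a candidate. A real parser would still have to confirm it, for example by checking that control-flow traversal from that address decodes into valid instructions.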

Exceptions

Exception handling code is normally unreachable

Use information in the binary where available
• Example: Linux ELF exception tables

C++ style exception catch block:

push %ebp
mov %esp,%ebp
push %ebx
sub $0x24,%esp
movl $0x6,0xfffffff8(%ebp)
mov 0x8(%ebp),%eax
mov %eax,(%esp)
call 804aafa
jmp 804abe9
mov %eax,0xfffffff4(%ebp)
cmp $0x2,%edx
je 804ab58
. . .
mov 0xfffffff4(%ebp),%eax
mov %eax,(%esp)
call 804a388
add $0x24,%esp
pop %ebx
pop %ebp
ret

Shared Code Models

[Figure: Func A and Func B with a region of Shared Code]

Code may be shared between functions
• Multiple entry functions
• Compiler optimizations

Analysis tools must be able to recognize and handle overlapping control flow
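
One way to recognize overlapping control flow is to compute, for every basic block, the set of function entries that can reach it. Blocks owned by more than one function are shared. The CFG below is hypothetical, and the helper names are illustrative rather than taken from Dyninst.

```python
def reachable(cfg, entry):
    """All blocks reachable from `entry` by following CFG edges."""
    seen, stack = set(), [entry]
    while stack:
        block = stack.pop()
        if block not in seen:
            seen.add(block)
            stack.extend(cfg.get(block, ()))
    return seen


def shared_blocks(cfg, entries):
    """Map each block reached from more than one function to its owners."""
    owners = {}
    for func, entry in entries.items():
        for block in reachable(cfg, entry):
            owners.setdefault(block, set()).add(func)
    return {block: funcs for block, funcs in owners.items() if len(funcs) > 1}


# Hypothetical CFG: two functions whose control flow merges in "shared".
CFG = {"a_entry": ["shared"], "b_entry": ["shared"], "shared": ["exit"], "exit": []}
ENTRIES = {"A": "a_entry", "B": "b_entry"}
print(shared_blocks(CFG, ENTRIES))
```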

Summary of Binary Analysis Techniques

Control flow traversal is a powerful tool for addressing the challenges of modern binaries
• Lying/missing symbol tables
• Data/code disambiguation
• Jump tables

Speculative parsing techniques can be useful for expanding code coverage
• Gaps in code
• Indirect calls and branches

Incidence of Shared Code in Binaries

[Figure: histogram of the number of binaries (y-axis, 0 to 600) against functions containing shared code (x-axis: 0, 4, 16, 64, 256, 1024)]

Parsed 828 Linux/x86 binaries
• 238 contained shared code

Most binaries contain only a few code-sharing functions

Some code sharing may be due to non-returning call sites

Where Do We Go From Here?

Are there good solutions from first principles?
• Almost certainly.
• We are just starting to explore the limits of such techniques.

Are special case solutions necessary?
• Again, almost certainly.
• We will try to use these as sparingly as possible.

Future Directions in Binary Analysis

Problem: code exists but is unreachable through standard control-flow traversal parsing
• Heuristics are a moving target

Existing opportunistic parsing techniques can help, but only to an extent
• Exception handlers, virtual function tables may be recoverable from the binary

Given the information we can recover from traditional techniques, can we synthesize additional information that will increase coverage of the binary?

Statistical Binary Parsing

Can we utilize known code to find unknown code?
• We have a partial parse of the binary
• Code in unknown regions of the binary will likely share characteristics with previously identified code

Identify code in unknown regions:
• Create a probabilistic model of valid code
• Identify sections of unknown regions in the binary that are similar to valid code

Binary Modeling Techniques

Code idioms are one possibility for validating potential code
• Function preambles, jump table bounds tests, system call stubs, case statements

Idioms can be identified manually

Model can be trained to identify new idioms with machine learning techniques
• n-gram models, long-distance interaction

Unparsed code can be scored to indicate its statistical similarity to known code
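
As a rough illustration of n-gram scoring, the sketch below trains a bigram model over opcode mnemonics from already-parsed code and scores a candidate sequence by its average log-probability. The training sequences, the add-one smoothing, and the function names are assumptions for this example. The talk does not specify a particular model.

```python
import math
from collections import Counter


def train_bigrams(sequences):
    """Count opcode bigrams in known-good instruction sequences."""
    firsts, pairs, vocab = Counter(), Counter(), set()
    for seq in sequences:
        vocab.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            firsts[prev] += 1
            pairs[(prev, cur)] += 1
    return firsts, pairs, vocab


def score(seq, model):
    """Average log-probability per bigram, with add-one smoothing."""
    firsts, pairs, vocab = model
    size = len(vocab) + 1  # one extra slot for unseen opcodes
    steps = list(zip(seq, seq[1:]))
    total = sum(
        math.log((pairs[(prev, cur)] + 1) / (firsts[prev] + size))
        for prev, cur in steps
    )
    return total / max(len(steps), 1)


# Hypothetical training data standing in for code found by control-flow parsing.
known = [
    ["push", "mov", "sub", "call", "ret"],
    ["push", "mov", "call", "pop", "ret"],
]
model = train_bigrams(known)
print(score(["push", "mov", "call"], model))   # resembles known code
print(score(["ret", "push", "ret"], model))    # does not
```

Higher (less negative) scores indicate sequences that look more like the known code. A threshold on the score would be one way to decide which gap regions are worth parsing speculatively.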

Open Questions in Binary Analysis

What learning techniques will yield the best results?

How can we overcome the relative dearth of information in binaries with very little code reachable through control flow analysis?
• Incorporate information from analysis of other binaries

What techniques will allow us to accurately identify the range of recognizable code?

Questions?

Backup Slides

Shared Code Models

[Figure: two models side by side. Shared Code: Func A and Func B. Multiple Entry: Entry A and Entry B.]

What is the difference from the perspective of the parser?

A Choice of Abstraction

Shared code and multiple entry models are similar
• Represent independent flows of control merging together

Shared model is a better fit for Dyninst
• Preserves semantic guarantees of function independence

Shared Code

000a94c0 <__waitpid>:
a94c0: cmpl $0x0,%gs:0xc
a94c8: jne a94e7
000a94ca <__waitpid_nocancel>:
a94ca: push %ebx
a94cb: mov 0x10(%esp,1),%edx
a94cf: mov 0xc(%esp,1),%ecx
a94d3: mov 0x8(%esp,1),%ebx
a94d7: mov $0x7,%eax
a94dc: int $0x80
a94de: pop %ebx
a94df: cmp $0xfffff001,%eax
a94e4: jae a9513
. . .

Code common to the two functions is marked as shared.

Example: GNU libc library routines