Faster, More Effective Flowgraph-based Malware Classification

Silvio Cesare [email protected]
http://www.foocodechu.com
Ph.D. Candidate, Deakin University


Post on 09-Feb-2016



Page 2: Who am I and where did this talk come from?

Ph.D. Candidate at Deakin University.

Research
◦ Malware detection.
◦ Automated vulnerability discovery (check out my other talk in the main conference).

Did a Masters by research in malware
◦ “Fast automated unpacking and classification of malware”.
◦ Presented last year at Ruxcon 2010.

This current work extends last year’s work.

Page 3: Motivation

Traditional AV works well on known samples.

Doesn’t detect unknown samples.

Doesn’t detect “suspiciously similar” samples.

Uses strings as a signature or “birthmark”.

Compares birthmarks by equality.

Page 4: What can be done?

Birthmarks can be program structure.

Program structure is more static among malware variants.

Birthmarks can be compared using “approximate similarity”.

This makes it possible to detect unknown samples that are suspiciously similar to known malware.

It vastly reduces the number of required signatures.

Page 5: The Software Similarity Problem

[Figure: Program p -> Birthmark; Program q -> Birthmark; the two birthmarks are compared ("Similar?"), giving either "MATCH!" or "Different".]

Page 6: The Control Flow Birthmark

Control flow is more invariant among polymorphic and metamorphic malware.

A flowgraph is a directed graph representing control flow.

There is a control flow graph for every procedure.

There is one call graph per program.

Page 7: Graphs

Control flow graph of _main (one basic block per group):

    lea 0x4(%esp),%ecx
    and $0xfffffff0,%esp
    pushl -0x4(%ecx)
    push %ebp
    mov %esp,%ebp
    push %ecx
    sub $0x24,%esp
    call 4011b0 <___main>
    movl $0x0,-0x8(%ebp)
    jmp 40115f <_main+0x2f>

    movl $0x4020a0,(%esp)
    call 4011b8 <_puts>
    addl $0x1,-0x8(%ebp)

    cmpl $0x9,-0x8(%ebp)
    jle 40114f <_main+0x1f>

    add $0x24,%esp
    pop %ecx
    pop %ebp
    lea -0x4(%ecx),%esp
    ret

Call graph nodes: Proc_0, Proc_1, Proc_2, Proc_3, Proc_4
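A minimal sketch of how the control flow graph on this slide could be held in memory, assuming an adjacency-list representation. The block names (entry, body, cond, exit) are illustrative labels, not from the talk; the edges follow the jmp and jle targets in the listing. The helper finds back edges, which correspond to loops in the procedure.

```python
# Control flow graph of the _main example as an adjacency list.
# Block names are hypothetical labels chosen for this sketch.
cfg = {
    "entry": ["cond"],          # ends with: jmp 40115f <_main+0x2f>
    "body": ["cond"],           # falls through into the comparison block
    "cond": ["body", "exit"],   # jle taken -> body, otherwise fall through
    "exit": [],                 # ends with: ret
}


def back_edges(graph, start):
    """Return edges (u, v) where v is on the current DFS path (a loop)."""
    found, on_path, visited = [], set(), set()

    def visit(u):
        visited.add(u)
        on_path.add(u)
        for v in graph[u]:
            if v in on_path:
                found.append((u, v))
            elif v not in visited:
                visit(v)
        on_path.discard(u)

    visit(start)
    return found


print(back_edges(cfg, "entry"))  # [('body', 'cond')]
```

The single back edge reflects the one loop in the example, which calls _puts while the counter is at most 9.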

Page 8: Graph Equality

Known as the “Graph Isomorphism” problem.

Identifies equivalent “structure”.

Not known to be NP-complete, but no polynomial time algorithm is known.

Page 9: Graph Edit Distance

The minimum number of basic operations (such as inserting or deleting a node or an edge) needed to transform one graph into another.

If you know the distance between two objects, you know the similarity.

Computing it is NP-hard and infeasible in practice.

Page 10: Decompilation

[Figure: control flow graph with nodes L_0, L_1, L_2, L_3, L_4, L_5, L_6, L_7 and edges labelled "true".]

String representation: W|IEH}R

Decompiled procedure:

    proc() {
    L_0: while (v1 || v2) {
    L_1:   if (v3) {
    L_2:   } else {
    L_4:   }
    L_5: }
    L_7: return;
    }

Page 11: Q-Grams

Input is a string.

Extract all substrings of fixed size q.

Substrings are known as q-grams.

Let’s take q-grams of all decompiled graphs.

Example: the string W|IEH}R has the 4-grams W|IE, |IEH, IEH}, EH}R.
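The q-gram extraction on this slide can be sketched in a few lines. The function name `qgrams` is an assumption for illustration; the input is the decompiled string from the example and q = 4 matches the substrings shown.

```python
def qgrams(s: str, q: int) -> list[str]:
    """Return every substring of length q, in order of appearance."""
    if q <= 0:
        raise ValueError("q must be positive")
    # A string shorter than q yields no q-grams.
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("W|IEH}R", 4))  # ['W|IE', '|IEH', 'IEH}', 'EH}R']
```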

Page 12: Feature Vectors

A feature vector is an array <E1,...,En>.

It describes the number of occurrences of each feature.

Ei is the number of times the i-th feature occurs.

Let’s use the 500 most common q-grams as features.

We use feature vectors as birthmarks.
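A rough sketch of building such a feature vector, assuming each program is given as a list of decompiled procedure strings and that the feature set is the k most frequent q-grams over a reference corpus (k = 500 in the talk, a small k here). The function names and the toy corpus are hypothetical.

```python
from collections import Counter


def qgrams(s, q):
    """All substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def count_qgrams(procedures, q):
    """Count q-grams over all procedure strings of one program."""
    counts = Counter()
    for proc in procedures:
        counts.update(qgrams(proc, q))
    return counts


def top_features(corpus, q, k):
    """Pick the k most common q-grams across a corpus of programs."""
    total = Counter()
    for procedures in corpus:
        total.update(count_qgrams(procedures, q))
    return [gram for gram, _ in total.most_common(k)]


def feature_vector(procedures, features, q):
    """Occurrences of each chosen feature in one program."""
    counts = count_qgrams(procedures, q)
    return [counts[f] for f in features]


# Toy corpus: two programs, each a list of decompiled procedure strings.
corpus = [["W|IEH}R", "IEH}R"], ["W|IEH}R"]]
features = top_features(corpus, q=4, k=3)
print(features)
print(feature_vector(corpus[0], features, q=4))
```

Fixing the feature set once, from a corpus, keeps every birthmark the same length so that vectors from different programs are directly comparable.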

Page 13: Vector Distance

A vector is an n-dimensional point.

E.g. a 2D vector is <x,y>.

Computing the distance between two vectors is fast.
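The slide does not name the distance function, so the sketch below shows two common choices as an assumption: Manhattan (L1) and Euclidean (L2) distance. Both run in time linear in the vector length, which is what makes the comparison fast.

```python
import math


def manhattan(p, q):
    """L1 distance: sum of absolute coordinate differences."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(p, q))


def euclidean(p, q):
    """L2 distance: straight-line distance between the two points."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


print(manhattan([1, 2, 0], [0, 2, 3]))  # 4
print(euclidean([0, 0], [3, 4]))        # 5.0
```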

Page 14: Nearest Neighbour Search

Software similarity problem extended to similarity search over a database.

Find nearest neighbours (by distance) of a query.

Or find neighbours within a distance of the query.
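Both query types can be sketched with a brute-force scan, which serves as a baseline before any indexing. This assumes Manhattan distance and a database stored as a dict from sample name to feature vector; the names and vectors are made up for illustration.

```python
import heapq


def manhattan(p, q):
    """L1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def k_nearest(db, query, k):
    """Names of the k database entries closest to the query."""
    nearest = heapq.nsmallest(
        k, db.items(), key=lambda item: manhattan(item[1], query))
    return [name for name, _ in nearest]


def within_radius(db, query, r):
    """Names of all database entries within distance r of the query."""
    return [name for name, vec in db.items() if manhattan(vec, query) <= r]


# Hypothetical database of feature vectors keyed by sample name.
db = {"sample_a": [0, 0], "sample_b": [5, 5], "sample_c": [1, 0]}
print(k_nearest(db, [0, 1], 1))      # ['sample_a']
print(within_radius(db, [0, 1], 2))  # ['sample_a', 'sample_c']
```

Every query here computes one distance per database entry, which is the cost that a metric index aims to avoid.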

Page 15: The Software Similarity Search

[Figure: a query q and a malware sample p separated by distance d(p,q), with a search radius r; labels "Malware", "Query", "Query Malicious" and "Query Benign".]

Page 16: Metric Trees

The vector distance used here is a “metric”.

It has the mathematical properties of a metric, including symmetry and the triangle inequality.

This means you can do a nearest neighbour search without brute forcing the entire database!
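The slide does not say which metric tree is used. As one possible instance, here is a minimal vantage-point tree sketch with a range query, assuming Manhattan distance. It uses the triangle inequality to skip subtrees that cannot contain a match. All names are illustrative.

```python
def manhattan(p, q):
    """L1 distance, a metric on vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # vantage point stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # points with distance <= radius
        self.outside = outside  # points with distance > radius


def build_vp_tree(points, dist=manhattan):
    """Recursively split the points around a vantage point."""
    points = list(points)
    if not points:
        return None
    vantage, rest = points[0], points[1:]
    if not rest:
        return VPNode(vantage, 0, None, None)
    dists = [dist(vantage, p) for p in rest]
    radius = sorted(dists)[len(dists) // 2]
    inside = [p for p, d in zip(rest, dists) if d <= radius]
    outside = [p for p, d in zip(rest, dists) if d > radius]
    return VPNode(vantage, radius,
                  build_vp_tree(inside, dist), build_vp_tree(outside, dist))


def range_search(node, query, r, dist=manhattan):
    """Return all stored points within distance r of the query."""
    if node is None:
        return []
    d = dist(node.point, query)
    hits = [node.point] if d <= r else []
    # A match p satisfies d - r <= dist(vantage, p) <= d + r, so each
    # subtree is visited only if its distance band can overlap that range.
    if d - r <= node.radius:
        hits += range_search(node.inside, query, r, dist)
    if d + r > node.radius:
        hits += range_search(node.outside, query, r, dist)
    return hits


tree = build_vp_tree([(0, 0), (1, 0), (5, 5), (9, 9)])
print(sorted(range_search(tree, (0, 1), 2)))  # [(0, 0), (1, 0)]
```

In this small example the query never descends into the subtree holding (9, 9), because no point that far from the root vantage point can lie within the search radius.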

Page 17: Implementation

The system is 100,000 lines of C++ code.

The modules for this work are fewer than 3,000 lines of code.

The system translates x86 into an intermediate language (IL).

It performs analysis on the architecture-independent IL.

It unpacks malware using an application-level emulator.

Page 18: Evaluation – False Positives

Database of 10,000 malware.

Scanned 1,601 benign binaries.

10 false positives (10/1,601 ≈ 0.6%), less than 1%.

Using an additional refinement algorithm, this was reduced to 7 false positives.

Very small binaries have small signatures and cause weak matching.

Page 19: Evaluation - Effectiveness

Calculated similarity between Roron malware variants.

Compared results to Ruxcon 2010 work.

In the tables, highlighted cells indicate a positive match.

The more matches, the more effective the method is.

Page 20: Malware Variant Detection

Exact Matching (Ruxcon 2010)

        ao    b     d     e     g     k     m     q     a
    ao  -     0.44  0.28  0.27  0.28  0.55  0.44  0.44  0.47
    b   0.44  -     0.27  0.27  0.27  0.51  1.00  1.00  0.58
    d   0.28  0.27  -     0.48  0.56  0.27  0.27  0.27  0.27
    e   0.27  0.27  0.48  -     0.59  0.27  0.27  0.27  0.27
    g   0.28  0.27  0.56  0.59  -     0.27  0.27  0.27  0.27
    k   0.55  0.51  0.27  0.27  0.27  -     0.51  0.51  0.75
    m   0.44  1.00  0.27  0.27  0.27  0.51  -     1.00  0.58
    q   0.44  1.00  0.27  0.27  0.27  0.51  1.00  -     0.58
    a   0.47  0.58  0.27  0.27  0.27  0.75  0.58  0.58  -

Heuristic Approximate Matching (Ruxcon 2010)

        ao    b     d     e     g     k     m     q     a
    ao  -     0.70  0.28  0.28  0.27  0.75  0.70  0.70  0.75
    b   0.74  -     0.31  0.34  0.33  0.82  1.00  1.00  0.87
    d   0.28  0.29  -     0.50  0.74  0.29  0.29  0.29  0.29
    e   0.31  0.34  0.50  -     0.64  0.32  0.34  0.34  0.33
    g   0.27  0.33  0.74  0.64  -     0.29  0.33  0.33  0.30
    k   0.75  0.82  0.29  0.30  0.29  -     0.82  0.82  0.96
    m   0.74  1.00  0.31  0.34  0.33  0.82  -     1.00  0.87
    q   0.74  1.00  0.31  0.34  0.33  0.82  1.00  -     0.87
    a   0.75  0.87  0.30  0.31  0.30  0.96  0.87  0.87  -

Q-Grams

        ao    b     d     e     g     k     m     q     a
    ao  -     0.86  0.53  0.64  0.59  0.86  0.86  0.86  0.86
    b   0.88  -     0.66  0.76  0.71  0.97  1.00  1.00  0.97
    d   0.65  0.72  -     0.88  0.93  0.73  0.72  0.72  0.73
    e   0.72  0.80  0.87  -     0.93  0.80  0.80  0.80  0.80
    g   0.69  0.77  0.93  0.93  -     0.77  0.77  0.77  0.77
    k   0.88  0.97  0.67  0.77  0.72  -     0.97  0.97  0.99
    m   0.88  1.00  0.66  0.76  0.71  0.97  -     1.00  0.97
    q   0.88  1.00  0.66  0.76  0.71  0.97  1.00  -     0.97
    a   0.87  0.97  0.67  0.77  0.72  0.99  0.97  0.97  -

Page 21: Evaluation - Efficiency

Faster than Ruxcon 2010.

Median benign processing time is 0.06s.

Median malware processing time is 0.84s.

The slowest result may be due to memory thrashing.

    % Samples   Benign Time(s)   Malware Time(s)
    10          0.02             0.16
    20          0.02             0.28
    30          0.03             0.30
    40          0.03             0.36
    50          0.06             0.84
    60          0.09             0.94
    70          0.13             0.97
    80          0.25             1.03
    90          0.56             1.31
    100         8.06             585.16

Page 22: Conclusion

Improved effectiveness and efficiency compared to Ruxcon 2010.

Runs in real-time in expected case.

Large functional code base and years of development time.

Happy to talk to vendors.

Page 23: Further Information

Full academic paper at IEEE Trustcom.

Research page http://www.foocodechu.com

Book on “Software similarity and classification” available in 2012.

Wiki on software similarity and classification http://www.foocodechu.com/wiki