A Fast Flowgraph Based Classification System for Packed and Polymorphic Malware on the Endhost

Silvio Cesare and Yang Xiang, School of Management and Information Systems, Centre for Intelligent and Networked Systems, Central Queensland University

Upload: silvio-cesare

Post on 29-Nov-2014

DESCRIPTION

Presented at AINA 2010. Full paper available on my home page.

TRANSCRIPT

Page 2: Motivation

Malware, short for malicious software, is hostile, intrusive, or annoying software and program code.

Malware is a significant problem in distributed computer systems and in endhost security.

To prevent malware from causing damage, untrusted programs can be analysed to identify malicious intent before they are allowed to execute.

Many malware families have variants, and detecting previously unknown variants is beneficial.

Page 3: Introduction

Automated malware analysis can be dynamic or static.

Traditional antivirus: Static. Must be efficient and keep pace with users' productivity demands. String signatures based on byte-level content are the dominant approach. Efficient, but not always effective against malware variants.

Polymorphism: Describes malware variants sharing a common history of code. Variants may be generated automatically by code mutation, or created manually by malware authors reusing code. Byte-level content may vary significantly.

Page 4: Introduction (cont.)

Static analysis can provide non-traditional features to characterize malware.

Control flow describes the possible execution paths through a malware program.

Control flow is considered more invariant in polymorphic malware than traditional features.

Malware often hinders control flow analysis and static analysis through code packing. Code packing hides, encrypts, compresses or obfuscates malware. Automated unpacking reverses the obfuscation, and is required for effective malware classification.

Page 5: Our Contribution

We propose an algorithm to identify malware variants by determining program similarity through estimating control flow graph isomorphism.

We implement and evaluate our idea in a novel prototype system.

We demonstrate that the system is fast enough for desktop adoption on the endhost.

Page 6: Related Work

API call sequences.

N-grams and n-perms of byte-level content.

Basic block matching using edit distances, inverted indexes, Bloom filters.

Approximate matching of call graphs.

Approximate matching of control flow graphs.

Our approach is more effective than byte-level approaches, and more efficient than existing flowgraph based systems.

Page 7: The Software Similarity Problem

The software similarity problem is to determine the similarity between programs. Similarity is a real number between 0 and 1: 0 is not at all similar, 1 is identical.

It is calculated by looking at invariant characteristics between programs.

Given a query program, is it malicious? High similarity between the query program and existing malware identifies it as malicious.

This is implemented by performing a range or similarity search of a query program to identify similar neighbours from a malware database.

Our system looks at static software similarity. A similarity >= 0.6 indicates a variant. The 0.6 threshold was chosen through manual and empirical evaluation.

Page 8: System Design and Implementation

Identify if the query program is packed.

Unpack.

Generate control flow graphs.

Generate flowgraph signatures.

Classify: find high similarity between signatures and existing malware.

Update the malware database with variant information.

Page 9: System Design and Implementation

[Figure: Block diagram of the malware classification system. Nodes: Unknown Win32 Executable; Packed? (A); Unpack (B); Generate Signatures (C, D); Unique Signature?; Similar to Database Instance? (E, F); Known Malware from Honeypot?; Store; Malware Database; outcomes Malicious and Non Malicious.]

Page 10: Flowgraph Signatures

A flowgraph signature is defined as the string representing the graph after labelling the nodes using a depth-first traversal of the graph.

This signature, a graph invariant, is used to estimate graph isomorphism by testing signatures for equality.

The signature string can be hashed to allow more efficient searching; we use CRC64.

Normalized weight of a procedure or flowgraph x:

$$\mathit{weight}_x = \frac{B_x}{\sum_i B_i}$$

Similarity ratio between two flowgraphs x and y:

$$w_{ed}(x, y) = \begin{cases} 1, & x = y \\ 0, & x \neq y \end{cases}$$
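The two definitions on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide does not define B, so the sketch assumes B_x is the number of basic blocks in flowgraph x, and the function names `normalized_weights` and `w_ed` are made up for the example.

```python
def normalized_weights(block_counts):
    """Map each flowgraph name to its share of the program's total size.

    block_counts: {flowgraph_name: B_x}, where B_x is ASSUMED to be the
    number of basic blocks in that flowgraph.
    """
    total = sum(block_counts.values())
    return {name: count / total for name, count in block_counts.items()}


def w_ed(signature_x, signature_y):
    """Similarity ratio between two flowgraphs: 1 if signatures are equal, else 0."""
    return 1.0 if signature_x == signature_y else 0.0


# Hypothetical program with two procedures of 2 and 6 basic blocks.
weights = normalized_weights({"f": 2, "g": 6})
```

Because the weights sum to 1 within a program, larger procedures contribute more to any similarity built from them.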

Page 11: Flowgraph Signatures

[Figure: a four-node flowgraph labelled in depth-first order. Node 1 branches to node 2 (T) and node 4 (F); nodes 2 and 4 each continue to node 3 (T).]

(1 -> 2), (1 -> 4)(2 -> 3), ()(), ()(4 -> 3), ()

A depth-first ordered flowgraph and its signature.
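A minimal sketch of how such a signature could be generated, reproducing the example string above. It rests on assumptions not stated on the slide: every node has at most one true and one false successor, the true branch is explored first, and each node contributes "(true edge), (false edge)" to the string. The names `flowgraph_signature` and `signature_hash` are illustrative, and because the Python standard library has no CRC64, a 64-bit BLAKE2b digest stands in for the hash.

```python
import hashlib


def flowgraph_signature(cfg, entry):
    """cfg maps node -> (true_successor, false_successor); None means no edge."""
    labels = {}   # original node id -> depth-first label, starting at 1
    order = []    # nodes in the order they were labelled

    def visit(node):
        if node is None or node in labels:
            return
        labels[node] = len(labels) + 1
        order.append(node)
        true_succ, false_succ = cfg[node]
        visit(true_succ)    # assumption: true branch is explored first
        visit(false_succ)

    visit(entry)

    def edge(src, dst):
        return f"({labels[src]} -> {labels[dst]})" if dst is not None else "()"

    return "".join(f"{edge(n, cfg[n][0])}, {edge(n, cfg[n][1])}" for n in order)


def signature_hash(signature):
    """64-bit digest of the signature (stand-in for CRC64)."""
    return hashlib.blake2b(signature.encode("utf-8"), digest_size=8).hexdigest()


# The four-node example: a branch whose two arms rejoin at a common exit.
example_cfg = {
    "entry": ("then", "else"),
    "then": ("exit", None),
    "else": ("exit", None),
    "exit": (None, None),
}
signature = flowgraph_signature(example_cfg, "entry")
```

Since the labels come from the traversal rather than from addresses or names, two procedures with the same control flow shape produce the same string even when their byte-level content differs.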

Page 12: Malware Classification

The Dice coefficient is a measure of similarity between two sets. We represent a program as a set of control flow graph signatures and use the weighted Dice coefficient to show similarity between programs:

$$s(A, B) = \frac{2 \sum_{x_i \in A \cap B} w(x_i)}{\sum_{x_i \in A} w(x_i) + \sum_{x_i \in B} w(x_i)}$$

The weights have been normalized, so the equation simplifies to the sum of the weights of the flowgraphs common to both sets. We define the asymmetric similarity as:

$$S_x = \sum_i w_{ed_i} \, \mathit{weight}_{x_i}$$

Two sets of weights are possible, representing either the query or the database weight.

Program similarity:

$$S(a, b) = S_a \, S_b$$
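The similarity computation on this slide can be sketched as follows. The reading assumed here is that each asymmetric similarity sums the normalized weights of the flowgraphs whose signatures match exactly, once with the query's weights and once with the database program's weights, and that program similarity is the product of the two. The function names are hypothetical, and representing a program as a dict keyed by signature is a simplification that ignores duplicate signatures within one program. The 0.6 threshold is the one given earlier in the talk.

```python
VARIANT_THRESHOLD = 0.6  # from the talk: similarity >= 0.6 indicates a variant


def asymmetric_similarity(weights, matched_signatures):
    """Sum of this program's normalized weights over the matched flowgraphs."""
    return sum(w for sig, w in weights.items() if sig in matched_signatures)


def program_similarity(query, database_program):
    """Both arguments: {signature: normalized weight within that program}."""
    common = query.keys() & database_program.keys()
    s_query = asymmetric_similarity(query, common)
    s_database = asymmetric_similarity(database_program, common)
    return s_query * s_database


def is_variant(query, database_program, threshold=VARIANT_THRESHOLD):
    return program_similarity(query, database_program) >= threshold


# Hypothetical programs sharing only the flowgraph with signature "s1".
query = {"s1": 0.5, "s2": 0.5}
known = {"s1": 0.75, "s3": 0.25}
similarity = program_similarity(query, known)
```

Taking the product means a small program fully contained in a large one does not score as highly similar, because the larger program's side of the product stays low.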

Page 13: Improving Performance in Malware Classification

To improve performance, we do not evaluate the program similarity function linearly or exhaustively against each malware in the database.

We propose a novel algorithm to search the entire database for sets similar to the query.

Iterate through the query program's procedures.

Find each procedure's matching flowgraphs and malware from the database.

Build the asymmetric similarities incrementally.

Process unique or non-matching flowgraphs first.

Prune low similarity objects, then process the remaining flowgraphs.
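The steps above can be sketched roughly as below. This is one possible interpretation, not the system's actual algorithm: the index layout, the ordering heuristic (fewest database matches first), and the pruning bound are all assumptions made for the illustration. The bound relies on the product form of program similarity, which can never exceed the query-side asymmetric similarity, so a candidate is dropped once its accumulated query-side weight plus the unprocessed query weight falls below the threshold.

```python
from collections import defaultdict

EPS = 1e-9  # tolerance for floating-point accumulation


def search(query, index, threshold=0.6):
    """query: {signature: weight in query}
    index: {signature: {malware_name: weight in that malware}} (hypothetical layout)
    Returns {malware_name: similarity} for candidates at or above the threshold."""
    # Process flowgraphs with the fewest database matches first.
    ordered = sorted(query, key=lambda sig: len(index.get(sig, {})))
    remaining = 1.0                 # query weight not yet processed
    s_query = defaultdict(float)    # query-side asymmetric similarity per malware
    s_db = defaultdict(float)       # database-side asymmetric similarity per malware

    for sig in ordered:
        w_query = query[sig]
        for name, w_db in index.get(sig, {}).items():
            # A malware first seen now can reach at most `remaining`.
            if name not in s_query and remaining < threshold - EPS:
                continue
            s_query[name] += w_query
            s_db[name] += w_db
        remaining -= w_query
        # Prune candidates that can no longer reach the threshold.
        doomed = [n for n, s in s_query.items() if s + remaining < threshold - EPS]
        for name in doomed:
            del s_query[name]
            del s_db[name]

    results = {name: s_query[name] * s_db[name] for name in s_query}
    return {name: s for name, s in results.items() if s >= threshold - EPS}


# Hypothetical database index and query.
index = {
    "s1": {"klez.a": 0.5, "klez.b": 0.5},
    "s2": {"klez.a": 0.25},
    "s3": {"klez.a": 0.25, "roron.x": 1.0},
}
query = {"s1": 0.5, "s2": 0.25, "s3": 0.25}
matches = search(query, index)
```

In this toy run `klez.b` is pruned after the last flowgraph and `roron.x` is never admitted, so only the full match survives.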

Page 14: Analysis

Expected time to classify a query is O(N log M), where N is the number of procedures (control flow graphs) in the query and M is the flowgraph database size.

Worst-case time is O(N log M + A N^2), where A is the number of malware highly similar to the query.

In previous literature on approximate call graph matching, pairwise similarity complexity is O(N^3). Searching the database used metric trees with logarithmic search time, but with growth also exponential in the dimensionality of the objects.

Binary trees in our system have more predictable and efficient performance.
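The N log M term corresponds to one ordered lookup per query flowgraph. A rough sketch of that cost model, assuming signature hashes are integers: Python's standard library has no balanced binary tree, so a sorted list searched with `bisect` stands in for it here, giving the same logarithmic lookup. The helper names are illustrative.

```python
import bisect


def build_index(hashes):
    """Sorted, de-duplicated list of signature hashes (stand-in for a binary tree)."""
    return sorted(set(hashes))


def contains(sorted_hashes, h):
    """Binary search: O(log M) comparisons for a database of M hashes."""
    i = bisect.bisect_left(sorted_hashes, h)
    return i < len(sorted_hashes) and sorted_hashes[i] == h


def count_matches(query_hashes, sorted_hashes):
    """N lookups of O(log M) each, i.e. O(N log M) overall."""
    return sum(contains(sorted_hashes, h) for h in query_hashes)


database = build_index([42, 7, 19, 7])
hits = count_matches([7, 8, 42], database)
```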

Page 15: Evaluation - Effectiveness

Similarity matrices for malware families: klez, netsky, roron.

klez
      a     b     c     d     g     h
a     -     0.76  0.82  0.69  0.52  0.51
b     0.76  -     0.83  0.80  0.52  0.51
c     0.82  0.83  -     0.69  0.51  0.51
d     0.69  0.80  0.69  -     0.51  0.50
g     0.52  0.52  0.51  0.51  -     0.85
h     0.51  0.51  0.51  0.50  0.85  -

netsky
      aa    ac    f     j     p     t     x     y
aa    -     0.74  0.59  0.67  0.49  0.72  0.50  0.83
ac    0.74  -     0.69  0.78  0.40  0.55  0.37  0.63
f     0.59  0.69  -     0.88  0.44  0.61  0.41  0.70
j     0.67  0.78  0.88  -     0.49  0.69  0.46  0.79
p     0.49  0.40  0.44  0.49  -     0.68  0.85  0.58
t     0.72  0.55  0.61  0.69  0.68  -     0.63  0.86
x     0.50  0.37  0.41  0.46  0.85  0.63  -     0.54
y     0.83  0.63  0.70  0.79  0.58  0.86  0.54  -

roron
      ao    b     d     e     g     k     m     q     a
ao    -     0.44  0.28  0.27  0.28  0.55  0.44  0.44  0.47
b     0.44  -     0.27  0.27  0.27  0.51  1.00  1.00  0.58
d     0.28  0.27  -     0.48  0.56  0.27  0.27  0.27  0.27
e     0.27  0.27  0.48  -     0.59  0.27  0.27  0.27  0.27
g     0.28  0.27  0.56  0.59  -     0.27  0.27  0.27  0.27
k     0.55  0.51  0.27  0.27  0.27  -     0.51  0.51  0.75
m     0.44  1.00  0.27  0.27  0.27  0.51  -     1.00  0.58
q     0.44  1.00  0.27  0.27  0.27  0.51  1.00  -     0.58
a     0.47  0.58  0.27  0.27  0.27  0.75  0.58  0.58  -

Page 16: Evaluation - Efficiency

Malware processing time.

Time (s)   Num. of samples
0-1        299
1-2        401
2-3        46
3-4        30
4-5        32
5+         1

Benign processing time.

Time (s)   Num. of samples
0.0        0
0.1        139
0.2        80
0.3        42
0.4        28
0.5        10
0.6        10
0.7        3
0.8        6
0.9        5
1-2        17
2+         6

Page 17: Evaluation - Scalability

Scalability.

Database size   1000   2000   4000   8000   16000   32000   64000
Time (ms)       < 1    < 1    < 1    < 1    < 1     < 1     < 1

Page 18: Evaluation - Accuracy

False positive evaluation.

Similarity   Matches (approx.)   Matches (exact)
0.0          105497              97791
0.1          2268                1598
0.2          637                 532
0.3          342                 324
0.4          199                 175
0.5          121                 122
0.6          44                  34
0.7          72                  24
0.8          24                  22
0.9          20                  12
1.0          6                   0

Similarity matrix for non-similar programs.

           cmd.exe    calc.exe   netsky.aa  klez.a     roron.ao
cmd.exe    -          0.00       0.00                  0.00
calc.exe   0.00       -          0.00       0.00       0.00
netsky.aa  0.00       0.00       -          0.15       0.09
klez.a                0.00       0.15       -          0.13
roron.ao   0.00       0.00       0.09       0.13       -

Page 19: Limitations

Disassembly and control flow reconstruction of an obfuscated program is an undecidable problem.

In practice, analysis is possible because malware is obfuscated using packing.

However, automated unpacking using application level emulation is detectable.

Packing using instruction virtualization is also resistant to automated unpacking.

Page 20: Conclusion

Malware variants can be detected based on similarity in their control flow.

We proposed estimating isomorphic control flow graphs using graph invariants.

We implemented this approach in a prototype system.

Our system was able to detect real malware variants.

It was resilient to false positives, and had logarithmic performance in the expected case.

It was shown to have suitable performance for use on the endhost.