sas: semantics aware signature generation for polymorphic worm detection

15
Int. J. Inf. Secur. (2011) 10:269–283 DOI 10.1007/s10207-011-0132-7 SPECIAL ISSUE PAPER SAS: semantics aware signature generation for polymorphic worm detection Deguang Kong · Yoon-Chan Jhi · Tao Gong · Sencun Zhu · Peng Liu · Hongsheng Xi Published online: 22 May 2011 © Springer-Verlag 2011 Abstract String extraction and matching techniques have been widely used in generating signatures for worm detec- tion, but how to generate effective worm signatures in an adversarial environment still remains a challenging problem. For example, attackers can freely manipulate byte distribu- tions within the attack payloads and thus inject well-crafted noisy packets to contaminate the suspicious flow pool. To address these attacks, we propose SAS, a novel Semantics Aware Statistical algorithm for automatic signature genera- tion. When SAS processes packets in a suspicious flow pool, it uses data flow analysis techniques to remove non-critical bytes. We then apply a hidden Markov model (HMM) to the refined data to generate state-transition-graph-based signa- tures. To our best knowledge, this is the first work combining semantic analysis with statistical analysis to automatically generate worm signatures. Our experiments show that the proposed technique can accurately detect worms with con- D. Kong (B ) · P. Liu College of Information Science and Technology, The Pennsylvania State University, University Park, PA, USA e-mail: [email protected] P. Liu e-mail: [email protected] Y.-C. Jhi · S. Zhu Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA e-mail: [email protected] S. Zhu e-mail: [email protected] T. Gong · H. Xi Department of Automation, University of Science and Technology of China, Hefei, China e-mail: [email protected] H. Xi e-mail: [email protected] cise signatures. Moreover, our results indicate that SAS is more robust to the byte distribution changes and noise injec- tion attacks compared to Polygraph and Hamsa. Keywords Worm signature generation · Machine learning · Semantics · Data flow analysis · Hidden Markov model 1 Introduction Computer worm is a great threat to modern network security despite various techniques that have been proposed so far. To thwart worms spreading out over Internet, pattern-based sig- natures have been widely adopted in many network intrusion detection systems; however, existing signature-based tech- niques are facing fundamental countermeasures. Polymor- phic and metamorphic worms (for brevity, hereafter, we mean both polymorphic and metamorphic when we say polymor- phic) can evade traditional signature-based detection meth- ods by either eliminating or reducing invariant patterns in the attack payloads through attack-side obfuscation. In addition, traditional signature-based detection methods are forced to learn worm signatures in an adversarial environment where the attackers can intentionally inject indistinguishable noisy packets to mislead the classifier of the malicious traffic. As a result, low-quality signatures would be generated. Although a lot of efforts have been made to detect polymorphic worms [27], existing defenses are still lim- ited in terms of accuracy and efficiency. To see the limi- tations in detail, let us divide existing techniques against polymorphic worms into two categories. The first type of approach is the pattern-based signature generation, which uses patterns to identify the worm traffic from the nor- mal traffic. Here, the pattern used for a signature is actu- ally the invariant part appeared in lots of malicious worm 123

Upload: deguang-kong

Post on 15-Jul-2016

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SAS: semantics aware signature generation for polymorphic worm detection

Int. J. Inf. Secur. (2011) 10:269–283DOI 10.1007/s10207-011-0132-7

SPECIAL ISSUE PAPER

SAS: semantics aware signature generation for polymorphicworm detection

Deguang Kong · Yoon-Chan Jhi · Tao Gong ·Sencun Zhu · Peng Liu · Hongsheng Xi

Published online: 22 May 2011© Springer-Verlag 2011

Abstract String extraction and matching techniques havebeen widely used in generating signatures for worm detec-tion, but how to generate effective worm signatures in anadversarial environment still remains a challenging problem.For example, attackers can freely manipulate byte distribu-tions within the attack payloads and thus inject well-craftednoisy packets to contaminate the suspicious flow pool. Toaddress these attacks, we propose SAS, a novel SemanticsAware Statistical algorithm for automatic signature genera-tion. When SAS processes packets in a suspicious flow pool,it uses data flow analysis techniques to remove non-criticalbytes. We then apply a hidden Markov model (HMM) to therefined data to generate state-transition-graph-based signa-tures. To our best knowledge, this is the first work combiningsemantic analysis with statistical analysis to automaticallygenerate worm signatures. Our experiments show that theproposed technique can accurately detect worms with con-

D. Kong (B) · P. LiuCollege of Information Science and Technology,The Pennsylvania State University, University Park, PA, USAe-mail: [email protected]

P. Liue-mail: [email protected]

Y.-C. Jhi · S. ZhuDepartment of Computer Science and Engineering,The Pennsylvania State University, University Park, PA, USAe-mail: [email protected]

S. Zhue-mail: [email protected]

T. Gong · H. XiDepartment of Automation, University of Science and Technologyof China, Hefei, Chinae-mail: [email protected]

H. Xie-mail: [email protected]

cise signatures. Moreover, our results indicate that SAS ismore robust to the byte distribution changes and noise injec-tion attacks compared to Polygraph and Hamsa.

Keywords Worm signature generation ·Machine learning ·Semantics · Data flow analysis · Hidden Markov model

1 Introduction

Computer worm is a great threat to modern network securitydespite various techniques that have been proposed so far. Tothwart worms spreading out over Internet, pattern-based sig-natures have been widely adopted in many network intrusiondetection systems; however, existing signature-based tech-niques are facing fundamental countermeasures. Polymor-phic and metamorphic worms (for brevity, hereafter, we meanboth polymorphic and metamorphic when we say polymor-phic) can evade traditional signature-based detection meth-ods by either eliminating or reducing invariant patterns in theattack payloads through attack-side obfuscation. In addition,traditional signature-based detection methods are forced tolearn worm signatures in an adversarial environment wherethe attackers can intentionally inject indistinguishable noisypackets to mislead the classifier of the malicious traffic. As aresult, low-quality signatures would be generated.

Although a lot of efforts have been made to detectpolymorphic worms [27], existing defenses are still lim-ited in terms of accuracy and efficiency. To see the limi-tations in detail, let us divide existing techniques againstpolymorphic worms into two categories. The first type ofapproach is the pattern-based signature generation, whichuses patterns to identify the worm traffic from the nor-mal traffic. Here, the pattern used for a signature is actu-ally the invariant part appeared in lots of malicious worm

123

Page 2: SAS: semantics aware signature generation for polymorphic worm detection

270 D. Kong et al.

traffic, such as substring or token sequence. For exam-ple, systems such as Autograph [16], Honeycomb [17],EarlyBird [35], Polygraph [28], and Hamsa [21] extractcommon byte patterns from the packets collected in thesuspicious flow pool. This approach enables fast analy-sis on live traffic, but can be evaded by polymorphicworms since the instances of a well-crafted polymorphicworm could share few or no syntactic patterns in com-mon. Moreover, such a syntactic signature generation pro-cess can be misled by the allergy attack [8], the redherring and the pool positioning attacks [29], and alsoby the noisy packets injected into the suspicious flowpool [32].

The second approach is to identify the semantics-derivedcharacteristics of worm payloads, as in Cover [23], Taint-Check [30], ABROR [22], Sigfree [41], Spector [4], andSTILL [40]. Existing techniques in this approach per-form static analysis and/or dynamic analysis (e.g., emu-lation-based analysis [5,19]) on the packet payloads todetect the invariant characteristics reflecting semantics ofmalicious codes (e.g., behavioral characteristics of thedecryption routine of a polymorphic worm). The detec-tion of the malicious binary code happens within or beforethe process of these code is actually executed, and thecore techniques used include taint flow analysis [30],OS extension [15] and compiler extension [36], hardwaremodifications [12]. These approaches usually have verystrong detection capabilities compared with signature-basedapproaches, but they are very heavylight and may not detectthe malicious code timely. The non-trivial performance over-heads introduced by semantics analysis [2] make it intol-erable in network-based online detection. What is more,for semantic analysis, the payload analysis could be hin-dered by anti-static techniques [40] or anti-emulation tech-niques [3,38].

In this paper, we focus on the polymorphic worms thatcan be locally or remotely injected using the HTTP pro-tocol. To generate high-quality signatures of such worms,we propose SAS, a novel Semantics Aware Statistical algo-rithm that generates semantic-aware signatures automati-cally. Compared with previous approaches, our techniqueis robust to the above evasion techniques that are designed todefeat the signature-based approaches because it considersmore about semantics. Our technique aims at a novel signa-ture that is more robust than the pattern-based signaturesand lighter than the prior behavior-based detection meth-ods. SAS introduces low overhead in its signature match-ing process; thus, it is suitable for the network-based wormdetection. When SAS processes packets in the suspiciousflow pool, it uses data flow analysis techniques to removenon-critical bytes irrelevant to the semantics of the wormcode. We then apply a hidden Markov model (HMM) to therefined data to generate our state-transition-graph (STG)-

based signatures. Since modern polymorphic engines cancompletely randomize both the encrypted shellcode and thedecryptor, we use a probability STG signature to defeatthe absence of syntactic invariants. STG, as a probabilitysignature, can adaptively learn token changes in differentpackets, correlate token distributions with states, and clearlyexpress the dependence among tokens in packet payloads.Besides this, after a signature is generated, the detectoris free of making sophisticated semantic analysis, such asemulating executions of instructions on the incoming pack-ets to match attacks. Our experiments show that our tech-nique exhibits good performance with low false positivesand false negatives, especially when attackers can indistin-guishably inject noisy bytes to mislead the signature extrac-tor. SAS places itself between the pattern-based signaturesand the semantic-derived detection methods, by balancingbetween security and the signature matching speed. As asemantic-based technique, SAS is more robust than mostpattern-based signatures, sacrificing a little speed in signa-ture matching. Based on the statistical analysis, SAS mightsacrifice subtle part of security benefits of in-depth semanticanalysis, for which SAS gains enough acceleration to be anetwork-based IDS. Note that we also show a way to boostthe knowledge from the semantic analysis to instruct theautomatic signature generation process, and our approachhas the potential to further advance the “intelligence” toincrease the analysis accuracy and free people of compli-cated analysis.

Our contributions in this work are in threefold.

– To our best knowledge, our work is the first one couplingsemantic analysis (by introducing static analysis) withstatistical analysis in signature generation process. As aresult, the proposed technique is robust to the (crafted)noisy packets and the noisy bytes.

– We present a state-transition-graph-based method to rep-resent different byte distributions in different states.We explore semantics-derived characteristics beyond thebyte patterns in packets.

– The signature matching algorithm used in our techniqueintroduces low overhead, so that we can apply SAS as anetwork-based worm detection system.

The rest of this paper is organized as follows. In Sect. 2,we summarize the attacks to prior automated signature gener-ation techniques. We then present our semantics aware poly-morphic worm detection technique in Sect. 3. The evalua-tion results are given in Sect. 4. After this, we review therelated works and discuss the advantages and limitations ofSAS in Sects. 5 and 6. At last, we draw the conclusion inSect. 7.

123

Page 3: SAS: semantics aware signature generation for polymorphic worm detection

SAS: semantics aware signature generation for polymorphic worm detection 271

2 Attacks on signature generation

2.1 Techniques to evade detection

Metamorphism and polymorphism are two typical techniquesto obfuscate the malicious payload to evade the detection.Metamorphism [9] uses instruction replacement, equivalentsemantics, instruction reordering, garbage (e.g., NOP) inser-tion, and/or register renaming to evade signature-based detec-tors. Polymorphism [9] usually uses a built-in encoder toencrypt original shellcode and stores the encrypted shell-code and a decryption routine in the payload. The encryptedshellcode will be decrypted during its execution time at a vic-tim site. The decryption routine can be further obfuscated bymetamorphic techniques; the attack code generated by poly-morphic engine TAPION [10] is such an example. We notethat traditional signature-based detection algorithm is easilyto be misled by applying byte substitution or reordering. Wealso doubt whether the invariants always exist in all the mali-cious traffic flows. In fact, we find that for the instances ofthe polymorphic worm Slammer [34] mutated by the CLETpolymorphic engine, the only invariant token (byte) in all ofits mutations is “\x04”, which is commonly found in all SQLname resolution requests.

2.2 Techniques to mislead signature generation

Besides the obfuscation techniques that aim to cause falsenegatives in signature matching, there are also techniquesattempting to introduce false positives and false signatures.For example, the allergy attack [8] is a denial of service(DoS) attack that misleads automatic signature generationsystems to generate signatures matching normal traffic flows.Signature generation systems such as Polygraph [28] andHamsa [21] include a flow classifier module and a signaturegeneration module. The flow classifier module separates thenetwork traffic flows during training period into two pools,the innocuous pool and the suspicious pool. The signaturegeneration module extracts signatures from the suspiciousflow pool. A signature consists of tokens, where each tokenis a byte sequence found across all the malicious packets thatthe signature is targeting. The goal of a signature genera-tion algorithm is to generate signatures that match the max-imum fraction of network flows in the suspicious flow poolwhile matching the minimum fraction of network flows in

the innocuous pool. Generally, existing signature generationsystems have two limitations. First, the flow classifier moduleis not perfect; thus, noise can be introduced into the suspi-cious flow pool. Second, in reality, the suspicious flow pooloften contains more than one type of worms; thus, a cluster-ing algorithm is needed to first cluster the flows that containthe same type of worm. Polygraph [28] uses a hierarchicalclustering algorithm to merge flows to generate a signaturethat introduces the lowest false-positive rate at every step ofclustering process. Hasma [21] uses a model-based greedysignature generation algorithm to select those tokens as asignature which has the highest coverage over the suspiciousflow pool.

Let us illustrate the vulnerability of signature genera-tors such as Polygraph and Hamsa when crafted noises areinjected in the training traffic as shown in Fig. 1. Here, Ni

denotes normal packets, and Wi (1 ≤ i ≤ 2) denotes thetrue worm packets. Let us assume the malicious invariant(i.e., the true signature) in the worm packets consists of twoindependent tokens ta and tb, and each of them has the samefalse-positive rate p (0 < p < 1) if taken as a signature. Letthe worm packets also include the tokens ti j (1 ≤ j ≤ 3),each of which has the same false-positive rate p as a tokenin a true signature; thus, an attacker can craft normal packetsNi s to contain ti j (1 ≤ j ≤ 3). If all these four flows endup being included in a suspicious flow pool, the signaturegeneration process would be misled.

Setting 1: Let the ratio of the four flows (W1, W2, N1, N2)

in the suspicious flow pool be (99:99:1:1). That is, there isonly 1% noise in the suspicious flow pool. According to theclustering algorithm in Polygraph, it will choose to mergethe flows that will generate a signature that has the lowestfalse-positive rate. In this example shown in Fig. 1, the false-positive rate of using signature (ti1, ti2, ti3) by merging flows(Wi , Ni ) is p3 and that of signature (ta, tb) by merging flows(W1, W2) is p2. The former is smaller and thus, the hierar-chical clustering algorithm will merge the flows of Wi withNi , and it will terminate with two signatures (ti1, ti2, ti3).

Setting 2: Let the ratio of the flows (W1, W2, N1, N2)

in the suspicious flow pool be (99:99:100:100). Accordingto Hasma’s model-based greedy signature generation algo-rithm, Hamsa selects the tokens with the highest coveragein the suspicious flow pool. In our example, the coveragesfor signature (ti1, ti2, ti3), and (ta, tb) are 50% and 49.7%,respectively. Hense and Hamsa first selects token (ti1), then

Fig. 1 Suspicious packetflow pool t11 t12 t13

tat22 t23 t21

t11 t12 t13

t22 t23 t21

N1

N2

W1

W2

True invariant t1j t2j Fake invariant

ta

ta

tb

tb

tb

123

Page 4: SAS: semantics aware signature generation for polymorphic worm detection

272 D. Kong et al.

(ti1, ti2), and (ti1, ti2, ti3) as a signature as long as the false-positive rate of signature (ti1, ti2, ti3) is below a threshold.

From the above two cases, we can clearly see that if anattacker can inject “proper” noises into the suspicious flowpool, the wrong signatures would be generated.

3 Our approach

3.1 Why STG-based signature can help?

The fake blending packets mixed in a suspicious flow poolusually do not have many useful instruction code embed-ded in the packet unless they are truly worms. It is foundthat byte sequences that look like code sequences are highlylikely to be dummies (or data) if the containing packethas no code implying function calls [41]. We use seman-tic analysis to filter those “noisy” padding and substitutionbytes to improve the signature quality. Under some condi-tions, the suspicious flow pool can contain no invariants ifwe compute the frequency of each token by simply count-ing them. We find that the distributions of different tokensare influenced by the positions of the tokens in packets,which are instruction-level exhibitions of semantics andsyntax of the packets. In order to capture such seman-tics, we use different states to express different token dis-tributions in different positions in the packets. It is morerobust to the token changes in different positions of thepackets, which correlates the tokens’ distributions with astate, making token dependency relationships clear. Oneissue we want to emphasize here is that different frombut not contrary to the claim in [37], our model is basedon the remaining code extracted from the whole packetsinstead of on the whole worm packets. Our input is free ofnoises compared with their approach before constructing themodel.

3.2 System overview

In Fig. 2, we describe the framework of our approach. Ourframework consists of two phases, semantic-aware signa-ture extraction phase and semantic-aware signature match-ing phase. The signature extraction phase consists of fivemodules: payload extraction, payload disassembly, usefulinstruction distilling, clustering, and signature generation.The signature matching phase is comprised of two mod-ules: payload extraction and signature matching module.Payload extraction module extracts the payload that pos-sibly implements the malicious intent, from a flow whichis a set of packets forming a message. For example, ina HTTP request message, a malicious payload only existsin Request-URI and Request-Body of the whole flow. Weextract these two parts from the HTTP flows for furthersemantics analysis. Disassembly module disassembles aninput byte sequence. If it finds consecutive instructions inthe input sequence, it generates a disassembled instructionsequence as output. An instruction sequence is a sequenceof CPU instructions, which has only one entry point. Avalid instruction sequence should have at least one executionpath from the entry point to another instruction within thesequence. Since we do not know the entry point of the codewhen the code is present in the byte sequences, we exploitan improved recursive traversal disassembly algorithm intro-duced by Wang et al. [41] to disassemble the input. For anN -byte sequence, the time complexity of this algorithm isO(N ). Useful instruction distilling module extracts usefulinstructions from the instruction sequences. Useless instruc-tions are identified and pruned by control flow and data flowanalysis. Payload clustering module clusters the payloadscontaining similar set of useful instructions together. Signa-ture generation module computes STG-based signaturesfrom the payload clusters. Upon completion of training, Sig-nature matching module starts detecting worm packets by

Payload Extraction

Disassemble

Useful instruction distilling

Signature generation

New Traffic Flow

Useful instructions

Useful instructions

Useful instructions

Useful instructions

Suspicious Flow PoolSuspicious Flow Pool

State-transition-graph signature

BLOCK or PASS

Clustering

Signature matching

State-transition-graph signature

State-transition-graph signature

New Traffic Flow

New Traffic Flow

Suspicious Flow Pool

Suspicious Flow Pool

S0: Control Flow

Change

S3:GetPC

S2:Iteration

S1: Decryption

(b)(a)

Fig. 2 a System architecture. b State-transition-graph (STG) model

123

Page 5: SAS: semantics aware signature generation for polymorphic worm detection

SAS: semantics aware signature generation for polymorphic worm detection 273

matching STG signatures against input packets. Shortly, wewill discuss these four modules in detail.

3.3 Useful instruction extraction

The disassembly module generates zero, one, or multipleinstruction sequences, which do not necessarily correspondto real code. From the output of the disassembly module,we distill useful instructions by pruning useless instructions.Useless instructions are those illegal and redundant bytesequences using the technique introduced in SigFree [41].We use the code abstraction technique to emulate theexecutions of the instructions. There are possibly 6 statesin the state transition graph. State U represents unde-fined variable state; state D represents defined but notreferenced variable state, and state R represents defined andreferenced variable state. The other three abnormal states aredefined like this, state DD represents abnormal state define-define, state UR represents abnormal state undefine-refer-ence, and state DU represents abnormal state define-undefine.Basically, the pruned useless byte sequences correspond tothree kinds of dataflow anomalies: UR, DD, DU. When thereis an undefine-reference anomaly (i.e., a variable is refer-enced before it is ever assigned with a value) in an executionpath, the instruction that causes the “reference” is a uselessinstruction. When there is a define-define anomaly (i.e., avariable is assigned a value twice) or define-undefine anom-aly (i.e., a defined variable is later set by an undefined vari-able), the instruction that caused the former “define” is alsoconsidered as a useless instruction. Since normal packets andcrafted noisy packets typically do not contain useful instruc-tions, such packets injected in the suspicious flow pool arefiltered out after the useful instruction extraction phase. Theremaining instructions are likely to be related to the semanticsof the code contained in the suspicious packets. An exampleof polymorphic payload analysis is shown in Fig. 3. Here, theleftmost part is the original packet content in binary, the mid-dle one is the disassembly code of the useful instructions afterremoving the useless one, and the rightmost part is its cor-responding binaries. For example, in Fig. 3, the disassembly

code inc edi appeared in address 350 is pruned because ediis referenced without being defined to produce an undefine-reference anomaly.

Next, we want to further explain the reasons why wewant to extract the useful instructions from the payload.From our observations, certain amounts of “useful instruc-tions” are left invariant across the different polymorphicworm instances even after performing the obfuscations like“junk insertion” or “instruction replacement”. For normalHTTP requests, the byte sequences may still be assembledinto code sequences. However, these coincidence instructionsequences are pruned as useless instructions after rigorousdata flow anomaly analysis above.

3.4 Payload clustering

The useful instruction sequences extracted from polymorphicworms normally contain the following features: (F1) Get-PC: Code to get the current program counter. GetPC codeshould contain opcode “call” or “fstenv.” We explain therationale shortly; (F2) Iteration: Obviously, a polymorphicworm needs to perform iterations over encrypted shellcode.The instructions that can characterize this feature includeloop, rep, and the variants of such instructions (e.g., loopz,loope, loopnz); (F3) Jump: A polymorphic code is highlylikely to contain conditional/unconditional branches (e.g.,jmp, jnz, je); (F4) Decryption: Since the shellcode of a poly-morphic worm is encrypted when it is sent to a victim,a polymorphic worm should decrypt the shellcode duringor before execution. We note that certain machine instruc-tions (e.g., or, xor) are more often found in decryption rou-tine. The reason why we use these four features here is thatfrom our observations, nearly all self-modifying polymor-phic worm packets contain such features even after com-plicated obfuscations. For the non-self-contained code [19],these features may not hold for the engines in the wild (e.g.,Avoid UTF8/tolower) because they are absence of GetPC andself-reference operations. In this case, we should boost othersemantics for automatic signature generation process in ourfuture work.

Fig. 3 a Original packetcontents. b Useful instructions(assembly code). c Usefulinstructions (binary code)

(b) (c)(a)

123

Page 6: SAS: semantics aware signature generation for polymorphic worm detection

274 D. Kong et al.

A decryption routine needs to read and write the encryptedcode in the payload; therefore, a polymorphic worm needs toknow where the payload is loaded in the memory. To our bestknowledge, the only way for a shellcode to get the absoluteaddress of the payload is to read the PC (Program Counter)register [40]. Since the IA-32 architecture does not provideany instructions to directly access PC, attackers have to play atrick to obtain the value in the PC register. As far as we know,currently, three methods are known in the attacker commu-nity: one method uses fstenv and the other two use relativecalls to figure out the values in PC.

In a suspicious flow pool, there are normally multipletypes of worm packets. For a given packet, we first extractthe instructions indicating each of the four features of poly-morphic worms. However, simply counting such instructionsis not sufficient to characterize a polymorphic shellcode. Inreality, some features may appear multiple times in a spe-cific worm instance, while some others may not appear atall. This makes it complicated for us to match a worm signa-ture to a polymorphic worm code. If we measure the similar-ity between a signature and a worm code based on the baresequence of the feature-identifying instructions, an attackermay evade our detection by distributing dummy features indifferent byte positions within the payload or by reorder-ing instructions in the execution path. On the other hand,if we ignore the structural (or sequent) order of the feature-identifying instruction and consider them as a histogram,it might result in an inaccurate detection. So in this work,we consider both of the structural information and statisticalinformation in packet classification and use a parameter δ

to balance between them. In the setting of our experiment(Sect. 4.2), we find when δ = 0.8, all the packets are clus-tered correctly.

Specifically, we define two types of distances: (D1) thefeature distance and (D2) the histogram distance. The firstfeature distance is used to measure the distance in the sequen-tial order, and the second histogram distance is used to mea-sure the distance by the cardinality of each feature vector.In more detail, we keep the sequent order of the featuresappearing in an instruction sequence, in a feature vector. LetD1(v1, v2) denote the feature distance between two featurevectors v1, v2. When v1 and v2 are of the same length, wedefine D1(v1, v2) as the Hamming distance of v1 and v2. Forexample, the feature vector of the instruction sequence shownin Fig. 3 is S = {F3, F4, F2, F1}. Given another feature vec-tor S′ = {F3, F4, F1, F1}, the distance between S and S′ iscomputed as D1(S, S′) = 1. When two feature vectors areof different lengths, we define the distance of the two fea-ture vectors as D1(v1, v2) = max(length(v1), length(v2))−LLCS(v1, v2), where LLCS(v1, v2) denotes the length of thelongest common subsequence of v1 and v2 and length(v1)

denotes the length of v1. For example, if we are givenS′′ = {F3, F4, F1, F3, F1}, distance D1(S, S′′) = 1. We also

measure the histogram distance, the similarity based on thehistograms of two feature vectors. Let D2(v1, v2) denotesthe histogram distance between two feature vectors v1, v2.For example, the histogram of S above is (1, 1, 1, 1) becauseevery feature appears exactly once. Let us assume that thehistogram of feature vector S′ is given as (1, 2, 0, 1). Then,we define D2(S, S′) as the Hamming distance of S and S′,which is 2.

Given two useful instruction sequences, we use both D1

and D2 to determine their similarity. We define the dis-tance between two useful instruction sequences as D =δD1 + (1 − δ)D2, where δ is a value minimizing the clus-tering error. Suppose there are M clusters in total. Let Lm bethe number of packets in cluster m, where m (1 ≤ m ≤ M)

denotes the index of each cluster. When a new packet in a sus-picious flow pool is being clustered, we determine whetherto merge the packet into an existing cluster or to create anew cluster to contain the packet. We start by calculatingthe distance between the new packet and every packet inexisting clusters. If we find one or more clusters with aver-age distance below threshold θ , we add the new packet tothe cluster with the minimum distance among them. Oth-erwise, we create a new cluster for the new packet. Werepeat this process until all packets in the suspicious floware clustered.

3.5 STG-based signature generation

After clustering all the packets in the suspicious pool, webuild a signature from each of the clusters. Unlike prior tech-niques, our signature is based on a state transition graph inwhich each state is mapped to each of the four features intro-duced above (Fig. 2). In our approach, the tokens (eitheropcode or operands in a useful instruction sequence) aredirectly visible. The tokens can be the output of any states,which means that each state has a probability distributionover the possible output tokens. For example, in Fig. 3,“EB” and “55” are tokens observed in different states. Thismatches exactly with the definition of hidden Markov model(HMM) [33]; thus, we use HMM to represent the state tran-sition graph for our signature.

More formally, our STG model consists of four states(N = 4), which forms state space S = {S0, S1, S2, S3}. Letλ = {π, A, B} denote this model, where A is the state tran-sition matrix, B is a probability distribution matrix, and π

is the initial state distribution. When a STG model is con-structed from a polymorphic worm, we use the model as ourSTG-based signature. Our STG model is defined as follows:

– State space S = {S0, S1, S2, S3}, where state S0 is thecontrol flow change state, which correspond to the featureF3. State S1 is the decryption state, which corresponds to

123

Page 7: SAS: semantics aware signature generation for polymorphic worm detection

SAS: semantics aware signature generation for polymorphic worm detection 275

the feature F4. State S2 is the iteration state, which cor-responds to the feature F2. State S3 is the GetPC state,corresponding to F1. The reason why we choose 4 stateshere is that from our observations, most of the polymor-phic packets include all (or parts of ) the above four differ-ent kinds of features. These four states can represent mostof the self-contained shellcode. Actually, those states aredefined based on the prior knowledge induced from theworm code analysis and can be extended to more statesin future work.

– Transition matrix A = (ai j )N×N =⎛⎜⎝

a00 a01 a02 a03

a10 a11 a12 a13

a20 a21 a22 a23

a30 a31 a32 a33

⎞⎟⎠

where ai j = P(next state is S j |current state is Si ), ai j ∈ {S × S → [0, 1]}, and ai j

satisfies∑

j ai j = 1 (0 ≤ i, j ≤ N − 1).

– Let Y be the set of a single byte and Y i denote the set ofi-byte sequences. X = {Y, Y 2, Y 3, Y 4} is the token set inour system because a token in a useful instruction con-tains at most four bytes (e.g., “AAFFFFFF”,the last linein Fig. 3c), which corresponds to the word size of a 32-bitsystem. Let Ot (1 ≤ t ≤ |X |) be a token that is visibleat a certain state, and O = {Ot |Ot ∈ X} be the visi-ble token set at the state. For a real instruction sequencewith T tokens in the useful instruction sequence, thet-length visible output sequence is defined as O[1...t] ={O1, O2, . . . , Ot } (t ≤ T ). Then, we can define theprobability set B as B = {bi (k)}, where bi (k) =P(visible token is Ok |current state is Si ).bi (k) is the probability of Xk on state Si , thus satisfying∑

1≤k≤|X | bi (k) = 1.

– Initial state distribution π = {π0, π1, π2, π3}, whereπi = P(the f irst state is Si ) .

Algorithm 1 is adopted from the segment K-means algo-rithm [33] to learn the structure of hidden Markov model.As the same token can appear at different states with differ-ent probabilities, we manage our model to satisfy Ot ∈ Si ifbi (Ot ) > b j (Ot ) for all j �= i (step 2 and step 3). We alsoremove noises by setting a threshold to discard less-frequent

Algorithm 1 State-Transition-Graph Model Learning Algo-rithmInput: A cluster of the useful instructions of the payload O[1..T ]Output: STG Signature λ = {π, A, B} for the input clusterProcedure:

1: map tokens Ot ∈ X(1 ≤ t ≤ T ) to the corresponding states Si ∈{S0, S1, S2, S3}(0 ≤ i ≤ N − 1)

2: calculate initial probability distribution π based on the probabilities of the firsttoken O1 being on each state Si ∈ {S0, S1, S2, S3} // get π

3: generate the frequent token set for each state Si ∈ {S0, S1, S2, S3} and calculatebi (k)(1 ≤ k ≤ |X |) // get B

4: for i = 0 to N − 1 do5: for j = 0 to N − 1 do

6: ai j ← number(Ot∈Si∧Ot+1∈Sj)number(Ot∈Si )

// get A, here predicate number denotes the

frequency of a token

tokens. For example, if maxi

bi (Ot ) is below the threshold

(e.g., θ0), we ignore this token Ot while constructing a STGmodel.

3.6 Semantics aware signature matching process

After we extract STG-based signatures from the suspiciousflow pool, we can use them to match live network packets.Given a new packet to test, our detector first retrieves thepayload of the packet. Assuming that the payload length ism bytes, the detector checks whether this m-byte payloadmatches any of the existing signatures. If it does not matchany signature (i.e., their deviation distance is above a thresh-old θ2), it is considered as a benign packet. If it matches asingle signature of certain type of a worm, it will be classi-fied as the type of worm associated with the signature. If thepacket matches multiple signatures, the detector classifiesthe packet as the one with the smallest distance among thematching signatures. An advantage of our approach is thatwe need not make complicated analysis on the live packetsbut match the packets byte after byte.

To measure the distance between an m-byte (input) pay-load and a signature, we try to identify the first token, startingfrom the first byte of the payload. We form four candidatetokens of length i (i = 1, 2, 3, 4), where the i-th candidatetoken consists of the first i bytes of the payload. As is shownin Fig. 4, we first select (44), (44, 55), (44, 55, 4A), and

Fig. 4 STG signature matching process

123

Page 8: SAS: semantics aware signature generation for polymorphic worm detection

276 D. Kong et al.

Algorithm 2 Packet Matching AlgorithmInput: Byte sequences I : i0...iN ; signature model λ = {π, A, B}; mean: averagevalue for signature; si ze: sliding window size to be matched with signature.Output: f lag: label sequence malicious, benign.

Procedure:1: f lag← benign2: for pos ← 1 to I.length()− si ze do3: read bytes I[pos,pos+t] (t, η ∈ {0, 1, 2, 3}) // I.length(): the length of byte,

I[pos,pos+t]: bytes in I from position pos to pos + t4: t ⇐ arg max

ηb j (I[pos,pos+η])(0 ≤ j ≤ N − 1)

5: α1( j)← π j b j (I[pos,pos+t])

6: S1 ←N−1∑i=0

α1( j), α∗1 ( j)← α1( j)S1

, l ← 1

7: if S1 < θ2 then8: NotMatch(pos); pos ← pos + t // matching from the next position9: continue;10: else11: for cur_pos ← pos + t to pos + si ze do12: read bytes I[cur_pos,cur_pos+t ′] (t ′ ∈ {0, 1, 2, 3}) from I

13: t ′ ← arg maxη

b j (I[cur_pos,cur_pos+η])

14: if Si < θ2 or max(b j (I[cur_pos,cur_pos+t ′])) < θ1 then

15: Ignore(); //ignore this symbol; θ1, θ2 are threshold16: else

17: αl+1( j)← (N−1∑i=0

α∗l (i)ai j )b j (I[cur_pos,cur_pos+t ′]) (0 ≤ j ≤ N−1)

18: Sl+1 ←N−1∑j=0

αl+1( j), α∗l+1( j)← αl+1( j)

Sl+1, l ← l + 1

19: cur_pos ← cur_pos + t ′

20: log P(I |λ)←si ze∑l=1

log Sl

21: if || log P(I |λ)− mean|| < θ3 then22: Match(); flag← malicious; break; // θ3 is threshold23: pos ← pos + t24: return f lag

(44, 55, 4A, 4F) as the candidate tokens. Then, for each can-didate token Ot , we calculate its probability to appear ineach of the four states in our STG model and assign it to thestate that gives the largest probability bi (Ot ). Let max(Pi )

denotes the maximum of the four bi (Ot ). If max(Pi )

is above a threshold θ1, we choose the candidate tokenyielding max(Pi ) as the real token and ignore the others.Otherwise, all of the four candidate tokens are ignored.In either case, we move to the next four bytes of the pay-load. As for the case in Fig. 4, we will start to check thenext four tokens (55), (55, 4A), (55, 4A, 4F), and (55, 4A,4F, 5D). We will repeat the above process until all m bytesare processed. Finally, we sum up the max(Pi ) and calcu-late its distance from the signature. The deviation distanceD is defined as D = || log P[O[1...m]|λ] − mean|| wherelog P[O[1...m]|λ] is the matching probability value for a m-byte packet, mean is an average matching value for a certaintype of training packets. Assuming that there are l packets in acluster of the same type, and the byte length for each packet isTi (1 ≤ i ≤ l), we have mean = 1

l

∑lk=1 log P[O[1...Tk ]|λ].

The detailed algorithm is shown in Algorithm 2 thatis adapted from classical Forward Algorithm from hiddenMarkov model [33]. In step 3 and 12, we use greedy strategyto select a best token to keep an optimal matching sequencebecause an optimal matching sequence always keeps a partial

optimal matching sequence. size is the length of the slidingwindow for a packet to be matched, which we set 2 to 3times the length of token sequence in a signature. We ignoreless-frequent tokens by setting parameters θ1, θ2 to reducethe chances of false positives and false negatives. θ1, θ2, θ3

are configured during the training phase of STG model.

4 Evaluation

We test our system off-line on massive polymorphic pack-ets generated by real polymorphic engines used by attackers(i.e., CLET [10], ADMmutate [24], JempiScodes [1]) andon normal HTTP request/reply traces collected at out labo-ratory computers. CLET, ADMmutate, and JempiScodes areadvanced polymorphic engines that obfuscate the decryptionroutines by metamorphism such as instruction replacementand garbage insertion. CLET also uses spectrum analysis tocounterattack the byte distribution analysis. PexFnstenvMov,PexFnstenvSub, and Pex are polymorphic engines includedin Metaspoit [26] framework. Opcode of the “xor” instructionis frequently found in the decryption routine of PexFnstenv-Mov and PexFnstenvSub. PexFnstenvMov and PexFnstenv-Sub also use the “fnstenv” instruction for the GetPC code.Pex uses “xor” instruction in the decoder part and also usescall to GetPC.

In evaluation, we also use 100,000 non-attack HTTPrequests/responses for two purposes: to compute false-positive rate and to derive noisy flows to attack signatureextraction. The normal HTTP traffic contains 100,000 mes-sages collected for three weeks at seven workstations ownedby seven different individuals. To collect the traffic, a client-side proxy monitoring incoming and outgoing HTTP traf-fic is deployed underneath the web server. Those 100,000messages contain various types of non-attack data, includingJavaScript, HTML, XML, PDF, Flash, and multimedia data,which render diverse and realistic traffic typically found inthe wild. The total size of the traces is over 1.77 GB. We runour experiments on a 2.4 GHz Intel Quad-Core machine with2 GB RAM, running Windows XP SP2.

4.1 Comparison with polygraph and Hamsa

In this section, we evaluate the accuracy (in terms of falsepositives and false negatives) of our algorithm in comparisonwith Polygraph and Hamsa. We compare the three systems intwo cases: without noise injection attack, with noise injectionattack.

Parameter settings The parameters of Polygraph are setas follows. The minimum token length α is set to 2, the min-imum cluster size is set to 2, and the maximum acceptablefalse-positive rate during the signature generation processis set to 1%. Hamsa in our experiments is built from the

123

Page 9: SAS: semantics aware signature generation for polymorphic worm detection

SAS: semantics aware signature generation for polymorphic worm detection 277

source that we downloaded from the Hamsa hompage [20].The maximum acceptable false-positive rate of Hamsa is setto u = 0.01 during the signature generation process. In ourapproach, the parameters θ0, θ1 are used to prune the tokensthat have small probability to match with the STG signature,and parameter θ2 is used to label the deviation distance duringthe packet matching process. These parameters are config-ured as follows: θ0 = 0.016, θ1 = 0.016, θ2 = 12.000.

Polymorphic engine In this experiment, we use CLETbecause it implements spectrum analysis to attack the bytedistribution analysis performed by existing signature extrac-tors. We generate 1,000 worm instances from CLET, amongwhich 400 instances are used as the training data to generatesignatures, and 600 instances are used to compute the false-negative rate. Four hundred attack messages are of the samepayload type (e.g., connectExec). We also use 100,000 non-attack HTTP requests/responses to compute false-positiverate.

Comparison without noise injection We compare ourmethod with Polygraph and Hamsa, without consideringnoise injection. Fed with the same 400 attack messages, thesignatures generated by Hamsa and the conjunction signaturegenerated by Polygraph are all ‘\x8b’:1,‘\xff\xff\xff’:1,‘\x07\xeb’:1. The state transition path of our signature is(S0 → S1 → S0 → S3). Token sequences ‘\xff\xff\xff’and ‘\x07\xeb’ are the only invariant tokens appearing inthe useful instruction sequences (Fig. 5).

Fig. 5 STG signature example. The bytes used by the signature aremarked in rectangular boxes

Comparison under noise injection We compare SAS,Polygraph, and Hamsa, assuming 1:1 attack-to-noise ratioin the suspicious flow pool. To add the crafted noise to thesuspicious flow pool, we adopt the method used by Per-disci et al. [32]. For each malicious packet wi , we createthe associated fake anomalous packet fi by modifying thecorresponding packet wi . The way to make crafted noisypacket fi is divided into the following six steps. (Step 1)f 0i : create a copy of wi . (Step 2) f 1

i : permute bytes f 0i

randomly. (Step 3) a[ ]: copy k substrings of length l fromwi to array a, but do not copy the true invariant. (Step 4)f 2i : copy the fake invariant substring into f 1

i . (Step 5) f 3i :

inject m-length substring of string v into ( f 2i ), we generate

n (n > m) bytes of string v = {v1, v2, . . . , vn} by select-ing the contiguous bytes in the innocuous packet which sat-isfy 0.05 < P(v|innocuous packets) < 0.20. (Step 6) f 4

i :obfuscate the true invariant by substituting the true invariantbytes in the packet.

To craft non-attack derived noises, we use our 10,000 nor-mal HTTP messages. The suspicious flow pool is composedof 400 CLET-mutated instances and 400 crafted noises. Weconfigure the parameters of noise generator as k = 3, l =5, n = 6, and m = 3. The parameters for SAS, Polygraph,and Hamsa are set as the same as in Case-1.

When we compare SAS with Polygraph, we ignore “trueinvariants” in Step 3 and 6 because we do not know the trueinvariants until the signature is generated. Instead, we per-mute the bytes more randomly to separate and distribute con-tiguous bytes before copying substrings of wi in Step 3. Atopthis, we use even more sophisticated noise injection when wecompare SAS with Hamsa. Specifically, in Step 5, we choosea string v, which satisfies P(v|innocuous packets) < u (uis parameter). Since we set u as described above, Hamsa’sfalse-positive rate will not exceed u even if the injected noisesare taken as signatures.

Comparison results Figures 6a and b show the false-positive and false-negative rates of SAS, Polygraph, and

(a) (b) (c)

Fig. 6 a Comparison of SAS and Polygraph on CLET. b Comparison of SAS and Hamsa on CLET. c Impact of parameters on Different PolymorphicEngines

123

Page 10: SAS: semantics aware signature generation for polymorphic worm detection

278 D. Kong et al.

(a) (b)

(c) (d)

Fig. 7 a Comparison of SAS and Polygraph on Admutate. b Comparison of SAS and Hamsa on ADmutate. c Comparison of SAS and Polygraphon PexFnstenvMov. d Comparison of SAS and Hamsa on PexFnstenvMov

Hamsa in both experiment cases. In Case-1 experiment (i.e.,without noise injection), all the three systems show similaraccuracy. Although SAS shows slightly higher false-positiverate than Polygraph and Hamsa, the false-positive rates ofall three systems are already very low (<0.0008). In Case-2experiment (i.e., with noise injection), the false-positive andthe false-negative rates of SAS have not been affected by thecrafted noise injected to the suspicious flow pool. In contrast,the signature generation process of Polygraph and Hamsahas been greatly misled to add fake invariants taken from thecrafted noises, which results in extremely high false-negativerate. As a result, the signatures generated by Polygraph andHamsa miss more than 20% of attack messages. The false-positive and false-negative rates of Polygraph and Hamsa arestill lower than 1%, which is because they have a thresholdof maximum false-positive rate (say 1%).

Comparison more with polygraph and Hamsa In theprevious section, we compare our approach with Polygraphand Hamsa in terms of false positives and false negativesin the context of worm packets generated by the polymor-phic engine CLET. Here, we compare our approach with

worm packets generated by Admutate and PexFnstenvMov.The comparisons are also made on the following two condi-tions: (1) without noise injection and (2) under noise injec-tion. In both cases, 400 attack messages of each type aregenerated, and 10,000 normal HTTP messages are fed into asuspicious flow pool. Moreover, in noise injection case, 400crafted noises are injected into the suspicious flow pool. Thenoisy packets are generated in the same way as introduced incomparison in the context of CLET [32]. The parameters ofPolygraph and Hamsa are set the same with the comparisonin the context of CLET. The results are shown in Fig. 7. Wecan see clearly that our approach performs far more better inthe case of the noise injection.

Discussion more on the noise injection In the processof crafted noise injection, we adopt the method used by Per-disci et al. [32]. Different noise injection produced differentresults during the process of misleading the signature genera-tion, which is shown in Figs. 6a, b, and 7a–d. From intuition,we can understand that different noisy bytes may mislead thenaive signature learning algorithm distinctly. More detailshave been discussed in the previous paper [32]. However,

123

Page 11: SAS: semantics aware signature generation for polymorphic worm detection

SAS: semantics aware signature generation for polymorphic worm detection 279

Table 1 Accuracy of theSTG-based signatures generatedby SAS

Polymorphic engine False False State transition path ofpositive (%) negative (%) STG-based signature

PexFnstenvMov 0.075 0.40 (S3 → S1 → S2)

CLet 0.072 0.42 (S0 → S1 → S0 → S3)

Admutate 0.062 0.55 (S0 → S1 → S2 → S3)

PexFnstenvSub 0.082 0.51 (S3 → S1 → S2)

Pex 0.070 0.49 (S3 → S1 → S2)

JempiScodes 0.127 0.63 (S0 → S1 → S2 → S3)

here it is clearly to see that our approach is very robust tothose kinds of noise injections.

4.2 Per-polymorphic engine evaluation

In this experiment, we evaluate the impact of different param-eter settings on our approach. We use 6,000 worm instancesgenerated by CLET, Admutate, PexFnstenv-Mov, JempiS-codes, PexFnstenvSub, and Pex. We feed our signatureextractor with 2,400 out of the 6,000 worm instances (400instances from each type of worm) to generate STG-basedsignatures. Then, we use the remaining worm instances toevaluate the extracted signatures. We also inject 20,000 nor-mal packets into the suspicious flow pool to make the packetclustering more difficult.

To avoid the ad hoc way of setting the parameters, theparameter δ is first set to an initial value and then adjusteduntil all the packets are clustered correctly. In our setting, wefind that the structural information is more important thanstatistical information. When we set parameter δ = 0.8, allthe packets in the suspicious pool are grouped in the rightcluster. We aim to find an appropriate δ for the right cluster-ing. The δ can be tuned based on the feedback of clusteringresult. One issue we want to emphasize here is that the set-ting of δ may influence the result. We try to pick out the bestavailable parameters after different tryings. In fact, differentclustering methods may be employed to handle these diversi-fied and complicated data. How to cluster the worm packetswith the least clustering error still remains a challenge issue.Since it is not the most focus point in this paper, we did notgive more comparisons and discussions of different cluster-ing approaches here.

Discussion on the parameter influence We test thefalse-negative and false-positive rates using the remaining36,00 attack instances and the 200,000 normal HTTP mes-sages, respectively (Table 1). We also evaluate the influenceof parameter changes on signature matching as shown inFig. 6c on Polymorphic engine CLET, PexFnStenvMov, andAdmutate, where each data point stands for one group ofparameter settings. We change the value of θ1, θ2 to see

how false-positive rate and false-negative rate change. In ourexperiment, we observe the lowest false-positive rate whenwe set the parameters as θ1 = 0.018, θ2 = 12.000. Althoughwe do not present the entire results due to page limit, ourexperiment results show that the false-negative rate decreasesas θ1 increases. Also, the false-positive rate increases as θ2

increases. These observations are confirmed from the designof our algorithm, as higher θ1 would filter more noises while ahigher θ2 would block more normal packets. The parameters,in practice, can be tuned based on the feedback from trainingand testing dataset, so that we get a locally optimized false-positive rate. Table 1 shows the best configurations obtainedby the above method.

Small training examples In the previous experiment,we use 40% of training samples from each polymorphicengine to obtain the signature comparison results. However,in reality, we may not achieve the luxury of several hundredtraining examples. The advantage of using hidden Markovmodel is that we can still get pretty good results even ifwe do not have many instances as training samples. Thehidden Markov model has very good inference capabilityin case of small training examples since each instance isan-order sequence containing much information for signatureextraction. We make experiment on the 1000 polymorphicworm instances generated from the CLET, Admutate poly-morphic engines, respectively. We choose 100, 200, 300,400 instances as the training examples, respectively, and theremaining examples are left for testing the false-negative rate.Similarly, we use the 10,000 normal HTTP messages to getthe false-positive rate. The comparison results are shownin Fig. 8. We can see that with the decrease in the train-ing examples, the false-negative rate and the false-positiverate increase somehow but very small. One point we wantto emphasize here is that with the changes of the numbersof the training examples, the parameters(e.g.,θ1, θ2) will bechanged according to our algorithm, which makes our algo-rithm more robust to the influence of the number of the train-ing examples. The essential reason is that similar semanticsare captured by the hidden Markov model after the semanticanalysis module proposed in our framework.

123

Page 12: SAS: semantics aware signature generation for polymorphic worm detection

280 D. Kong et al.

Fig. 8 Different trainingexamples influence

Table 2 Performance evaluation

Polymorphic engine Training Matching Analysistime (s) time (s) throughput (Mbps)

PexFnstenvMov 22.901 1.783 10.534

CLet 31.237 2.879 13.655

Admutate 24.833 1.275 12.901

PexFnstenvSub 27.435 2.123 11.097

Pex 21.504 2.463 11.512

JempiScodes 32.198 2.637 14.598

4.3 Performance evaluation

The time complexity of the signature learning algorithm isO(N 2T P), where T is the length of token sequence, P isthe number of the suspicious packets in a clustering, and Nis the number of states. The time complexity of our signaturematching algorithm is O(N 2S · L), where L is the averagelength of token sequences in a signature, S is the total lengthof input packets to match, and N is the number of states. Thetraining time, matching time, and analysis throughput foreach polymorphic engine are shown in Table 2. The trainingtime includes the time to extract useful instructions frompackets. The matching time is the total elapsed time to match600 mutations generated by each polymorphic engine.

From above analysis, we can clearly see that if differ-ent numbers of states in HMM are chosen, the efficiencyof the signature learning and signature matching algorithmsis influenced. For example, if we set N = 5, the time costfor the signature learning and matching algorithms will bothincrease quadratic of �N compared with N = 4. However,signature training is usually done off-line by collecting databeforehand, it is not so sensitive to the time cost. For thesignature matching part, it is still acceptable since N is notat exponential place and can be adapted to the requirementof online detection. Actually, there is a trade-off between the

size of N in hidden Markov model with the detection accu-racy and also the throughput. The empirical studies show thatwe can usually set N = 4 to achieve good detection accuracywith acceptable throughput.

5 Related work

Pattern extraction signature generation There are a lot ofwork on pattern-based signature generation, including hon-eycomb [17], earlybird [35], and autograph [16], which hadbeen shown not to be able to handle polymorphic worms.Polygraph [28] and Hamsa [21] are pattern-based signaturegeneration algorithms, and they are more capable of detect-ing polymorphic worms, but vulnerable to different kindsof noise injection attacks. There are also rich researches onattacks against pattern extraction algorithms. Perdisci et al.[32] present an attack that adds crafted noises into the sus-picious flow to confuse the signature generation process.Paragraph [29] demonstrates that Polygraph and Hamsa arevulnerable to attacks as long as attackers can construct thelabeled samples randomly to mislead the training classifier,and this attack can also prevent or severely delay genera-tion of an accurate classifier. Allergy attacks [8] force thesignature generation algorithm to generate signatures thatcould match the normal traffic, thus introducing high false-positive rate. Gundy et al. [14] present a class of feature omis-sion attacks on signature generation process that are poorlyaddressed by Autograph and Hamsa. Polymorphic blendingattacks [11] are presented by matching the byte frequencystatistics with normal traffic to evade detection. Theoreti-cal analysis of limits of different signature generation algo-rithms is given in [39]. Gundy et al. [13] show that web-basedpolymorphic worms do not necessarily have invariant bytes.A game-theoretical analysis on how a detection algorithmand an adversary could adapt to each other in an adversarialenvironment is introduced in [31]. Song et al. [37] study the

123

Page 13: SAS: semantics aware signature generation for polymorphic worm detection

SAS: semantics aware signature generation for polymorphic worm detection 281

possibility of deriving a model for representing the generalclass of code that corresponds to all possible decryption rou-tines and conclude that it is infeasible. Our work combines thesemantic analysis in the signature generation process, mak-ing it robust to many noise injection attacks (e.g., allergyattack and red herring attack).

Semantic analysis Researches have presented seman-tic-based techniques by making static and dynamic anal-ysis on the binary code. Polychronakis et al. [19] havepresented emulation-based approach to detect polymorphicpayload by emulating the code and detecting decryptionroutines through dynamic analysis. Libemu [2] is anotherattempt to achieve shellcode analysis through code emula-tions. Compared with their works, our approach has higherthroughput and cannot be attacked by anti-emulation tech-niques. Brumley et al. [6] propose to automatically createvulnerability signatures for software. Cover [23] exploitsthe post-crash symptom diagnosis and addresses space ran-domization techniques to extract signatures. TaintCheck [30]exploits dynamic dataflow and taint analysis techniques tohelp find the malicious input and infer the properties ofworms. ABROR [22] automatically generates vulnerabil-ity-oriented signatures by identifying typical characteris-tics of attacks in different program contexts. Sigfree [41]detects the malicious code embedded in HTTP packets bydisassembling and extracting useful code from the packets.Spector [4] is a shellcode analysis system that uses sym-bolic execution to extract the sequence of library calls andlow-level execution traces generated by shellcode. Christod-orescu et al. [7] present a malware detection algorithm byincorporating instruction semantics to detect malicious pro-gram traits. Our motivation is similar, but our work is specificto network packet analysis instead of for file virus. STIIL [40]uses static taint and initialization analysis to detect exploitcode embedded in data streams/requests targeting at web ser-vices. Kruegel et al. [18] present a technique based on thecontrol flow structural information to identify the structuralsimilarities between different worm mutations. A new mali-cious shellcode detection methodology is presented in [5]based on analysis of the input packet information by taking

the snapshot of the process’s virtual memory. In contrastto their work, our work is to generate signatures based onsemantic and statistic analysis.

6 Security analysis

6.1 Strength

Our semantic-based signatures can filter the noises in thesuspicious flow pool and prune the useless instructions thatare otherwise possibly learned as signature; thus, it has goodnoise tolerance. As the STG signature is more complicatedthan previous signatures (e.g., token-sequence signature), itis much harder for attackers to ruin our automatic signa-ture generation by crafting packets bearing both the tokensof normal and attack packets compared with previous sig-natures. Moreover, even if the hackers change the contentsof the attack packets a lot, they can hardly evade our detec-tion since our signature is not based on syntactic patternsbut based on semantic patterns. In addition, the STG sig-nature can match unknown polymorphic worms (which ourdetector has not been trained with) since it has learned cer-tain semantics of the decryption routine from existing poly-morphic worms. Our STG signature matching algorithmintroduces low overhead (analysis throughput is more than10 Mbps); thus, our detector is fast enough to match live pack-ets. Some anti-disassemble techniques like junk byte inser-tion, opaque predicate, and code overlap all aim to immobi-lize linear sweep disassembly algorithms. The disassemblerof the STG signature generation approach is a recursive tra-versal algorithm, which makes our approach robust to suchtypes of anti-disassemble techniques. In Table 3, we sum-marize our benefit in comparison with other signature gen-eration approaches. For STG, it is robust to the attacks fill-ing crafted bytes in the wildcard bytes of the packets (e.g.,coincidental-pattern attack [28] and the token-fit attack [21])since these packets usually fail to pass our semantic anal-ysis process. It is robust to the innocuous pool poisoningattack [28] and allergy attack [8] because our technique can

Table 3 Comparison of SAS

with polygraph and HamsaComparison Polygraph Hamsa SAS

Content (behavior) detection Content Content Content

Semantic related Semantic free Semantic free Semantic related

On-line detection speed Fast Fast Fast

(Crafted) noise tolerance Some Medium Good

Token-fit attack resilience Nearly no Medium Good

Coincidental attack resilience Nearly no Medium Good

Allergy attack resilience Some Some Good

Signature simplicity Simple Simple Complicated

123

Page 14: SAS: semantics aware signature generation for polymorphic worm detection

282 D. Kong et al.

filter the normal packets out for signature generation. Asit is a probability-based algorithm, the long-tail attack [28]will not thwart our matching process. Finally, by discover-ing meanings of each token (i.e., which token is exhibitingwhich feature), our approach explores beyond traditional sig-natures that leverage only the syntactic patterns to matchworm packets.

6.2 Limitations

Here, we discuss about the limitations of the proposed tech-nique and possible methods to mitigate these limitations.First, based on static analysis that cannot handle somestate-of-the-art code obfuscation techniques (e.g., branch-function obfuscation and memory access obfuscation), wecannot generate appropriate signatures if the semantic anal-ysis fails to analyze the packets in the suspicious flow pool.This can be solved through more sophisticated semanticanalysis, such as symbolic execution and abstract interpre-tation techniques. For the techniques thwarting the recur-sive traversal algorithms used in semantic analysis, if wefail to make semantic analysis, we also cannot get the signa-tures. Second, our technique can be evaded if smart attack-ers use more sophisticated encryption and obfuscation tech-niques such as doubly encrypted shellcode with invariantsubstitution. The proposed technique will not be efficientagainst the metamorphic worm where there can be consid-erable variations for each sample of worms. Also, for thenon-self-contained code [19], there may be absence of fea-tures for clustering to generate the signatures. To addressthese issues, emulation-based payload analysis techniquescan be used in the signature extractor and the attack detec-tor; however, state-of-the-art emulation-based techniques arestill lack of performance to be used in a live packet analysis.Although one may doubt the utility of byte-level signatures(e.g., it could not handle the packed code), its performanceis good for practical deployment compared with the emula-tion-based approaches. Also, you may wonder the attack-ers can easily fool the clustering algorithms by avoidingusing such features. In reality, it is rather difficult to con-struct these packets without any of those features, but it ispossible that arbitrarily complex shellcode can be createdlike prose words and sentences in English language [25]. Forthe packed malware embedded in these HTTP data streams,they are also immune to our approach. The solution is tounpack those streams before signature generation. To sum-marize, our semantic analysis step is faced with the samedifficulty as the other semantic approaches (e.g., the unde-cidability of static analysis and the effective binary obfus-cations which may thwart both the disassembly and codeanalysis steps).

7 Conclusion

In this paper, we have proposed a novel semantic-awareprobability algorithm to address the threat of anti-signa-ture techniques, including polymorphism and metamor-phism. Our technique distills useful instructions to generatestate-transition-graph-based signatures. Since our signaturereflects certain semantics of polymorphic worms, the pro-posed signature is resilient to the noise injection attacks tothwart prior techniques. Our experiment results have shownthat our approach is both effective and scalable.

Acknowledgments The authors would like to thank Dinghao Wu forhis help in revising the paper. The work of Zhu was supported byCAREER NSF-0643906. The work of Jhi and Liu was supported byARO W911NF-09-1-0525 (MURI), NSF CNS-0905131, AFOSR FA9550-07-1-0527 (MURI), NSF CNS-0916469, and AFRL FA8750-08-C-0137. The work of Kong, Xi was supported by Chinese High-techR&D (863)Program 2006AA01Z449, China NSF-60774038.

References

1. Jempiscodes—a polymorphic shellcode generator. http://www.shellcode.com.ar/en/proyectos.html

2. Baecher, P., Koetter, M.: Getting around non-executable stack (andfix). http://www.libemu.carnivore.it/

3. Bania, P.: Evading network-level emulation. http://www.packetstormsecurity.org/papers/bypass/pbania-evading-nemu2009.pdf

4. Borders, K., Prakash, A., Zielinski., M.: Spector: automaticallyanalyzing shell code. In: Proceedings of the 23rd Annual Com-puter Security Applications Conference, pp. 501–514 (2007)

5. Gu, B., Bai, X., Yang, Z., Adam, C., Xuan, D.: Malicious shellcodedetection with virtual memory snapshots. In: Proceedings of IEEEInternational Conference on Computer Communications (IEEE IN-FOCOM) (2010)

6. Brumley, D., Caballero, J., Liang, Z., Newsome, J., Song., D.:Towards automatic discovery of deviations in binary implementa-tions with applications to error detection and fingerprint generation.In: Proceedings of the 16th USENIX Security (2007)

7. Christodorescu, M., Jha, S., Seshia, S., Song, D., Bryant, R.:Semantics-aware malware detection. In: 2005 IEEE Symposiumon Security and Privacy (2005)

8. Chung, S.P., Mok, A.K.: Advanced allergy attacks: Does a corpusreally help. In: Recent Advances in Intrusion Detection (RAID),pp. 236–255. Springer, Berlin (2007)

9. Collberg, C., Thomborson, C., Low., D.: A taxonomy of obfus-cating transformations. In: Technical Report 148, University ofAuckland (1997)

10. Detristan, T., Ulenspiegel, T., Malcom, Y., Superbus, M., Under-duk, V.: Polymorphic shellcode engine using spectrum analysis.http://www.phrack.org/show.php?p=61&a=9

11. Fogla, P., Sharif, M., Perdisci, R., Kolesnikov, O., Lee, W.: Poly-morphic blending attacks. In: Proceedings of The 15th USENIXSecurity Symposium (2006)

12. Forsell, M., Leppänen, V.: Mtpa—a processor architecture for mp-socs employing the moving threads paradigm. In: PDPTA, pp.198–204 (2009)

13. Gundy, M.V., Balzarotti, D., Vigna, G.: Catch me, if you can: Evad-ing network signatures with web-based polymorphic worms. In:Proceedings of the First USENIX Workshop on Offensive Tech-nologies (WOOT) Boston, MA (2007)

123

Page 15: SAS: semantics aware signature generation for polymorphic worm detection

SAS: semantics aware signature generation for polymorphic worm detection 283

14. Gundy, M.V., Chen, H., Su, Z., Vigna, G.: Feature omission vulner-abilities: thwarting signature generation for polymorphic worms.In: Proceeding of Annual Computer Security Applications Confer-ence (ACSAC) (2007)

15. Kc, G.S., Keromytis, A.D.: e-nexsh: achieving an effectively non-executable stack and heap via system-call policing. In: ACSAC,pp. 286–302 (2005)

16. Kim, H.A., Karp, B.: Autograph: toward automated, distributedworm signature detection. In: Proceedings of the 13th UsenixSecurity Symposium (2004)

17. Kreibich, C., Crowcroft., J.: Honeycomb: creating intrusion detec-tion signatures using honeypots. In: Proceedings of the Workshopon Hot Topics in Networks (HotNets) (2003)

18. Krugel, C., Kirda, E.: Polymorphic worm detection using structuralinformation of executables. In: 2005 International Symposium onRecent Advances in Intrusion Detecion (2005)

19. Krügel, C., Lippmann, R., Clark, A.: Emulation-based detectionof non-self-contained polymorphic shellcode. In: Recent Advancesin Intrusion Detection, 10th International Symposium, LectureNotes in Computer Science, vol. 4637. Springer, Berlin (2007)

20. Li, Z.: Hamsa code. http://www.zhichunli.org/21. Li, Z., Sanghi, M., Chen, Y., Kao, M.Y., Chavez, B.: Hamsa: fast

signature generation for zero-day polymorphic worms with prov-able attack resilience. In: IEEE Symposium on Security and Privacy(2006)

22. Liang, Z., Sekar., R.: Automatic generation of buffer overflowattack signatures: An approach based on program behavior mod-els. In: Proceedings of the Anual Computer Security ApplicationsConference (2005)

23. Liang, Z., Sekar., R.: Fast and automated generation of attack signa-tures: A basis for building self-protecting servers. In: Proceedingsof the 12th ACM Conference on Computer and CommunicationsSecurity (2005)

24. Macaulay, S.: Admmutate: polymorphic shellcode engine. http://www.ktwo.ca/security.html

25. Mason, J., Small, S., Monrose, F., MacManus, G.: English shell-code. In: ACM Conference on Computer and CommunicationsSecurity. ACM (2009)

26. Moore, H.: The metasploit project. http://www.metasploit.com27. Moser, A., Kruegel, C., Kirda, E.: Limits of static analysis for

malware detection. In: Proceedings of the 23rd Anual ComputerSecurity Applications Conference (2007)

28. Newsome, J., Karp, B., Song, D.: Polygraph: automatic signa-ture generation for polymorphic worms. In: IEEE Symposium onSecurity and Privacy (2005)

29. Newsome, J., Karp, B., Song, D.: Paragraph: thwarting signaturelearning by training maliciously. In: Recent Advances in IntrusionDetection (RAID), pp. 81–105. Springer, Berlin (2006)

30. Newsome, J., Song, D.: Dynamic taint analysis for automatic detec-tion, analysis, and signature generation of exploits on commod-ity software. In: Proceedings of Network and Distributed SystemSecurity Symposium (2005)

31. Pedro, N.D., Domingos, P., Sumit, M., Verma, S.D.: Adversarialclassification. In: 10th ACM SIGKDD Conference On KnowledgeDiscovery and Data mining, pp. 99–108 (2004)

32. Perdisci, R., Dagon, D., Lee, W.: Misleading worm signature gener-ators using deliberate noise injection. In: Proceedings of The 2006IEEE Symposium on Security and Privacy (2006)

33. Rabiner, L.R.: A tutorial on hidden markov models and selectedapplications in speech recognition. Proceedings of the IEEE 77(2),257–286 (1999)

34. Ray, E.: Ms-sql worm. http://www.sans.org/resources/malwarefaq/ms-sql-exploit.php

35. Singh, S., Estan, C., Varghese, G., Savage, S.: Earlybird System forReal-Time Detection of Unknown Worms, Technical Report. Uni-versity of California at San Diego, San Diego (2003)

36. Smirnov, A., cker Chiueh, T.: Dira: Automatic detection, identifi-cation and repair of control-hijacking attacks. In: NDSS (2005)

37. Song, Y., Locasto, M.E., Stavrou, A., Keromytis, A.D., Stol-fo., S.J.: On the infeasibility of modeling polymorphic shellcode.In: Proceedings of the 14th ACM conference on Computer andCommunications Security(CCS), pp. 541–551 (2007)

38. Szor, P.: The Art of Computer Virus Research and Defense.pp. 112–134. Addison Wesley, Upper Saddle River (2005)

39. Venkataraman, S., Blum, A., Song., D.: Limits of learning-basedsignature generation with adversaries. In: Proceedings of the 15thAnnual Network and Distributed System Security Symposium(2008)

40. Wang, X., Jhi, Y.C., Zhu, S., Liu, P.: Still: exploit code detectionvia static taint and initialization analyses. In: Proceedings of AnualComputer Security Applications Conference (ACSAC) (2008)

41. Wang, X., Pan, C.C., Liu, P., Zhu, S.: Sigfree: a signature-freebuffer overflow attack blocker.In: 15th Usenix Security Sympo-sium (2006)

123