metamorphic malware analysis and detection
ABSTRACT:
Modern metamorphic and polymorphic malware mutates its code using code obfuscation and encryption to thwart detection, so conventional signature-based scanners fail to detect it. To address the problem of detecting known variants of metamorphic malware, we propose a method based on bioinformatics techniques that are used effectively for protein and DNA matching. Instead of exact signature matching, more sophisticated signatures are extracted using multiple sequence alignment (MSA). The results show that the proposed method identifies malware variants with few false alarms and misses. The detection rate achieved with the proposed method is also better than that of the commercial antivirus products used in the study.

Status:
This work has been accepted by the 8th IEEE International Conference on Innovations in Information Technology (Innovations'12).

Link:
http://ieeexplore.ieee.org/xpl/login.jsp?reload=true&tp=&arnumber=6207739&url=http://ieeexplore.ieee.org/iel5/6203543/6207707/06207739.pdf?arnumber=6207739
Bioinformatics Techniques for Metamorphic Malware Analysis
and Detection
Malaviya National Institute of Technology, Jaipur
Supervisors:
Dr. M. S. Gaur
Dr. V. Laxmi
By:
Grijesh Chauhan
(2009PCP116)
Outline
• Malware & Metamorphic malware
• Motivation
• Objective
• Bioinformatics Techniques
• MOMENTUM
• Dataset
• Result & Analysis
• References
Malware
• Malware is software designed to infect systems and replicate.
• Threats
  - Loss of data
  - Degraded computer system performance
  - Identity threat
• Two broad categories
  - Metamorphic: the virus body changes on each replication
  - Polymorphic: encrypts the malicious payload to avoid detection
Metamorphic Malware [1/2]
• Metamorphic malware variants have similar functionality but different structure and signatures.
• This is similar to genetic diversity in biology.
[Figure: a metamorphic engine producing Variant-1, Variant-2, and Variant-3 with reordered code]
Metamorphic Malware [2/2]
• Metamorphic malware automatically re-codes itself each time it propagates or is distributed.
• Conventional signature-based scanners are ineffective at detecting variants of the same malware.
• More sophisticated signatures are required to detect metamorphic variants of malware.
Motivation
• Variants of metamorphic malware are generated using a small embedded metamorphic engine to defeat detection [2].
• A limited number of instructions is used to generate variants so as to preserve functionality.
• Like DNA and protein sequences, metamorphic malware mutates from generation to generation while inheriting functionality and some structural similarity from the ancestral malware.
Objective
• To devise a method for detecting metamorphic malware and its variants.
• To extract abstract signatures using bioinformatics sequence alignment.
  - The base code is preserved across generations but obfuscated using junk code, equivalent instructions, etc.
• To identify unseen malware samples using the best representative signatures (group or single) of a family.
Sequence Alignment [1/2]
• Sequence alignment is a way of arranging DNA or protein sequences to identify regions of similarity, from which functional, structural, or evolutionary relationships can be inferred.
• Alignment methods
  - Global alignment: aligns sequences end to end.
  - Local alignment: aligns a substring of one sequence with a substring of the other.
  - Multiple sequence alignment (MSA): aligns more than two sequences.
Sequence Alignment [2/2]
• Global alignment
  L G P S S K Q T G K G S - S R I D N
  L N - I T K S A G K G A I M R L D A
• Local alignment
  - - - - - - T G - G - - - - - - -
  - - - - - - A G K G - - - - - - -
• Alignment parameters
  - Match
  - Mismatch
  - Gap
[Diagram label: Point of Mutation]
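A minimal sketch of how global and local alignment scores could be computed over mnemonic sequences, using the standard Needleman-Wunsch and Smith-Waterman recurrences. The scoring values (match +1, mismatch -1, gap -1), the function names, and the sample mnemonic sequences are illustrative assumptions, not values taken from the slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: both sequences are aligned end to end."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Leading gaps are penalised in a global alignment.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    return score[-1][-1]


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score: best-scoring pair of substrings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            best = max(best, score[i][j])
    return best


# Hypothetical mnemonic sequences sharing a common core.
x = ["nop", "push", "mov", "call", "ret"]
y = ["xor", "push", "mov", "call", "jmp"]
print(global_alignment_score(x, y), local_alignment_score(x, y))
```

With these assumed scores the local score exceeds the global score, because the differing first and last mnemonics are penalised only when the sequences must be aligned end to end.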
Multiple Sequence Alignment
• MSA is an extension of pairwise alignment to more than two sequences.
• It is used to identify conserved regions across a group of sequences.

M1    M2    M3    M4    M5
add   add   add   -     add
-     push  push  push  push
mov   mov   mov   mov   mov
-     call  jmp   jz    jmp
jmp   jmp   mov   mov   mov

• Mi: i-th malware instance
Implementation of MSA
• MSA is implemented using the progressive technique (ClustalW [9]).
• Progressive MSA follows three steps:
  - Determine the similarity between each pair of sequences by pairwise alignment.
  - Construct a guide tree (phylogenetic tree) to represent the evolutionary relationships.
  - Build the MSA by aligning groups in the order given by the guide tree, from the most closely related to the most distant.
Phylogenetic Tree
• A phylogenetic tree depicts the evolutionary relationships among the sequences.
• It is used to form groups of similar viruses.
• It guides MSA progressively, so that closer groups are aligned first.
[Figure: tree over sequences A, B, D, E, and F, written in nested form as ((E,(A,B)),(D,F))]
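A guide tree of this shape can be sketched with simple average-linkage clustering over a pairwise distance matrix. This is a simplified stand-in, not ClustalW's own tree-building procedure; the function name and all distance values below are hypothetical and were chosen only so that the grouping resembles the example tree.

```python
from itertools import combinations


def build_guide_tree(names, dist):
    """Average-linkage clustering; returns the tree as nested tuples.

    `dist` maps frozenset({a, b}) to the distance between a and b.
    """
    # Each cluster is (subtree, list of leaf names it contains).
    clusters = [(name, [name]) for name in names]

    def cluster_distance(c1, c2):
        pairs = [(p, q) for p in c1[1] for q in c2[1]]
        return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters first.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(clusters[ij[0]],
                                                   clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


names = ["A", "B", "D", "E", "F"]
# Hypothetical distances: everything is far apart except a few close pairs.
dist = {frozenset(pair): 0.8 for pair in combinations(names, 2)}
dist[frozenset(("A", "B"))] = 0.1
dist[frozenset(("D", "F"))] = 0.2
dist[frozenset(("A", "E"))] = 0.3
dist[frozenset(("B", "E"))] = 0.3

tree = build_guide_tree(names, dist)
print(tree)
```

The resulting tree groups A with B, then E with (A,B), and D with F, which is the same topology as ((E,(A,B)),(D,F)) up to the order in which sibling groups are written.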
Similarity Measurement
• Alignment score: the sum of the scores specified for each aligned pair of mnemonics. The higher the score, the more similar the sequences.
• Distance (d): calculated using the following formulas. The higher the distance, the more dissimilar the sequences.

$$N_d = \frac{\#\text{mismatch}}{\#\text{mismatch} + \#\text{match}}$$

$$L_d = \#\text{mismatch} + \#\text{match} + \#\text{gap}$$

• Nd is the normalized distance; Ld is the Levenshtein distance.
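As a small worked example, the normalized distance Nd = #mismatch / (#mismatch + #match) can be computed by counting matched, mismatched, and gapped positions in an aligned pair. The sketch below applies this to the global alignment example shown earlier in the deck; the function names and the use of "-" as the gap symbol are assumptions for illustration.

```python
def alignment_counts(x, y, gap="-"):
    """Count (matches, mismatches, gaps) over two aligned sequences."""
    if len(x) != len(y):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for p, q in zip(x, y):
        if p == gap or q == gap:
            gaps += 1
        elif p == q:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


def normalized_distance(x, y, gap="-"):
    """Nd = #mismatch / (#mismatch + #match); gapped positions are ignored."""
    matches, mismatches, _ = alignment_counts(x, y, gap)
    total = matches + mismatches
    return mismatches / total if total else 0.0


# The two rows of the global alignment example.
row1 = "L G P S S K Q T G K G S - S R I D N".split()
row2 = "L N - I T K S A G K G A I M R L D A".split()

print(alignment_counts(row1, row2))     # (matches, mismatches, gaps)
print(normalized_distance(row1, row2))
```

For this pair there are 7 matches, 9 mismatches, and 2 gaps, so Nd = 9 / 16.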
Identification of Base Malware
• The base malware of a family is the instance most similar to all the others, i.e. the one with the highest sum of pairwise alignment scores (SoP [3]).

Score matrix:

      M1   M2   M3   M4   SoP
M1     -    7   -2    1     6
M2     7    -   -3    0     4
M3    -2   -3    -    1    -4
M4     1    0    1    -     2

• M1 is the base malware.
[Figure: diagram relating M2, M3, and M4 to the base malware M1]
• Mi: i-th malware instance
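The base-malware selection can be illustrated with the score matrix from this slide: sum each instance's pairwise alignment scores and pick the instance with the highest sum. This is a minimal sketch; the dictionary layout and function names are assumptions.

```python
# Pairwise alignment scores taken from the score matrix on the slide.
scores = {
    "M1": {"M2": 7, "M3": -2, "M4": 1},
    "M2": {"M1": 7, "M3": -3, "M4": 0},
    "M3": {"M1": -2, "M2": -3, "M4": 1},
    "M4": {"M1": 1, "M2": 0, "M3": 1},
}


def sum_of_pairs(score_matrix):
    """Sum of pairwise scores (SoP) of each instance against all others."""
    return {name: sum(row.values()) for name, row in score_matrix.items()}


def base_malware(score_matrix):
    """Instance with the highest SoP, i.e. most similar to the rest."""
    sop = sum_of_pairs(score_matrix)
    return max(sop, key=sop.get)


print(sum_of_pairs(scores))
print(base_malware(scores))
```

The sums reproduce the SoP column (6, 4, -4, 2), so M1 is selected.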
Implementation Method
• MetamOrphic Malware ExploratioN Technique Using MSA (MOMENTUM) demonstrates the applicability of bioinformatics techniques to metamorphic malware analysis and detection.
• The two phases of MOMENTUM are:
  - Analysis of metamorphism in tools and real malware
  - Signature modelling and testing
MOMENTUM [1/2]
[Flow diagram for metamorphism analysis. Elements: metamorphic families (virus tools and real malware); intra-family pairwise alignment; distance matrix; base file; alignments of two files; "Metamorphic?"; "Obfuscation?"; inter-family pairwise alignment; "Families overlap?"]
MOMENTUM [2/2]
[Diagram of signature modelling and testing. Elements: divide the data set into two parts (training set and testing set); extract single signature; extract group signature; threshold for each signature; testing with single and group signatures; scan logs]
MSA Signature
• An MSA signature (single signature) is the sequence of mnemonics preserved in the alignment.
• A mnemonic that appears in more than 50% of the instances in a row is included in the MSA signature.

M1    M2    M3    M4    M5    MSA Sign  Mt
push  push  -     -     push  push      push
-     -     jump  jump  jump  jump      jump
mov   mov   -     lea   xor             lea
call  call  call  call  call  call      call
push  mov   mov   -     mov   mov       push

• Mi: i-th malware instance; Mt: test sample
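The "more than 50% in a row" rule can be sketched as a majority vote over each row of the alignment. The alignment below is the one from this slide; the function name, the list-of-rows layout, and "-" as the gap symbol are assumptions.

```python
from collections import Counter

GAP = "-"


def msa_signature(alignment_rows, min_fraction=0.5):
    """Keep a row's most frequent mnemonic if it covers > min_fraction of the row."""
    signature = []
    for row in alignment_rows:
        counts = Counter(m for m in row if m != GAP)
        if not counts:
            continue  # a row of gaps contributes nothing
        mnemonic, count = counts.most_common(1)[0]
        if count / len(row) > min_fraction:
            signature.append(mnemonic)
    return signature


# Rows of the alignment; columns are the instances M1..M5.
alignment = [
    ["push", "push", "-", "-", "push"],
    ["-", "-", "jump", "jump", "jump"],
    ["mov", "mov", "-", "lea", "xor"],
    ["call", "call", "call", "call", "call"],
    ["push", "mov", "mov", "-", "mov"],
]

print(msa_signature(alignment))
```

The third row yields no signature entry because its most frequent mnemonic, mov, appears in only 2 of 5 instances.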
Group Signature
• A group signature is built from the single signatures of the subgroups of a family.
• Subgroups are formed using the evolutionary relationship.
• A single signature is extracted for each subgroup, and the single signatures are combined in the form of wildcards.

Sign1  Sign2  Sign3  Sign4  Sign5  Group Sign    Mt
push   push   -      -      push   push          push
jz     jz     jump   jump   jump   jump|jz       jz
mov    mov    -      lea    xor    mov|lea|xor   lea
call   call   call   call   call   call          call
-      mov    mov    -      push   mov|push      push

• Signi: signature of the i-th subgroup in a family; Mt: test sample
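One way to picture the wildcard combination: for each aligned row of subgroup signatures, collect every distinct mnemonic as an alternative. The sketch below reproduces the Group Sign column from this slide. The position-by-position matching of the test sample is a simplifying assumption (the slides do not say how a test sample is scored against a group signature), and the function names are hypothetical.

```python
from collections import Counter

GAP = "-"


def group_signature(signature_rows):
    """For each row, list the distinct mnemonics, most frequent first."""
    group = []
    for row in signature_rows:
        counts = Counter(m for m in row if m != GAP)
        # most_common() keeps first-seen order among equal counts.
        group.append([m for m, _ in counts.most_common()])
    return group


def matches_group(group, sample):
    """Assumed matching rule: each position must hit one of the alternatives."""
    return len(group) == len(sample) and all(
        mnemonic in alternatives for alternatives, mnemonic in zip(group, sample))


# Rows of aligned subgroup signatures; columns are Sign1..Sign5.
rows = [
    ["push", "push", "-", "-", "push"],
    ["jz", "jz", "jump", "jump", "jump"],
    ["mov", "mov", "-", "lea", "xor"],
    ["call", "call", "call", "call", "call"],
    ["-", "mov", "mov", "-", "push"],
]

group = group_signature(rows)
print(["|".join(alternatives) for alternatives in group])
print(matches_group(group, ["push", "jz", "lea", "call", "push"]))
```

Unlike the single signature, the group signature keeps minority mnemonics such as jz and lea, so the test sample Mt from the slide matches at every position.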
Threshold
[Figure: score axis starting at 0 with Bmin, Bmax, Mmin, and Mmax marked in that order; benign scores span Bmin to Bmax, malware scores span Mmin to Mmax, and the threshold lies between Bmax and Mmin]
Where:
  Bmin: benign sample with minimum score
  Bmax: benign sample with maximum score
  Mmin: malware sample with minimum score
  Mmax: malware sample with maximum score
Threshold = (Bmax + Mmin) / 2, with Threshold > Bmax
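The threshold rule is the midpoint between the highest benign score and the lowest malware score. A minimal sketch, assuming the scores of the training samples against a signature are already available as plain lists; the score values and the function name are hypothetical.

```python
def detection_threshold(benign_scores, malware_scores):
    """Threshold = (Bmax + Mmin) / 2, valid only when Mmin > Bmax."""
    b_max = max(benign_scores)
    m_min = min(malware_scores)
    if m_min <= b_max:
        # The midpoint would not exceed Bmax, violating Threshold > Bmax.
        raise ValueError("benign and malware score ranges overlap")
    return (b_max + m_min) / 2


# Hypothetical training scores.
benign = [2, 5, 9]
malware = [15, 21, 30]
print(detection_threshold(benign, malware))
```

The explicit overlap check reflects the condition Threshold > Bmax: when the two score ranges overlap, no midpoint can separate them.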
Dataset [1/2]
Dataset description:

Type       Source                      #Family   #Instances
Synthetic  NGVCK, PSMPC, G2, MPCGEN    46        1051
Real       User Agencies, VxHeavens    52 + 1*   1209
Benign     System32, Cygwin, etc.      1         150

• * consists of unknown viruses (in the test set).
• The dataset is divided equally into a training set and a testing set.
Dataset [2/2]
• All samples are in Portable Executable (PE) format.
• Samples are unpacked using
  - a dynamic unpacker (EtherUnpack [7])
  - a signature-based unpacker (GUnPacker [10])
• Malware families are created from the combined scan results of 14 antivirus products.
• Benign samples are also scanned.
Result for Intra Family
[Figure: average distance (global, local, and Levenshtein) within each synthetic family: NGVCK, PSMPC, G2, MPCGEN. Average distance lies between 0 and 1.]
• Non-zero values indicate the presence of metamorphism in the synthetic data.
• Levenshtein distance is high due to junk code insertion.
• Despite high global distances, local distances are low for most samples, which indicates the presence of similar regions in the code.
Result for Inter Family
[Figure: average distance (global, local, and Levenshtein) between families: NGVCK, PSMPC, G2, MPCGEN, VX Heavens. Average distance lies between 0 and 1.]
• Distance is less than the intra-family distance, which indicates that most of the malware shares some base code.
• Levenshtein distance is higher because of changes in functionality.
Comparative Analysis
Virus type   Replacements/Alignment   Avg. SoD   Obfuscation
NGVCK        47                       1.03       Average, Simple
G2           3                        1.45       Low, Simple
MPCGEN       31                       0.61       Average, Simple
PSMPC        1                        1.35       Low, Weak
Vx-Heavens   122                      8.3        Large, Complex

• Viruses generated using the same tool belong to the same family.
• Families of real malware are distinct.
• In PSMPC, loop and jump instructions contribute to obfuscation, which increases the distance between samples.
• NGVCK viruses overlap with real malware (Savior).
• SoD: sum of the distances of a family from all the other families
Detection Results
[Figure: evaluation metrics (TPR and FPR, on a scale of 0 to 1) for the MSA single signature and the group signature]
• 95.5% of malware is detected with the MSA signature; detection with the group signature is 72.4%.
• 53% of benign samples are falsely detected as malware with the MSA signature, due to the loss of the mnemonics used for mutation in malware.
• The group signature preserves the points of mutation, which are absent in benign samples.
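For reference, TPR and FPR are the fractions of malware and benign samples, respectively, that a signature flags as malware. A minimal sketch, assuming a sample is flagged when its score against the signature exceeds the threshold; the scores below are hypothetical and do not reproduce the reported results.

```python
def detection_rates(benign_scores, malware_scores, threshold):
    """Return (TPR, FPR) when samples scoring above `threshold` are flagged."""
    true_positives = sum(score > threshold for score in malware_scores)
    false_positives = sum(score > threshold for score in benign_scores)
    tpr = true_positives / len(malware_scores)
    fpr = false_positives / len(benign_scores)
    return tpr, fpr


# Hypothetical test-set scores against one signature.
test_malware = [15, 21, 30, 8]
test_benign = [2, 5, 9, 14]
print(detection_rates(test_benign, test_malware, threshold=12))
```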
MOMENTUM with Antiviruses
[Figure: detection rate (%) of MOMENTUM compared with antivirus products]
• MOMENTUM (group signature) is found to be comparable to the best antivirus products.
• Of the 35 malware samples left undetected by the antivirus products, MOMENTUM could detect 20.
Scope for Improvement
• Instead of using the same mismatch score everywhere, compute a weighted score for each pair of mnemonics based on the frequency of mismatches.
• In the alignment, the operand part can be considered to verify the actual changes (replacement or gap).
• This can reveal how the morpher preserves functionality.
List of Publications
[1] Vinod P., V. Laxmi, M. S. Gaur, Grijesh Chauhan, "Detecting Malicious Files using Non-Signature based Methods", (to appear) Oxford Computer Journal.
[2] Vinod P., V. Laxmi, M. S. Gaur, Grijesh Chauhan, "Malware Detection using Non-Signature based Method", In Proceedings of IEEE International Conference on Network Communication and Computer (ICNCC 2011), pp. 427-43, DOI: 978-1-4244-9551-1/11.
References
[1] E. Karim, A. Walenstein, A. Lakhotia, "Malware Phylogeny using Permutation of code", In Proceedings of EICAR 2005, pp. 167-174.
[2] M. R. Chouchane and A. Lakhotia, "Using engine signature to detect metamorphic malware", In Proceedings of the 4th ACM Workshop on Recurring Malcode, WORM '06, 2006, pp. 73-78.
[3] Mona Singh, "Multiple Sequence Alignment", Lecture Notes: www.cs.princeton.edu/~mona/Lecture/msa1.pdf (last viewed on 14-6-2011).
[4] Mona Singh, "Phylogenetics", Lecture Notes: www.cs.princeton.edu/~mona/Lecture/msa1.pdf (last viewed on 14-6-2011).
[5] T. Smith and M. Waterman, "Identification of Common Molecular Subsequences", Journal of Molecular Biology, pp. 195-197, 1981.
[6] Mark Stamp, Wing Wong, "Hunting for metamorphic engines", Journal in Computer Virology, 2(3):211-229.
[7] Ether for Malware Unpacking: http://ether.gtisc.gatech.edu/malware.html (last viewed on 14-6-2011).
[8] Jian Li, Jun Xu, Ming Xu, HengiLi Zhao, Ning Zheng, "Malware Obfuscation Measuring via Evolutionary Similarity", In Proceedings of IEEE Int. Conference on Future Information Network 2009.
[9] Larkin MA et al., "Clustal W and Clustal X version 2.0", Bioinformatics, 23, 2947-2948, 2007.
[10] GUnPacker: http://www.woodmann.com/collaborative/tools/index.php/GUnPacker (last viewed on 14-6-2011).
Thanks!