Ngram and Signature Based Malware Detection in Android Platform COMPGS98 Research Project in Software Engineering Munir Geden Supervisor: Dr. Jens Krinke



Page 1: Ngram and Signature Based Malware Detection in Android Platform

Ngram and Signature Based Malware Detection in Android Platform

COMPGS98 Research Project in Software Engineering

Munir Geden Supervisor: Dr. Jens Krinke

Page 2: Ngram and Signature Based Malware Detection in Android Platform

A new malware instance every 14 seconds!

Page 3: Ngram and Signature Based Malware Detection in Android Platform

Problem Statement

We have investigated the usage of raw n-gram bytes (RQ1) and other meaningful signatures (RQ2), extracted from different types of reverse engineering products (RQ3) of apk files, as classifier features to detect malicious Android apps by using different feature selection methodologies (RQ4). Additionally, we have carried out the same investigations for unseen malware families (RQ5).


Page 4: Ngram and Signature Based Malware Detection in Android Platform

[1] Davies, Julius, et al. "Software bertillonage: finding the provenance of an entity." Proceedings of the 8th Working Conference on Mining Software Repositories. ACM, 2011.

Figure 1: Experimental stage — apk files are unpacked (unzip, apktool, dex2jar) into classes.dex files, *.smali files, *.class files, AndroidManifest.xml files and MANIFEST.MF files. These yield n-gram features and signature features: field, method and class fingerprints, opcodes and constant pool strings via BCEL; permissions, intent actions and hardware components via an XML parser; and SHA-1 digests via file checksums.

features in our experiments.

Smali Files: The second type of reverse engineering product that we have analyzed is the smali files, which are the human-readable assembly code representation of the dex classes based on the Jasmin Java assembly language [1]. By using the apktool library, we have generated the corresponding smali files of each class to use under the ngram model.

Class Files: Exploiting the dex2jar library, which can reversely translate register-based Dalvik classes to stack-based Java classes, we have generated JVM byte-code class files. In addition to using the raw bytes of these files as ngram features, we have also extracted other features from the refined parts of the byte-code structure by using byte-code engineering libraries.

Bertillonage Files: With the help of the Apache BCEL library, which is a Java byte-code engineering library, we have generated corresponding Bertillonage [8] signature files for each class file. A formulation of an anchored Bertillonage class signature θ(C) for a class that does not have any inner classes can be seen in Equation 1:

θ(C) = ⟨φ(C), ⟨φ(M1), …, φ(Mn)⟩, ⟨φ(F1), …, φ(Fn)⟩⟩   (1)

While C represents the given class, Mn represents the methods, Fn represents the fields included in that class, and the φ function represents the fingerprints of these entities.

public class Dog extends Animal implements IFeed {

    private int age;

    public void bark(int n) {
        // code
    }

    public boolean isHungry() {
        // code
    }
}

φ(C) = "public class Dog extends Animal implements IFeed"
φ(F1) = "private int age"
φ(M1) = "public void bark(int n)"
φ(M2) = "public boolean isHungry()"

Anchored Class Signature: θ(C) = ⟨φ(C), ⟨φ(M1), φ(M2)⟩, ⟨φ(F1)⟩⟩
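The anchored signature above can be sketched as a simple tuple construction. This is an illustrative sketch only, not the authors' BCEL-based implementation: it assumes the entity fingerprints have already been extracted as declaration strings, and the function name is made up.

```python
# Illustrative sketch: build the anchored Bertillonage class signature of
# Equation 1 from entity fingerprints given as declaration strings.
# (The paper extracts these with Apache BCEL; this helper is hypothetical.)
def anchored_signature(phi_class, phi_methods, phi_fields):
    # theta(C) = <phi(C), <phi(M1), ..., phi(Mn)>, <phi(F1), ..., phi(Fn)>>
    return (phi_class, tuple(phi_methods), tuple(phi_fields))

theta = anchored_signature(
    "public class Dog extends Animal implements IFeed",
    ["public void bark(int n)", "public boolean isHungry()"],
    ["private int age"],
)
```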

In order to understand the fingerprint definitions of these entities, the code above provides a simplified example. As can be seen, the Bertillonage files that we have created include only the information of package names, inheritance and interface classes, method parameters, generics, modifiers and return types; but they do not contain any details about method scopes. We have not only extracted ngram bytes from the generated Bertillonage files, but also analyzed code entity fingerprints as features to understand infectious classes, methods and fields.

Opcode Sequences: Another reverse engineering output type that we have refined to use under the ngram model is the opcode sequences of Java byte-code classes. We have generated stripped opcode sequences by eliminating the operand references of the opcode instructions.

Constant Pool Strings: To understand which parts of the JVM byte-code structure are more valuable and informative for malware detection, we have also generated text files for the strings of the constant pools of Java classes.

AndroidManifest.xml: The Android manifest file contains valuable information about an app, such as the permissions and intent actions to be used. Normally, the AndroidManifest.xml file in an apk package is a binary XML file. To gain insights about the features to be extracted, we have first converted these files to human-readable XML files by using apktool.

META-INF/MANIFEST.MF: MANIFEST.MF is a meta file that contains the paths and SHA-1 digest values of all files in the apk package. We have used the SHA-1 digest values to determine to what extent the checksums of multimedia files in the resource and assets folders, and of other files, can be used to detect malicious apps.
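Pulling (path, digest) pairs out of a MANIFEST.MF body can be sketched as below. This is a hedged sketch of the jar-manifest format: entries are "Name:"/"SHA1-Digest:" pairs, and long lines are wrapped with a leading space (continuation lines). The sample text and its digest values are made up.

```python
# Hedged sketch: parse (path -> SHA-1 digest) pairs from a META-INF/MANIFEST.MF
# body. Continuation lines (leading space) are folded back into the previous line.
def parse_manifest(text):
    unfolded, digests, name = [], {}, None
    for line in text.splitlines():
        if line.startswith(" ") and unfolded:
            unfolded[-1] += line[1:]          # fold continuation line
        else:
            unfolded.append(line)
    for line in unfolded:
        if line.startswith("Name: "):
            name = line[len("Name: "):]
        elif line.startswith("SHA1-Digest: ") and name is not None:
            digests[name] = line[len("SHA1-Digest: "):]
    return digests

# Made-up sample manifest (digest values are not real):
sample = """Manifest-Version: 1.0

Name: classes.dex
SHA1-Digest: 5O5PWcm3E3Says/hDJRnRrDMUd0=

Name: AndroidManifest.xml
SHA1-Digest: kFfjMS3iSHSLXWnnfIUU5M1SU1I=
"""
entries = parse_manifest(sample)
```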

3.1.4 Feature Extraction

After generating the different reverse engineering output files mentioned above, we have categorized the extracted features under two models. In our first model, called the ngram model, features are extracted as cascading byte sequences from the files, where the sequence length is n = 3, n = 4 or n = 5. These byte sequences are not human readable or understandable if they are extracted from byte-code files such as dex or class files, or from the files containing opcode sequences. On the other hand, since these byte sequences can also represent ASCII characters, the extracted ngrams allow deeper analysis if they belong to text files such as smali files, Bertillonage signature files or the constant pool strings.

In our second model, the signatures model, features are represented as meaningful strings. We have used Bertillonage fingerprints such as class, method and field definitions. Also, we have extracted permissions, intent messages and SHA-1 digests as features under the signatures model.
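The ngram model described above amounts to a sliding byte window over each file. A minimal sketch, with an illustrative function name and sample data:

```python
# Minimal sketch of the ngram model: the set of all length-n byte windows of a
# file's raw bytes. Sample input is illustrative (a smali-like text fragment).
def byte_ngrams(data: bytes, n: int = 4):
    return {data[i:i + n] for i in range(len(data) - n + 1)}

grams = byte_ngrams(b"const-string v0", n=4)
```

For byte-code files these windows are opaque byte strings; for text files (smali, Bertillonage, constant pool strings) they are readable ASCII fragments, which is what allows the deeper analysis mentioned above.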

3.1.5 Feature Selection

Another configuration parameter of our experimental stage is the methodology used to reduce the number of features for classification. As a feature selection methodology, we have started by experimenting with information gain scoring. Then, to prioritize more distinguishing features over common but less distinctive ones, we have developed a novel feature selection methodology based on the angular distances of feature probabilities. Lastly, we have made some sorting modifications to both scoring techniques in order to overcome the unbalanced class representation problem of the selected features.

Information Gain: The information gain method, also called average mutual information by Yang et al. [27], is first

Background Information

n-grams
• Continuous sub-sequences of the given sequence
• Extracted as file bytes
• Can represent different things depending on the file types

Bertillonage Signatures [1]

Page 5: Ngram and Signature Based Malware Detection in Android Platform

Dataset Collection

Malware Set:
• Requested from the authors of the Drebin paper [2]
• 5560 malicious apk files from different families
• Collected between August 2010 – October 2012

Benign Set:
• 5560 apk files downloaded from Google Play
• Checked by VirusTotal API
• Ensures app category diversity
• Same size fragmentation
• Same date interval (August 2010 – October 2012)

[2] Arp, Daniel, et al. "Drebin: Effective and explainable detection of android malware in your pocket." Proceedings of the Annual Symposium on Network and Distributed System Security (NDSS). 2014.


Page 6: Ngram and Signature Based Malware Detection in Android Platform

Dataset Portions

Excessive Number of Experiments
• 19 feature models (ngrams, signatures, file types)
• 4 selection methodologies
• 9 different numbers of selected features (features used in classifiers)
• 4 classifier algorithms
• 2736 combinations

Default Portion
• Randomly selected 3600 apk files (1800 benign / 1800 malicious) from the whole set
• Split with a 66%–33% ratio (2400 training / 1200 validation)

Unseen Families Portion
• Same portion size and split ratio
• 60/179 families reserved for validation, the remaining ones for training


Page 7: Ngram and Signature Based Malware Detection in Android Platform

Experiment Stage


Page 8: Ngram and Signature Based Malware Detection in Android Platform

Feature Models

Feature Model          Unique Features
Dex: 3-grams                 7,025,126
Dex: 4-grams                48,037,543
Class: 3-grams               2,259,309
Class: 4-grams              11,412,157
Smali: 3-grams                 370,346
Smali: 4-grams               1,588,163
Smali: 5-grams               3,564,058
Xml: 3-grams                    33,261
Xml: 4-grams                    77,670
Xml: 5-grams                   114,961
Bertillonage: 3-grams          191,634
Bertillonage: 4-grams          666,718
Opcodes: 4-grams               217,745
C.Pool: 4-grams                266,926
Permissions, Intents             1,272
Class Signatures               173,316
Method Signatures              449,599
Field Signatures               639,683
SHA1-Digests                   153,433

Page 9: Ngram and Signature Based Malware Detection in Android Platform

[3] Yang, Yiming, and Jan O. Pedersen. "A comparative study on feature selection in text categorization." ICML. Vol. 97. 1997.

adapted from text categorization to the malware detection domain by Kolter et al. [12] to rank the ngram features. The information gain score of a feature f is calculated by using Equation 2:

IG(f) = Σ_{X ∈ {f, f̄}} Σ_{Ci} P(X, Ci) log [ P(X, Ci) / (P(X) P(Ci)) ]   (2)

The term Ci represents the class of apps (benign/malicious), while the terms f and f̄ stand for the existence or absence of the given feature. Thus, P(X, Ci) can be expressed as the proportion of existence or absence of the feature f for the given class Ci, while P(X) represents the proportion of apps in the whole set that contain or do not contain feature f, and P(Ci) is the proportion of apps that belong to the given class Ci in the whole set.
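Equation 2 can be computed directly from document counts. A hedged sketch with illustrative names follows; the log base is arbitrary since it does not change the ranking.

```python
# Hedged sketch of Eq. 2: information gain of a binary feature from counts.
# n_b / n_m: benign / malicious apps containing the feature;
# N_b / N_m: class sizes. Names are illustrative.
from math import log

def information_gain(n_b, n_m, N_b, N_m):
    N = N_b + N_m
    ig = 0.0
    # X ranges over {feature present, feature absent}, Ci over {benign, malicious}
    for x_b, x_m in ((n_b, n_m), (N_b - n_b, N_m - n_m)):
        p_x = (x_b + x_m) / N
        for joint, p_c in ((x_b, N_b / N), (x_m, N_m / N)):
            p_xc = joint / N
            if p_xc > 0:
                ig += p_xc * log(p_xc / (p_x * p_c))
    return ig
```

A feature skewed toward one class scores higher than one with equal class likelihoods, whose score is exactly zero.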

Normalized Angular Distance: A new feature selection method that we propose is based on the angular distances of features in a probability vector space, where each axis represents the class-conditional probability of a feature, expressed as P(f|Ci) and defined as the proportion of the apps containing the feature f for the given class Ci.

Our selection method is essentially based on the presupposition that features with equal class likelihoods are neither informative nor distinctive. Regardless of being common or sparse in the whole collection, if a feature's class-conditional probabilities are the same or very close to each other, this feature should not be included in the representative set of any class. By drawing a diagonal vector for the equal class likelihoods of absolutely non-informative features, as seen in Figure 2, we suppose that the distinguishing level of a feature can be scored according to the angular distance (θ) between this diagonal reference vector and the feature probability vector v(f).

Figure 2: Feature representation in class probability vector space (axes P(f|CB) and P(f|CM), showing the diagonal vector, the feature vector v(f), the perpendicular distance d and the angle θ)

However, since this angle is determined by the ratio of the class-conditional probabilities, very sparse features can have the same angular distance as frequent ones; e.g. a feature with (0.005, 0.001) class probabilities will have the same angular distance as a feature with (0.5, 0.1) probabilities. To get rid of the noisy effect of sparse features and to give more weight to the frequent ones, we use the perpendicular distance (d) between the feature coordinates and the diagonal vector line as a normalization term. By selecting the distance d as the normalization term instead of just using the feature vector length, we have intended to restrict the degree of normalization in a way that the importance of the angle remains irrevocable, since d is a term which is also dependent on the angle (d = sin θ × |v|). The formulation of our approach can be seen below:

NAD(f) = d^(1/k) × θ   (3)

Here d represents the perpendicular distance between the feature coordinates and the diagonal line, and θ represents the angular distance between the feature vector and the diagonal vector in radians. We have also added the parameter k to tune the degree of normalization. As the value of k increases, the importance of the angular distance rises as well, which increases the weights of sparse features with greater angular distances; conversely, as the value of k decreases, it favors common features more, while the angular distance becomes less important. During our experiments, we have picked this parameter as k = 2 and experimented with only this value due to the time constraints of the study. However, values within the range [0.5, 3] could be used in order to adjust the selections according to the type of problem and the feature distribution. Table 10 (in the Appendix) demonstrates how this parameter can be used to tune the selection priorities. Additionally, since θ ≈ sin θ holds within the range [0, π/4], interchangeable usage of these terms does not affect the selection distributions considerably.

In our two-class problem, since the diagonal line equation

would be ax + by + c = 0, where the terms a = 1, b = −1, c = 0 are obtained, and the class-conditional probabilities of a feature for the benign P(f|CB) and malicious P(f|CM) classes represent the (x, y) values of the feature coordinates, the distance term d and the angle θ can be calculated as follows:

d = |P(f|CB) − P(f|CM)| / √2   (4)

θ = sin⁻¹( d / √( P(f|CB)² + P(f|CM)² ) )   (5)
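Equations 3–5 reduce to a few lines of geometry. A hedged sketch, with illustrative names and k = 2 as in the paper's experiments:

```python
# Hedged sketch of Eqs. 3-5: NAD score from the two class-conditional
# probabilities of a feature. Names are illustrative; k = 2 as in the paper.
from math import asin, sqrt

def nad(p_benign, p_malicious, k=2.0):
    d = abs(p_benign - p_malicious) / sqrt(2)        # Eq. 4: distance to diagonal
    length = sqrt(p_benign ** 2 + p_malicious ** 2)  # |v(f)|
    if length == 0:
        return 0.0                                   # feature absent everywhere
    theta = asin(d / length)                         # Eq. 5: angle to diagonal
    return d ** (1.0 / k) * theta                    # Eq. 3
```

A feature on the diagonal (equal class likelihoods) scores zero, while a sparse but perfectly one-sided feature can outrank a frequent but less distinctive one.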

Our aim in using normalized angular distance instead of information gain is that, as can be seen from the feature rankings of both techniques in Table 4, information gain can give more priority to a feature (600, 460) even when it is less distinctive but more common, whereas a more distinctive feature (40, 0) can be scored worse if it is comparatively sparse. Our technique tries to balance this trade-off in a different way, giving more importance to the distinctive features by taking the angular distance as a metric of distinctiveness. On the other hand, since the normalized angular distance does not favor frequent features as much as information gain does, the items of a feature set selected by our approach are less likely to be found in a given app, which can result in information gain achieving a better representation with fewer items. To understand the difference between the two approaches in a more comprehensible way, Figure 3 presents the distributions of the selections made by both techniques for a fraction of 20%.

Figure 3: Selection distributions of information gain and normalized angular distance (two scatter plots over the P(f|CB)–P(f|CM) feature space, marking selected vs. unselected features for IG and for NAD with k = 2, at 20% selected)


Table 4: Ranking comparison of feature selection methodologies (|A| = 700 and |B| = 700)

Rank   IG  FreqA  FreqB    NAD  FreqA  FreqB
  1     A    600    460     C      80     10
  2     B    300    150     D      40      0
  3     C     80     10     E     160     60
  4     D     40      0     B     300    150
  5     E    160     60     H      20      3
  6     G     40     13     F      10      0
  7     H     20      3     G      40     13
  8     F     10      0     A     600    460
  9     I      2      0     I       2      0
 10     J      5      2     J       5      2
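The two rankings in Table 4 can be recomputed from the listed frequencies. The sketch below is illustrative (helper names are made up); with natural logarithms — the log base does not affect the IG ordering — and k = 2, it reproduces both orderings.

```python
# Illustrative re-ranking of Table 4's features by IG and by NAD (k = 2),
# from the class frequencies FreqA/FreqB with |A| = |B| = 700 apps per class.
from math import asin, log, sqrt

N = 700  # apps per class

def ig(a, b):
    score, total = 0.0, 2 * N
    for x_a, x_b in ((a, b), (N - a, N - b)):   # feature present / absent
        p_x = (x_a + x_b) / total
        for joint in (x_a, x_b):                # class A / class B
            p_xc = joint / total
            if p_xc > 0:
                score += p_xc * log(p_xc / (p_x * 0.5))
    return score

def nad(a, b, k=2.0):
    pa, pb = a / N, b / N
    d = abs(pa - pb) / sqrt(2)
    length = sqrt(pa ** 2 + pb ** 2)
    return 0.0 if length == 0 else d ** (1.0 / k) * asin(d / length)

freqs = {"A": (600, 460), "B": (300, 150), "C": (80, 10), "D": (40, 0),
         "E": (160, 60), "F": (10, 0), "G": (40, 13), "H": (20, 3),
         "I": (2, 0), "J": (5, 2)}
by_ig = sorted(freqs, key=lambda f: ig(*freqs[f]), reverse=True)
by_nad = sorted(freqs, key=lambda f: nad(*freqs[f]), reverse=True)
```

Note how NAD promotes the one-sided feature D (40, 0) to second place while demoting the frequent but less distinctive A (600, 460) to eighth.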

The sub-figures represent the feature space of possible probability coordinates, where the selected features are represented by darker points. Since information gain gives equal importance to the existence of a feature in a class and the absence of that feature in the counter-class, as can be seen from the fraction border points of the information gain distribution, a more frequent but less distinctive feature (0.44, 0.96) can be selected, whereas a feature (0.43, 0) which is more distinctive but less common may not be selected, while the latter type of feature has more priority under normalized angular distance.

Figure 4: Probability distributions of dex 4-gramsand XML 5-grams (logarithmic scale)

Class-wise Selections: Due to the unbalanced probability distributions of most of the feature models, examples of which (ngrams extracted from classes.dex and AndroidManifest.xml files) can be seen in Figure 4, the feature sets selected by information gain or by our approach can be dominated by features that belong to one of the classes, which is mostly the malicious one.

Reddy et al. [17] have proposed a class-wise selection approach to solve this issue by creating two separate lists of features sorted according to their document frequencies in each class. They then merge these two lists, eliminating duplicates, to create a final feature set that can represent both classes. Even though this approach is better at selecting features that represent both the benign and malicious classes, it is still vulnerable to selecting non-informative features that are equally frequent in both classes.

To solve this issue, we propose a more practical solution for our study, similar to Reddy et al.'s class-wise approach. Like their solution, we also constitute our final list from the ranked lists of both classes. However, our class lists are still ordered according to descending information gain or normalized angular distance scores. In other words, one of the lists consists of benign representative features, where the existence probability of a feature is higher for the benign class than for the malware class; the other list is made up of malware representative features, where the existence probability is higher for the malware class than for the benign class. To clarify, the class-wise scoring formulations for both classes can be seen in Equations 6 to 9:

CWIGB(f) = IG(f), if P(f|CB) > P(f|CM); 0, otherwise   (6)

CWIGM(f) = IG(f), if P(f|CM) > P(f|CB); 0, otherwise   (7)

CWNADB(f) = NAD(f), if P(f|CB) > P(f|CM); 0, otherwise   (8)

CWNADM(f) = NAD(f), if P(f|CM) > P(f|CB); 0, otherwise   (9)

3.1.6 Classifications

After creating ranked lists for the different feature types extracted from the reverse engineering products of the training apps, using the selection methodologies mentioned above, we have created binary feature vectors for each app that represent the absence or existence of the selected features in the given app. We have created these vectors with varying lengths (100, 250, 500, 750, 1000, 2000, 3000, 5000, 8000) to find the best-performing number of features. Then, using these binary vectors, we have generated separate data files (*.arff) for the training and validation sets to perform our classifications with the algorithms below, provided by the Weka library.

Naive Bayes (NB): We have used the Naive Bayes classifier, a probabilistic classifier based on Bayes' theorem and a very efficient solution for high numbers of features. It classifies unknown instances by using the posterior probabilities of each class calculated from known instances. This algorithm relies on the assumption that the features are independent of each other.

Instance Based Learner (IBk): A lazy learning algorithm also known as k-nearest neighbors is another algorithm we have exploited. Instead of modeling an explicit generalization, this technique first finds the k training instances most similar to the given unknown instance, then classifies the unknown instance according to the majority class label of those similar training instances. During our experiments we have set the value as k = 5, which has yielded better results for the majority of configurations.

Support Vector Machines (SVM): We have also made classifications with Support Vector Machines, which work as a linear classifier after implicitly mapping the features to a high-dimensional space by kernel functions. We have run our experiments with the sequential minimal optimization (SMO) function of Weka and the default poly kernel settings.

Random Forest (RF): Random Forest, an ensemble learning technique, works as a collection of many decision trees. It relies on the idea of generating multiple decision trees from randomly selected subsets of a training set to contribute to the final decisions with controlled variance. We have used this technique with its default settings in Weka, where the number of trees is 100.
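The vectorization step that feeds these classifiers can be sketched as follows. This is an illustrative sketch only: the names and toy data are made up, and the authors emit Weka .arff files rather than Python lists.

```python
# Hedged sketch of the vectorization step: one fixed-length binary vector per
# app over the top-ranked features. Names and toy data are illustrative.
def to_binary_vector(app_features, ranked_features, length):
    selected = ranked_features[:length]
    return [1 if f in app_features else 0 for f in selected]

ranked = ["SEND_SMS", "READ_PHONE_STATE", "INTERNET", "VIBRATE"]
app = {"READ_PHONE_STATE", "INTERNET"}
vec = to_binary_vector(app, ranked, length=4)  # -> [0, 1, 1, 0]
```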


Information Gain (IG) [3]

Normalized Angular Distance (NAD)

Classwise Selections (CWIG, CWNAD)

Feature Selection

c = (0.005, 0.001) d = (0.5, 0.1) 𝜃(c) = 𝜃(d)

a = (0.96, 0.44), b = (0.43, 0.00), IG(a) > IG(b)

Page 10: Ngram and Signature Based Malware Detection in Android Platform

Classifier Parameters

Number of Features
• Binary feature vectors are created for each app (0,1,0,0…1,0,1,Benign)
• Vector represents the existence and absence of selected features
• Experiments for 100, 250, 500, 750, 1000, 2000, 3000, 5000, 8000 features

Classifier Algorithms
• Random Forest (RF) with 100 trees
• Support Vector Machines (SVM) with SMO poly kernel
• Instance Based Learner (IBk) with k=5
• Naïve Bayes (NB)


Page 11: Ngram and Signature Based Malware Detection in Android Platform

Highlighting Results
• More accurate results as the number of features increases (excluding NB)
• Random Forest (RF) outperformed other classifiers
• Followed by Support Vector Machines (SVM) and Instance Based Learner (IBk, k=5)
• Naive Bayes (NB) is far from being competitive


Page 12: Ngram and Signature Based Malware Detection in Android Platform

Highlighting Results
• NAD competes with IG and outperforms it for specific configurations
• IG performed slightly better on average and is advantageous for smaller numbers of features
• CW selections performed better against unseen malware families


Page 13: Ngram and Signature Based Malware Detection in Android Platform

Ngrams
• Performed better than the signatures model
• Redundancy is the biggest problem

Dex and Class Files
• Dex bytes yielded more accurate results
• Scalability advantage of class bytes

Smali Files
• Similar results to class files
• Since file bytes represent ASCII chars, they allow deeper analysis

AndroidManifest.xml
• Competes with dex bytes
• Great scalability advantage over dex bytes

Refined Structures
• Standalone opcode sequences and constant pool ngrams performed similarly
• Bertillonage ngrams without method scopes produced promising results


Page 14: Ngram and Signature Based Malware Detection in Android Platform

Signatures
• Insights about infectious elements
• Redundancy and scalability advantages

Bertillonage Fingerprints
• Methods and fields yielded more accurate results
• Class fingerprints identify infectious packages

Permission-based Signatures
• No such big difference compared to ngrams of AndroidManifest.xml

SHA-1 Digests
• Sparse but class-exclusive features
• Surprising results with a 92.28% ratio

classes and packages. All in all, for RQ3, all of these signature types can be used to detect malicious apps with varying accuracy levels, though the permission-based signatures yield the most promising results. The fingerprint signatures, however, are comparatively more helpful for addressing and locating infectious code entities, even though they cannot reach the same accuracy levels.

4.2.1 Infectious Signatures

Features extracted under the signature model have provided us with valuable insights about the classes, methods and fields infecting the apps, while the Android manifest file puts forward starring permissions, and the SHA-1 digests refer to the malicious files; small portions of the ranked methods, permissions and file digests can be seen in Table 6.

As the key findings from the ranked classes suggest, more than 11% of the malicious apps are infected by the Apperhand SDK package (com.apperhand.*), while another 10% suffer from an aggressive adware library (com.airpush.android.*). Also, with a 10% encounter rate, many malware developers use the JSON tools provided by the org.codehaus.jackson library, whereas usage of this library in the benign class is less than 1%. Another remarkable point that we have noticed from the features selected by the class-wise selection techniques is that usage of the com.google.ad.* package is accepted as a strong sign of being benign.

Table 6: Highlighting signatures

Rank  Methods ranked by NAD                                        Ben.   Mal.
   2  public static String a(Object arg0)                          0.4%  16.6%
  23  public int getVersionSDKInt()                                0.0%  11.8%
  48  public String getSourceIp()                                  0.0%  11.8%
 581  public void setDeviceId(String arg1)                         1.1%  13.3%
 862  protected transient String doInBackground(String[] arg1)    3.4%  19.3%
2203  private void sendSms()                                       0.2%   9.3%
3830  public String getDeviceId()                                  3.3%  13.6%
3848  public static String getTelephoneNumber()                    0.0%   4.9%
4912  public static String getImei(android.content.Context arg0)   0.1%   4.7%
5167  public static void clearMemory()                             0.0%   4.2%

Rank  Permissions ranked by IG                                     Ben.   Mal.
   1  android.permission.READ_PHONE_STATE                         17.3%  89.7%
   2  android.permission.SEND_SMS                                  1.6%  53.5%
   5  android.permission.READ_SMS                                  0.8%  36.6%
   7  android.permission.ACCESS_WIFI_STATE                         6.5%  43.8%
  15  android.permission.INSTALL_PACKAGES                          0.3%  16.7%
  20  android.permission.READ_CONTACTS                             4.4%  24.3%
  39  android.intent.action.PHONE_STATE                            1.2%  11.4%
  45  android.permission.READ_LOGS                                 1.0%   8.8%
  50  android.appwidget.action.APPWIDGET_UPDATE                   10.5%   2.7%
 100  android.permission.ACCESS_GPS                                0.3%   1.8%

Rank  SHA-1 digests ranked by NAD                                  Ben.   Mal.
   1  org/codehaus/jackson/impl/VERSION.txt                        0.0%  10.3%
   4  assets/t1.png                                                0.0%   9.9%
  13  lib/armeabi/libnative.so                                     0.0%   9.0%
  18  lib/armeabi/libandroidterm.so                                0.0%   5.1%
  70  assets/ad_480.html                                           0.0%   4.0%
  76  res/drawable/btn_green_pressed.9.png                         0.0%   3.7%
 192  AndroidManifest.xml                                          0.0%   2.7%
 255  assets/bugsense-trace.jar                                    0.0%   2.5%
 272  classes.dex                                                  0.0%   2.3%
1002  res/drawable-hdpi/en_smsmyway_rate.png                       0.0%   0.8%

Apart from the class signatures, we have observed that the ranked method signatures mostly start with get or set prefixes and try to access critical information such as the IP, IMEI, device ID, SDK version, SMS or phone number, while there is also a large portion of meaningless method names, which can be a strong sign of obfuscation. The insights that can be gained from the field names are similar to those of the method signatures. Although the field and method signatures have yielded better results, the frequency fragmentation of these features indicates that the majority of them originate from the same packages mentioned above. Therefore, this situation suggests a redundancy problem for these features.

Besides, the permission-based features extracted from the AndroidManifest.xml file indicate that while requests for specific permissions, such as access to contacts, logs, SMS messages, Wi-Fi and phone states or other critical settings, can be strong indicators of malicious behavior, defining benign features is comparatively harder, due to the feature distributions (see Figure 12 in the Appendix) being dominated by the malicious class. For instance, the existence of only five benign representative features among the hundred top-ranked ones illustrates this distribution of class features. One advantage of using AndroidManifest.xml signatures is that it is possible to send all the extracted features to the classifiers, due to the limited number of features. One last benefit is that there is no redundancy among the parsed features, in contrast to the redundancy in the ngram bytes of AndroidManifest.xml.

ments, SHA-1 file digests, has provided us sparse but moreexclusive malicious features. These digests mostly point outresource and image files Ø commonly used among maliciousapps, also they can catch some native C libraries ∞ such aslibnative.so and libandroidterm.so files. Another interestingfinding of SHA-1 digests about the Drebin set is that 28 outof 1200 malicious apk files have the same exact classes.dexfile ≤, while 32 apk files contain the same AndroidMani-fest.xml file ± despite di↵erent checksums of these apk files.This is probably caused by the simple trick of repackag-ing with redundant files to get over anti-virus mechanismschecking only the checksum values of apk files.
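Since an apk file is an ordinary zip archive, the per-entry digests behind these observations can be computed directly. A minimal sketch, hashing every entry with Python's standard zipfile and hashlib modules (entry names and layout are whatever the archive contains):

```python
import hashlib
import zipfile

def entry_digests(apk_path):
    """Map each entry inside an apk (a zip archive) to the SHA-1 hex digest of its bytes."""
    digests = {}
    with zipfile.ZipFile(apk_path) as apk:
        for name in apk.namelist():
            digests[name] = hashlib.sha1(apk.read(name)).hexdigest()
    return digests
```

Apps sharing an identical classes.dex or AndroidManifest.xml then show up as identical digest values even when the apk files themselves have different checksums.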

4.3 Selection Methodologies

Since the results yielded by the different selection methodologies are very close to each other, and other experiment parameters also vary the performance of the selection methodologies, it is hard to determine an absolute winner for RQ4 that will be valid for all configurations.

Although our selection method, normalized angular distance, has shown that it can compete with information gain scoring for the chosen normalization degree (k = 2) and outperform it for some specific configurations, we conclude that information gain is the better option for more accurate results, especially with fewer features: information gain favors more common features, which increases the accuracy for smaller feature sets.

On the other hand, the class-wise selections of these two scoring techniques have not improved the accuracy results of the classifiers, except for the NB classifier and the experiments performed with unseen malware families. Average accuracy results of all the feature models and the significance test results performed on the feature model pairs for specific classifier configurations are given in Tables 7 and 8; a broader range and number of configurations can be seen in the Appendix.

Table 7: Average accuracy results of the feature models based on selection methodologies

Config.    Dataset       CW-IG    IG       NAD      CW-NAD
RF-8000    default       94.33%   94.50%   94.41%   94.09%
IBk-5000   default       89.86%   90.93%   90.45%   89.92%
SVM-2000   default       90.74%   91.45%   91.13%   90.47%
NB-500     default       81.16%   74.36%   71.91%   79.82%
SVM-5000   unseen fam.   86.02%   85.40%   84.18%   85.16%
RF-1000    unseen fam.   85.19%   83.19%   81.55%   82.86%
IBk-500    unseen fam.   81.99%   79.60%   78.80%   79.45%
NB-250     unseen fam.   77.03%   72.85%   73.35%   76.27%


Comparison and Future Work

• 98.33% accuracy for the combination of dex 4-grams and XML 5-grams
• Outperformed the Drebin results with a smaller portion
• Experiment with the whole dataset
• Attack the redundancy problem of ngrams
• Tune the normalization degree of angular distance and adapt it for multi-class problems


Table 8: Significance test results of selection methodologies (α = 0.05)

Config.    Dataset       (IG = NAD)      (Naive = Classwise)
RF-8000    default       do not reject   reject
IBk-5000   default       do not reject   reject
SVM-2000   default       do not reject   reject
NB-500     default       reject          reject
SVM-5000   unseen fam.   do not reject   do not reject
RF-1000    unseen fam.   do not reject   do not reject
IBk-500    unseen fam.   do not reject   reject
NB-250     unseen fam.   do not reject   reject

4.4 Unseen Malware Families

In the experiments that we have performed with the validation set consisting of unseen malware families, the accuracy results have dropped significantly, with varying degrees for different feature models and classifier configurations; the drop ratios for the best performing configuration of the default dataset (RF-8000) can be seen in Figure 7. During these experiments, we have observed that the dex ngram models are still the most accurate output types, followed by the ngram bytes of the smali and XML files, which have outperformed the class bytes that showed the worst resilience. Moreover, for unseen malware families, the permission-based signatures of AndroidManifest.xml have yielded slightly better or similar results compared to the ngram bytes of the file.
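The permission-based signatures referred to here can be parsed from an apktool-decoded AndroidManifest.xml. A minimal sketch follows (assuming the manifest has already been decoded to plain XML; the `perm::`/`action::` feature-name prefixes are our own labels, not the paper's):

```python
import xml.etree.ElementTree as ET

# Attribute names in the manifest live in the standard Android XML namespace.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def manifest_features(xml_text):
    """Collect permission and intent-action names from a decoded AndroidManifest.xml."""
    root = ET.fromstring(xml_text)
    features = set()
    for elem in root.iter("uses-permission"):
        name = elem.get(ANDROID_NS + "name")
        if name:
            features.add("perm::" + name)
    for elem in root.iter("action"):
        name = elem.get(ANDROID_NS + "name")
        if name:
            features.add("action::" + name)
    return features
```

Because a manifest declares at most a few dozen such entries, the full feature set can be handed to the classifier without any selection step.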

In terms of the feature selection model, class-wise selections have slightly outperformed their naive counterparts on average. This outcome is reasonable: since the feature sets selected by the naive techniques are dominated by malicious features and contain features specific to the malware families seen in the training data, their accuracy results have dropped more, while the class-wise selections have performed better thanks to benign features that are independent of any malware family. Detailed accuracy results for different feature models and configurations can be found in the Appendix.

Thus, we can conclude for RQ5 that while ngrams of dex and smali bytes yield better results, class-wise selections take advantage of benign features against unseen malware families.
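Class-wise selection, as described, keeps a separate top-k list per class instead of one global ranking, which guarantees that benign-representative features survive even when malicious features dominate the scores. A hedged sketch (the paper's exact procedure may differ in detail):

```python
def classwise_select(scores, class_of, k):
    """Pick the k top-scoring features from each class instead of 2k global winners.

    scores   : dict mapping feature -> selection score
    class_of : dict mapping feature -> 'benign' or 'malicious'
               (the class whose documents the feature represents)
    """
    selected = []
    for cls in ("benign", "malicious"):
        pool = [f for f in scores if class_of[f] == cls]
        pool.sort(key=lambda f: scores[f], reverse=True)
        selected.extend(pool[:k])
    return selected
```

With a malicious-dominated ranking, a naive global top-2k would discard the benign features entirely; the class-wise split retains them.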

Table 9: The comparison results of related studies

Study                 Ben./Mal.      Alg.       Acc.     F1      AUC
Shabtai et al. [22]   1878/407       Bayesian   91.80%   N/A     0.945
Yerima et al. [28]    1000/1000      Bayesian   93.00%   N/A     0.977
DroidMat [25]         1500/238       k-NN       97.87%   0.918   N/A
Drebin [5]            124,453/5560   SVM        93.90%   N/A     N/A
Our study             3600/3600      RF         98.33%   0.982   0.996

4.5 Comparison with Related Studies

Although we have used only a small portion of its dataset, we have outperformed the results of the Drebin paper for most of the feature models that we have experimented with. Even though the standalone ngram models of opcode sequences and constant pool strings, and the signature models consisting of field and method fingerprints and file digests, could not exceed the 94% accuracy of the Drebin paper, their results are still competitive and promising. Considering that the Drebin paper and most of the other studies relying on static analysis techniques combine code-based and permission-based features, we have also combined our best performing ngram models, the 4-grams of classes.dex and the 5-grams of AndroidManifest.xml files, in order to make our approach comparable with these studies. The results of this combined model and of similar studies can be seen in Table 9.

[Figure 7: Drop ratios in accuracy results against unseen malware families (RF, 8000 features); the accuracy difference is divided by the accuracy of the default portion. Bars compare IG and NAD selection for each feature model: Bertillonage 3-grams, 4-grams, classes, fields and methods; constant pool 4-grams; class 3- and 4-grams; opcode 4-grams; dex 3- and 4-grams; SHA-1 digests; XML 3-, 4- and 5-grams; XML permissions; and smali 3-, 4- and 5-grams.]
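The combined model can be reproduced in outline by extracting raw byte n-grams from each file type and concatenating their presence indicators over a fixed, pre-selected vocabulary. A minimal sketch (the vocabulary would come from the feature selection stage; the `("dex", …)`/`("xml", …)` tagging is our own illustration):

```python
from collections import Counter

def byte_ngrams(data, n):
    """Count overlapping n-grams of raw bytes (e.g. n=4 for classes.dex)."""
    return Counter(bytes(data[i:i + n]) for i in range(len(data) - n + 1))

def combined_presence(dex_bytes, xml_bytes, vocabulary):
    """Binary feature vector over a fixed vocabulary of (source, ngram) pairs,
    combining 4-grams of classes.dex with 5-grams of AndroidManifest.xml."""
    grams = {("dex", g) for g in byte_ngrams(dex_bytes, 4)} | \
            {("xml", g) for g in byte_ngrams(xml_bytes, 5)}
    return [1 if f in grams else 0 for f in vocabulary]
```

Each app then yields one fixed-length binary vector that a classifier such as Random Forest can consume directly.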

5. CONCLUSION AND FUTURE WORK

In this study, we have presented an experimental framework to investigate the usage of different reverse engineering files with varying feature models and feature selection techniques for the detection of malicious Android apps. Our experiment results have shown that the ngram bytes of these files yield better accuracy results than the meaningful signature strings extracted from them. Among the range of reverse engineering products, the classes.dex and AndroidManifest.xml files with 4-gram and 5-gram models have become the best performing feature models, the combination of which has yielded 98.33% accuracy for the Random Forest classifier with 8000 features.

Moreover, we have proposed a novel feature selection methodology that can compete with information gain scoring by prioritizing the distinguishing level of a feature over its document frequency. Additionally, by modifying these two techniques with a class-wise approach, the classifications against unseen malware families have gained an advantage over the naive ones.

Due to the time constraints of this study, we have performed our experiments with portions of the Drebin dataset corresponding to 30% of the whole dataset. Additionally, we have only experimented with a normalization degree of k = 2 for the normalized angular distance. For future work, we plan not only to perform these experiments with the whole Drebin dataset, but also to attack the redundancy problem of our feature models. Lastly, by adjusting the normalization degree (k), we aim to adapt our normalized angular distance for multi-class problems.


Family Distributions


Selection Distributions


Thank you for your attention

Q&A