model checking for mobile android malware evolution · malware evolution and to find...

7
Model Checking for Mobile Android Malware Evolution Aniello Cimitile * , Francesco Mercaldo , Fabio Martinelli , Vittoria Nardone * , Antonella Santone * , Gigliola Vaglini * Department of Engineering, University of Sannio, Benevento, Italy {cimitile, vnardone, santone}@unisannio.it Institute for Informatics and Telematics, National Research Council of Italy (CNR), Pisa, Italy {fabio.martinelli, francesco.mercaldo}@iit.cnr.it Department of Information Engineering, University of Pisa, Pisa, Italy [email protected] Abstract—Software engineering researchers have largely demonstrated that newer versions of software make use of previous versions. No exception to this rule for the so-called malicious software, that frequently evolves in order to evade the detection by antimalware. As matter of fact, mobile malicious programs, such as trojans, are frequently related to previous malware through evolutionary relationships. Discovering those relationships and constructing a phylogenetic model is expected to be helpful for analyzing new malware and for establishing a principled naming scheme. In this paper we propose a model checking based method to infer mobile malware phylogenetic trees. We demonstrate, implementing our approach in the droid- Sapiens tool, that mobile malware families come from an ancestor and they influence own descendant, basing on the payload that they exhibit. Keywords-security; malware; evolution; phylogenesys; model checking I. I NTRODUCTION Systematic code reuse has been an elusive goal for software engineering practice since the term was first coined: it repre- sents the use of existing software, or software knowledge, to build new software, following the reusability principles [10]. Malware, as any software, evolves. However new malware is seldom created from scratch: more frequently they are generated, with the help of automatic tools [23]), using third- party libraries and code borrowed from existing malicious software [14] that is accessible by a robust network for malware code exchange. In last years, we are witnessing the growing phenomenon of mobile apps, i.e., applications developed to run on devices such as smartphones or tablets. At June 2016, there were 2.2 million available apps at Google Play store and two billion apps available in the Apple’s App Store 1 . As matter of fact, with particular regards to mobile envi- ronment, malware frequently evolves due to rapid modify- and-release cycles, creating numerous strains of a common form [9], [22]. The result is a tangled network of derivation relationships between malicious programs. In biology such a relationship network is called a “phylogeny”; significant 1 https://www.statista.com/topics/1002/mobile-app-usage/ recent efforts in bioinformatics involve automatically con- structing meaningful phylogeny models based on information in nucleotide, protein, or gene sequences [18]. Reconstructing malware phylogenies using automatic techniques is expected to help in forensic malware analysis [13]. It could provide clues for the analyst, particularly in terms of understanding how new samples relate to previously seen samples. Useful phylogenies could also serve as a principled basis for naming malware. Many efforts have been performed by antimalware vendors to protect users from the mobile threats. The main problem related to signature-based malware detection is rep- resented by the fact that a signature must be contained in the antimalware repository in order to recognize the threats. However, in the frenetic world of mobile malware due to the rapid development of new releases of infections every day, it is impossible to avoid the spread of infections considering that the signature extraction process is laborious and the new sample must be widespread in order to be analyzed and added to repository. This is a very appealing scenario for malware writers, that develop code more and more aggressive to infect users, modifying the malicious action (i.e., the so- called payload) to continuously evade the detection provided by current antimalware technologies making useless the ex- tracted signature. Starting from these considerations, in order to help the mobile malware analyst to efficiently identify the relationship between mobile malware samples to reduce the signature extraction time, in this paper we propose a model checking based approach to build evolutionary relationship between Android malware families. A malware family groups together a set of malware programs with common behaviors and properties that are common to all its member [27]. We infer a phylogenetic tree by analyzing system call traces extracted using dynamic analysis, because malware authors usually try to hide the derivation relationships adopting several techniques, including garbage insertion, code reordering, and instruction substitution that typically invalidate static analysis [20], [19] not dynamic one [5]. From the system call traces we generate a formal model used to verify properties char- acterizing the malware behavior. In this paper we use the Calculus of Communicating Systems (CCS) [21] to specify the model and the selective mu-calculus logic [3] to express

Upload: others

Post on 10-Jul-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Model Checking for Mobile Android Malware Evolution · malware evolution and to find ancestor-descendant relation-ships between malware samples implemented in the droid-Sapiens tool

Model Checking for Mobile AndroidMalware Evolution

Aniello Cimitile∗, Francesco Mercaldo†, Fabio Martinelli†, Vittoria Nardone∗, Antonella Santone∗, Gigliola Vaglini‡∗Department of Engineering, University of Sannio, Benevento, Italy

{cimitile, vnardone, santone}@unisannio.it†Institute for Informatics and Telematics, National Research Council of Italy (CNR), Pisa, Italy

{fabio.martinelli, francesco.mercaldo}@iit.cnr.it†Department of Information Engineering, University of Pisa, Pisa, Italy

[email protected]

Abstract—Software engineering researchers have largelydemonstrated that newer versions of software make use ofprevious versions. No exception to this rule for the so-calledmalicious software, that frequently evolves in order to evade thedetection by antimalware. As matter of fact, mobile maliciousprograms, such as trojans, are frequently related to previousmalware through evolutionary relationships. Discovering thoserelationships and constructing a phylogenetic model is expectedto be helpful for analyzing new malware and for establishing aprincipled naming scheme. In this paper we propose a modelchecking based method to infer mobile malware phylogenetictrees. We demonstrate, implementing our approach in the droid-Sapiens tool, that mobile malware families come from an ancestorand they influence own descendant, basing on the payload thatthey exhibit.

Keywords-security; malware; evolution; phylogenesys; modelchecking

I. INTRODUCTION

Systematic code reuse has been an elusive goal for softwareengineering practice since the term was first coined: it repre-sents the use of existing software, or software knowledge, tobuild new software, following the reusability principles [10].

Malware, as any software, evolves. However new malwareis seldom created from scratch: more frequently they aregenerated, with the help of automatic tools [23]), using third-party libraries and code borrowed from existing malicioussoftware [14] that is accessible by a robust network formalware code exchange. In last years, we are witnessingthe growing phenomenon of mobile apps, i.e., applicationsdeveloped to run on devices such as smartphones or tablets.At June 2016, there were 2.2 million available apps at GooglePlay store and two billion apps available in the Apple’s AppStore1.

As matter of fact, with particular regards to mobile envi-ronment, malware frequently evolves due to rapid modify-and-release cycles, creating numerous strains of a commonform [9], [22]. The result is a tangled network of derivationrelationships between malicious programs. In biology sucha relationship network is called a “phylogeny”; significant

1https://www.statista.com/topics/1002/mobile-app-usage/

recent efforts in bioinformatics involve automatically con-structing meaningful phylogeny models based on informationin nucleotide, protein, or gene sequences [18]. Reconstructingmalware phylogenies using automatic techniques is expectedto help in forensic malware analysis [13]. It could provideclues for the analyst, particularly in terms of understandinghow new samples relate to previously seen samples. Usefulphylogenies could also serve as a principled basis for namingmalware. Many efforts have been performed by antimalwarevendors to protect users from the mobile threats. The mainproblem related to signature-based malware detection is rep-resented by the fact that a signature must be contained inthe antimalware repository in order to recognize the threats.However, in the frenetic world of mobile malware due to therapid development of new releases of infections every day,it is impossible to avoid the spread of infections consideringthat the signature extraction process is laborious and the newsample must be widespread in order to be analyzed andadded to repository. This is a very appealing scenario formalware writers, that develop code more and more aggressiveto infect users, modifying the malicious action (i.e., the so-called payload) to continuously evade the detection providedby current antimalware technologies making useless the ex-tracted signature. Starting from these considerations, in orderto help the mobile malware analyst to efficiently identify therelationship between mobile malware samples to reduce thesignature extraction time, in this paper we propose a modelchecking based approach to build evolutionary relationshipbetween Android malware families. A malware family groupstogether a set of malware programs with common behaviorsand properties that are common to all its member [27]. Weinfer a phylogenetic tree by analyzing system call tracesextracted using dynamic analysis, because malware authorsusually try to hide the derivation relationships adopting severaltechniques, including garbage insertion, code reordering, andinstruction substitution that typically invalidate static analysis[20], [19] not dynamic one [5]. From the system call traceswe generate a formal model used to verify properties char-acterizing the malware behavior. In this paper we use theCalculus of Communicating Systems (CCS) [21] to specifythe model and the selective mu-calculus logic [3] to express

Page 2: Model Checking for Mobile Android Malware Evolution · malware evolution and to find ancestor-descendant relation-ships between malware samples implemented in the droid-Sapiens tool

properties. The results of the verification help to identify thecommon malware behavior shared by the malware samples andto infer the ancestor-descendant relationships between them.The premise of our work is that an expanded perspective onmobile malware behavior and in particular the relationshipsbetween malware families will lead the development of moreeffective countermeasures.

The rest of the paper is organized as follows: the nextsection illustrates the proposed approach to track phylogenetictree related to mobile malware; the third section discusses theevaluation; the fourth section discusses the current literatureabout malware phylogenesis and, finally, conclusions are givenin the last section.

II. THE METHOD

In this section we present our methodology for the analysisand detection of Android malware evolution using modelchecking. We suppose the reader is familiar with the Calculusof Communicating Systems (CCS) [21] and model checking.We briefly recall only the selective mu-calculus logic.

A. Selective Mu-Calculus Logic

In this paper we use the selective mu-calculus, introduced in[3], which is a branching temporal logic to express behavioralproperties of systems. The syntax of the selective mu-calculusis the following, where K and R range over sets of actions,while Z ranges over a set of variables:

φ ::= tt | ff | Z | φ ∨ φ | φ ∧ φ | [K]R φ | 〈K〉R φ | νZ.φ | µZ.φ

The satisfaction of a formula φ by a state s of a transitionsystem, written s |= φ, is defined as follows: each statesatisfies tt and no state satisfies ff; a state satisfies φ1 ∨ φ2(φ1∧φ2) if it satisfies φ1 or (and) φ2. [K]R φ and 〈K〉R φ arethe selective modal operators: they require that the formula φis verified after the execution of an action of K, provided thatit is not preceded by any action in K ∪ R. More precisely:(i) [K]R φ is verified by a behavior expression which, forevery performance of a sequence of actions not belonging toR ∪ K, followed by an action in K, evolves in a behaviorexpression obeying φ. (ii) 〈K〉R φ is verified by a behaviorexpression which can evolve to a behavior expression obeyingφ by performing a sequence of actions not belonging to R∪Kfollowed by an action in K.

As in standard mu-calculus, µZ.φ is the least fixpoint ofthe recursive equation Z = φ, while νZ.φ is the greatest one.

B. Model Checking for Malware Evolution

We propose a formal based methodology to infer mobilemalware evolution and to find ancestor-descendant relation-ships between malware samples implemented in the droid-Sapiens tool. The tool is freely available by contacting oneof the authors. The workflow of the method that we are goingto propose is summarized in Fig. 1. We now analyze eachprocess of our methodology in details.Generation of system call execution traces. The aim ofthis process is to capture and store, in a textual format,

the system call traces generated by Android applications. Tothis aim, the APK is installed and started on an Androiddevice emulator. Successively, the BOOT COMPLETED eventis generated and sent to the emulator more that once andthe correspondent sequence of system calls is gathered. Weselect this event because is able to activate the majority ofwidespread malware payloads [27]. The process is handled bya set of shell scripts developed by authors that perform thefollowing actions for each application: (i) starting the targetAndroid device emulator; (ii) installing the application onthe device emulator; (iii) waiting until a stable state of thedevice is reached (i.e., when epoll wait is executed and theapplication waits for user input or a system event to occur);(iv) starting the capture of system call traces; (v) sending theBOOT COMPLETED event to the application; (vi) capturingsystem calls made by the application until a stable state isreached; (vii) stopping the system call capture and save thecaptured system call trace; (viii) stopping the Android deviceand revert its disk to a clean baseline snapshot.XES-based Event Stream Generation. The process aimsto clean, filter and convert the system call traces collectedin textual format in the previous process into eXtensibleEvent Stream (XES)-compliant log format (IEEE XML-basedstandard for event logs2). The extracted traces are in a textualformat that needs to be cleaned, filtered and converted to anXES event stream. During this conversion, all the unnecessaryinformation, as the system call arguments, is filtered out.Property based reduction. In this process we exploit thepowerful of the selective mu-calculus logic to reduce the tracesexpressed in XES format. The behavior of a system expressedwith the execution traces is observed in an “abstract” way,disregarding all the non-interesting actions occurring in thetraces. This abstraction can be used, in the next process, as amethod for efficiently verifying properties of formal models. Infact, while properties are usually checked on transition systemsdescribing the complete behavior of a process, many propertiesdepend only on the behavior of the system with respect toa small subset of actions: thus, they can be more efficientlychecked on a transition system which contains only the actionsin this subset and which behaves, with respect to them, as theoriginal one. The abstraction is easily defined since, given aselective mu-calculus logic formula ϕ, expressing the propertywe want to verify, the set of interesting actions, called ρ,is composed of all the actions syntactically occurring in ϕ.The actions in ρ are the only actions relevant for checkingthe formula ϕ and different transition systems that behave inthe same way w.r.t. ρ satisfy all the formulae with occurringactions in ρ. Thus, given ρ and a set of traces, the abstractionfunction deletes the non-interesting actions from the traces;the reduced transition system we produce maintains, w.r.t. ρ,the behavior of the original system.Model Discovery. From the abstracted traces we generatea formal model, i.e., a CCS process, that can be used toanalyze and infer the evolution of mobile malware. We use a

2http://www.xes-standard.org/.

Page 3: Model Checking for Mobile Android Malware Evolution · malware evolution and to find ancestor-descendant relation-ships between malware samples implemented in the droid-Sapiens tool

Fig. 1. The droidSapiens workflow.

general syntactic transformation function T which transformssystem call execution traces into a CCS model. The reader canrefer to [25] for more details about the function T . Roughlyspeaking, the function T consists of an iterative procedure thatstarts from an initial raw main process, which includes all thesystem traces as alternative branches. At each step, a morecompact process is obtained, through (sub)processes mergeand reduction. The model described by the CCS processesis simpler and more compact than one directly given as atransition system; moreover, all model checking environmentcan easily obtain the corresponding transition system.Formal analysis of malware evolution. This last processaims at discovering Android malware evolution through modelchecking of selective mu-calculus formulae. In our approach,we invoke the Concurrency Workbench of New Century(CWB-NC) [7] as formal verification environment. SinceCWB-NC is no longer in active development, as future workwe want to substitute CWB-NC with CAAL (standing forConcurrency workbench developed at AALborg university)[1],which supports CCS as input specification language (as CWB-NC), but uses a more efficient algorithm to perform modelchecking. Actually, there exist mature tools with moderndesigns like CADP [12] with expressive input languages andefficient analysis methods. However, the aim of this paper isto develop an initial rapid research prototype to evaluate ourmethodology. The idea of our method is to formally inferderivation relationships or other aspects of evolution. Moreprecisely, the core of our method is as follows:First step. Initially, we recognize the distinctive features ofthe malware behaviour with respect to a specific malwarefamily. This specific behaviour is written as a set of propertiesexpressed in selective mu-calculus logic. To specify the prop-erties, we manually inspected a few samples in order to findthe malware malicious behavior implementation at system callexecution traces level. We want to find relationships betweenmalware. Thus, we check each formula ϕF characterizingthe specific family F on each app of all the families of a

fixed dataset. We report, for each formula, the number andthe percentage of the apps that satisfy ϕF . From this analysiswe consider the family X as ”ancestor” of the family Y ifthe formula ϕX , characterizing the family X , is true on morethan the 35% of the apps belonging to Y . The choice of thisvalue comes from experimental evaluations. The reason is thatif “many” apps, belonging to the family Y , satisfy the formulacharacterizing the family X it means that the family Y sharea common malware behavior and that X can be considered asan ancestor of Y .Second step. To validate the previous result, the logic for-mulae which identify specific malware behaviors of a familyare combined to confirm the right evolutionary relationshipsamong families. For example, if X is the “ancestor” of thefamily Y , we also check the formula ϕ = ϕX ∨ϕY on all theapps and we compare the results of this verification with thoseof the verification of ϕX . It must hold that the percentageof the apps of the family X that satisfy ϕ is very close tothe percentage of the apps of the family X that satisfy ϕX ,meaning that the family Y introduces new malware behaviorthat family X did not have, thus no many other apps are keptby the introduction of the formula ϕY in the formula ϕ. Thesame holds for the other families different from Y . On theother hand, the percentage of the apps of the family Y thatsatisfy ϕ must be much bigger than the percentage of the appsof the family Y that satisfy ϕX , indicating that a new malwarebehaviour has been introduced in the family Y captured by theformula ϕY .

As shown in the next section, experimental results suggestthat the approach is remarkably accurate and deals efficientlywith consistent databases of Android malware samples.

The novelty of the approach is the use of temporal logicformulae to infer malware evolution, differently from [15]where vector similarity is used or from [6], [8], where graphmatching is adopted. To the Authors’ knowledge, modelchecking has never used before for the malware evolution. Themain distinctive features of the approach proposed in this paperare: (i) the use of formal methods to infer malware evolution;

Page 4: Model Checking for Mobile Android Malware Evolution · malware evolution and to find ancestor-descendant relation-ships between malware samples implemented in the droid-Sapiens tool

TABLE ITHE DATASET.

Family #samples dateGeinimi 73 12-2010Plankton 81 06-2011DroidKungFu 183 08-2011Opfake 423 2013FakeInstaller 98 2014

(ii) to contribute to the current mobile malware research bypointing to the evolution of possible vulnerabilities concern-ing the Android platform; (iii) to demonstrate that Androidmalware is not developed by zero, but it usually derives fromprevious malware that mutates adding functionalities to evadethe detection, but preserving part of the previous behaviour;(iv) relating to the previous point, to propose an useful methodto malware analysts to predict future threats; (v) robustnesswith respect to trivial code obfuscation techniques.

III. THE EVALUATION

In this section we present the preliminary results of theexperiments we performed in order to demonstrate the effec-tiveness of our method to track mobile malware phylogenesys.The results have been obtained through the prototype tooldroidSapiens. We apply our method on a malware datasetcomposed by 858 samples grouped in 5 malware familiesas Table I shown. The table reports the families involved inthe evaluation, with details about the number of the samplesbelonging to each family (#samples column) and the discoverydate with the month (when available) and the year (datecolumn).

We retrieved the Android malware applications from bothGenoma [27] and Drebin [2] dataset.

The dataset quality was verified using the VirusTotal ser-vice3 consisting of 57 antimalware running on the datasetapplications. This service allows to filter the malware appli-cations that were not recognized as malicious from at leastfive antimalware. The common characteristics of the malwareapplications belonging to a specific family are captured by alogic formula. In particular, we have 5 logic formulae for the5 families of our dataset: ϕG for the Geimini family; ϕP forthe Plankton family; ϕDKF for the DroidKungFu family; ϕOFfor the Opfake family; ϕFI for the FakeInstaller family.The above formulae have been specified after a manual inspec-tion of a few samples in order to find the malware maliciousbehavior at system call execution traces level. For example,property ϕP is the following: 〈writev〉∅ 〈fstat64〉∅ tt, mean-ing that it is possible that a writev action occurs followed bythe action fstat64, where writev system call writes buffersof data to the file associated with the file descriptor andfstat64 gets information about a file with the request of thepermissions on all of the directories in paths that lead to thatfile. For lack of space we omit the definition of all the otherformulae, but all the formulae describe the malicious behaviorof the respective family.

3http://virustotal.com

Following the methodology explained in the previous sec-tion, we construct the Table II which shows, for each formulaϕF (where F is a family of our dataset), the number (resp. thepercentage) of apps satisfying ϕF . Note that when verifyingthe formula ϕF , the higher percentage (in bold in the table) isobtained on the family F , confirming that the formula reallycharacterizes the family.

From the analysis of Table II, we infer the malware familiesphylogenetic tree, as described in Fig. 2. In our experiment,it holds that Geinimi is the ancestor of both Plankton andDroidKungFu, since ϕG is true on more that the 35% of theapps of these families. Moreover, for the same reason, wecan infer that DroidKungFu is the ancestor of Opfake andOpfake is the ancestor of FakeInstaller, while Plankton has nodescendant. This is also confirmed by the discovery date ofTable I.

The phylogenetic tree is also validated by the analysis ofmalicious behaviours characterizing the families. As matterof fact, Plankton has no descendant because it representsthe first example of update-attack malware, i.e., malware thatdo not embed the malicious payload at installation time butit will be downloaded at run-time. This behaviour is notrepresented in other families and for this reason Planktondoes not exhibit descendants. His ancestor is Geinimi because,in according to [27] both Plankton and Geinimi make anextensive use of network resources to steal sensitive infor-mation using a botnet without using SMSs to fraud creditto victims. Also DroidKungFu uses network resource withoutsend fraud SMSs, but differently to Geinimi (and Plankton),it employs the exploid and the RATC/Zimperlich techniquesto escalate the privileges of infected devices. In addiction tothe information sent by Geinimi family, DroidKungFu sendsthe OS/operator/network type and the information stored in theSD/phone memory to multiple locations. For these reasons, weconsider the Plankton and DroidKungFu at the same tree levelwhile Geinimi is their ancestor because the Geinimi familywas the first in our dataset to introduce the exhaustive usage ofnetwork resources (the SMS fraud will appear later) to collectlocation coordinates, the IMEI and the IMSI. Differently,Plankton introduces a similar kind of infection using theupload-attack and the DroidKungFu family adds to this attackthree different privilege escalation techniques but using theGeinimi technique to embed the payload.

Opfake family samples are descendants of DroidKungFufamily because they introduce the server-side metamorphismand the ability to send out premium-rated SMSs from a user’smobile phone without consent but using network resource inan exhaustive way to send information and to download a fakeversion of Opera browser passing itself off as the installer fora legitimate application.

FakeInstaller family persists to use the network with theusual aim to send sensitive information, but in addition itsends SMS to fraud victims as Opfake samples. In additionto the Opera browser (as Opfake family), it has spoofed theOlympic Games Results App, Skype, Flash Player, and otherapplications.

Page 5: Model Checking for Mobile Android Malware Evolution · malware evolution and to find ancestor-descendant relation-ships between malware samples implemented in the droid-Sapiens tool

TABLE IIFORMAL ANALYSIS OF MALWARE EVOLUTION - FIRST STEP.

aaaaaaaaaaaaFormulae

Family(Number of apps) Geimini

(73)Plankton

(81)DroidKungFu

(183)OpFake(423)

FakeInstaller(98)

ϕG 60 (82%) 38 (46%) 72 (39%) 135 (31%) 14 (14%)ϕP 5 (6%) 54 (66%) 12 (6%) 20 (4%) 13 (13%)ϕDKF 8 (11%) 13 (16%) 145 (79%) 155 (36%) 16(16%)ϕOF 20 (27%) 23 (28%) 55 (30%) 229 (54%) 45 (45 %)ϕFI 18(24%) 15 (18%) 51 (27%) 140(33%) 51 (52%)

Fig. 2. Malware Families Phylogenetic Tree

Summarizing, Opfake and FakeInstaller families, as Geinimiand DroidKungFu ones, embed the malicious code into theapplications differently from Plankton, for this reason Planktondescent line has no descendants, while the other branch of thephylogenetic tree of Fig. 2, sharing the same technique ofpayload installation, represents the families with new behav-iors implemented into the descendants (but keeping part of theancestor family behavior).

According to the second step of our methodology, wecombine the formulae of Table II to validate the inferredphylogenetic tree. The combined formulae and the verificationresults are reported in Table III. The table is divided in threeblocks. The first block of rows concerns the formulae relatingtwo families belonging to the relations ancestor-descendant.Let us consider, for example, Geimini and Plankton. We haveinferred that Geimini is the ancestor of Plankton. Thus, weverify the formula ϕG ∨ϕP and we compare the results of itsverification with those of the verification of the formula ϕG(the ancestor), see the upper graphic in Fig. 3. The comparisonshows that Plankton is really an evolution of Geimini. In fact,on the apps belonging to the Geimini family, we have anincrease of only 7% between the apps that verify ϕG and thosethat verify ϕG ∨ ϕP (82% for ϕG and 89% for ϕG ∨ ϕP ), as

expected. The reason is that Plankton introduces new malwarebehavior that Geimini did not have, thus no many other appsare kept by the formula ϕG∨ϕP . Also a little increment can befound when verifying and compare the above two formulae forthe other families different from Plankton, i.e., DoridKungFu,OpFake and FakeInstaller. In this case the reason is that norelationship has been found among them and Plankton, seeFig. 2. On the other hand, on the apps belonging to Planktonfamily the increase is significant (over than 25%), stating thata new malware behaviour has been introduced in Planktonfamily captured by the formula ϕP .

The other three formulae of the first block of Table III of theform ϕX∨ϕY , with a ancestor-descendant relationship, have asimilar trend, since Geimini is the ancestor of DroidKongFu,DroidKongFu is the ancestor of OpFake and OpFake is theancestor of FakeInstaller, as shown in Fig. 2.

Let us consider the second block of rows of Table III. In thiscase we analyze formulae like ϕX∨ϕY , where no relationshipof ancestor-descendant has been found between X and Y ,like for example Plankton/DroidKongFu, Plankton/OpFake, orPlankton/FakeInstaller. As an example, let us consider thelower graphic in Fig. 3 which shows this situation for X =Plankton and Y = DroidkongFu. Now, we verify ϕX ∨ ϕYand we compare the results with ϕX . The verification resultshighlight that: (i) a significant increment for the apps belong-ing to the DroidKongFu family; (ii) an increment for the appsbelonging to the OpFake family, which is the descendant ofDroidKongFu; (iii) a little increment for the other families.

Finally, in the last row of Table III the formula ϕG ∨ϕDKF ∨ϕOF ∨ϕFI is satisfied by more than 70% by all thefamilies confirming that: (i) “Geimini-DroidKongFu-OpFake-Fakeinstaller” belongs to the ancestor-descendant line of thetree in Fig. 2; (ii) Plankton is the descendant of the Geiminifamily.

Table IV shows the time verification for all samples of ourdataset. With Tex we refer to the time employed to retrievesystem calls (i.e., 60 seconds for each application), with Tmodto the time required to build the model and with Tchk to thetime to verify the properties. The TTOT value is the sum ofall these contributes. The machine used to run the experimentsand to take measurements was an Intel Core i5 desktop with 4gigabyte RAM, equipped with Microsoft Windows 7 (64 bit).

Page 6: Model Checking for Mobile Android Malware Evolution · malware evolution and to find ancestor-descendant relation-ships between malware samples implemented in the droid-Sapiens tool

TABLE IIIFORMAL ANALYSIS OF MALWARE EVOLUTION - SECOND STEP.

aaaaaaaaaaaaFormulae

Family(Number of apps) Geimini

(73)Plankton

(81)DroidKungFu

(183)OpFake(423)

FakeInstaller(98)

ϕG ∨ ϕP 65 (89%) 59 (72%) 81 (44%) 136 (32%) 27 (27%)ϕG ∨ ϕDKF 65 (89%) 50 (61%) 152 (83%) 286 (67%) 30 (30%)ϕDKF ∨ ϕOF 28 (38%) 30 (37%) 151 (82%) 250 (59%) 60 (61%)ϕOF ∨ ϕFI 21 (28%) 23 (28%) 61 (33%) 248 (58%) 51 (52%)ϕP ∨ ϕDKF 13 (17%) 65 (80%) 150 (81%) 173 (40%) 16 (16%)ϕP ∨ ϕOF 21 (28%) 62 (76%) 65 (35%) 230 (54%) 58 (59%)ϕP ∨ ϕFI 19 (26%) 60 (74%) 61 (33%) 141 (33%) 64 (65%)ϕG ∨ ϕDKF ∨ ϕOF ∨ ϕFI 70 (95%) 57 (70%) 160 (87%) 335 (79%) 76 (77%)

Geinimi Plankton DKungFu Opfake FakeInstaller0%

50%

100%

82

46

39

31

14

89

72

44

32

27Perc

enta

ge

ϕG ϕG ∨ ϕP

Geinimi Plankton DKungFu Opfake FakeInstaller0%

50%

100%

6

66

6 4

1317

80 81

40

16

Perc

enta

ge

ϕP ϕP ∨ ϕDKF

Fig. 3. Comparison between formulae.

TABLE IVTIME VERIFICATION IN SECONDS

Family Tex Tmod Tchk TTOT

Geinimi 4380 2.173 3.386 4385.559Plankton 4860 1.386 2.481 4863.867DroidKungFu 10980 7.791 8.141 10995.932Opfake 25380 2.34 11.576 25393,916FakeInstaller 5880 1.266 2.403 5883.669

IV. RELATED WORK

Several studies are focused on the malware phylogenytopic and, generally speaking, on malware evolution one:in this section we discuss both of them. In [27], authorsmanually analyze more than 1,200 malware samples to coverthe majority of existing Android malware families, rangingfrom their debut in August 2010 to recent ones in October

2011. Their characterization and a subsequent evolution-basedstudy of representative families reveal that they are evolvingrapidly to circumvent the detection from existing mobile anti-virus software. Ramu [22] provides an overview of evolutionfor mobile malware, attack vectors, detection methodologiesand defense mechanisms, concluding that are still in its infancystage. He highlights that mobile malware classes have somesimilarity with PC malware, mobile devices have uniquecharacteristics that can be targeted by attackers. Relatingto research phylogeny-oriented, in [15] a method to realizephylogeny models is proposed. Considering a set of features,called n-perms, authors match possibly permuted code in orderto obtain a malware tree. Basing on the observation of theobtained malware tree authors suppose that phylogeny modelsbased on n-perms are able to support the identification ofnew malware variants and the reconciling of inconsistenciesin malware naming. The described method is static and henceit does not require malware execution. Indeed, with respect toour method, this approach is less robust to code obfuscationsthat cannot be represented as permutations, while our method,performing a dynamic analysis, is resilient to code obfusca-tions. Authors in [26] discuss a framework defining malwareevolution relations in terms of path patterns on derivationgraphs. The limitation of this approach is that the model’sdefinition of source code excludes machine generated code.This represents a restriction, considering that usually maliciouscode is automatically generated from the existing maliciousapplications. In [17], authors propose a data-centric approachbased on packets inspections aiming to automatically iden-tify shellcode similarity. They basically generate a shellcodephylogeny for a given vulnerability. The main difference withrespect to our method is represented by the fact that we executethe samples in order to perform a more deep analysis. TheProcess Mining-based approach proposed in [4] consists inanalyzing system call traces gathered from Android malwareand trusted applications to identify a set of relationships andrecurring execution patterns that characterize their respectivebehavior. Researchers in [16] collect logs derived from instruc-tions executed, memory and register modifications in orderto build the phylogenetic tree at runtime. Furthermore, theyshow that network resources were useful for visualizing shortnop-equivalent code metamorphism than trees. They consider

Page 7: Model Checking for Mobile Android Malware Evolution · malware evolution and to find ancestor-descendant relation-ships between malware samples implemented in the droid-Sapiens tool

execution trace of alternating APIs and user procedures: in thiscase the difference with respect to our method is representedby the fact that we consider system calls trace, that are notaffected by API version.

V. CONCLUSION

In this paper we propose an approach based on modelchecking able to track mobile malware phylogenesys imple-mented in the droidSapiens tool. We extract syscall and fromthe retrieved traces we build the phylogenetic tree identifyingthe ancestor and the descendant between mobile malwarefamilies. As future work we intend to investigate the use ofthe k-bsimulation [11], [24] to measure the similarity amongmalware families. Furthermore, we intend to investigate themultiple ancestors. Comparisons of droidSapiens with othertools will be performed.

ACKNOWLEDGMENTS

This work has been partially supported by H2020 EU-funded projects NeCS and C3ISP and EIT-Digital Project HII.The authors thank Rebecca Viglione for her support duringthe experiment.

REFERENCES

[1] J. R. Andersen, N. Andersen, S. Enevoldsen, M. M. Hansen, K. G.Larsen, S. R. Olesen, J. Srba, and J. K. Wortmann. Caal: Concurrencyworkbench, aalborg edition. Theoretical Aspects of Computing–ICTAC2015, page 573, 2015.

[2] D. Arp, M. Spreitzenbarth, M. Huebner, H. Gascon, and K. Rieck.Drebin: Efficient and explainable detection of android malware in yourpocket. In NDSS, 2014.

[3] R. Barbuti, N. De Francesco, A. Santone, and G. Vaglini. Selective mu-calculus and formula-based equivalence of transition systems. Journalof Computer and System Sciences, 59(3):537–556, 1999.

[4] M. L. Bernardi, M. Cimitile, and F. Mercaldo. Process mining meetsmalware evolution: a study of the behavior of malicious code. In2015 Fourth International Symposium on Computing and Networking(CANDAR), December 2016.

[5] G. Canfora, E. Medvet, F. Mercaldo, and C. A. Visaggio. Detectingandroid malware using sequences of system calls. In Proceedings ofthe 3rd International Workshop on Software Development Lifecycle forMobile, DeMobile 2015, pages 13–20, New York, NY, USA, 2015.ACM.

[6] E. Carrera and G. Erdelyi. Digital genome mapping. In Virus Bulletin,2004.

[7] R. Cleaveland and S. Sims. The ncsu concurrency workbench. In R. Alurand T. A. Henzinger, editors, CAV, volume 1102 of Lecture Notes inComputer Science. Springer, 1996.

[8] T. Dullien and R. Rolles. Graph-based comparison of executable objects(english version). SSTIC, 5:1–3, 2005.

[9] A. P. Felt, M. Finifter, E. Chin, S. Hanna, and D. Wagner. A survey ofmobile malware in the wild. In Proceedings of the 1st ACM workshopon Security and privacy in smartphones and mobile devices, pages 3–14.ACM, 2011.

[10] W. B. Frakes and K. Kang. Software reuse research: Status and future.IEEE transactions on Software Engineering, 31(7):529–536, 2005.

[11] N. D. Francesco, G. Lettieri, A. Santone, and G. Vaglini. Grease: Atool for efficient ”nonequivalence” checking. ACM Trans. Softw. Eng.Methodol., 23(3):24:1–24:26, 2014.

[12] H. Garavel, F. Lang, R. Mateescu, and W. Serwe. CADP 2011: atoolbox for the construction and analysis of distributed processes. STTT,15(2):89–107, 2013.

[13] M. Hayes, A. Walenstein, and A. Lakhotia. Evaluation of malware phy-logeny modelling systems using automated variant generation. Journalin Computer Virology, 5(4):335–343, 2009.

[14] J. Jang, D. Brumley, and S. Venkataraman. Bitshred: feature hashingmalware for scalable triage and semantic analysis. In Proceedings ofthe 18th ACM conference on Computer and communications security,pages 309–320. ACM, 2011.

[15] M. E. Karim, A. Walenstein, A. Lakhotia, and L. Parida. Malwarephylogeny generation using permutations of code. Springer, 2005.

[16] W. M. Khoo and P. Lio. Unity in diversity: Phylogenetic-inspiredtechniques for reverse engineering and detection of malware families.In SysSec Workshop (SysSec), 2011 First, pages 3–10. IEEE, 2011.

[17] J. Ma, J. Dunagan, H. J. Wang, S. Savage, and G. M. Voelker. Findingdiversity in remote code injection exploits. In Proceedings of the 6thACM SIGCOMM Conference on Internet Measurement, IMC ’06, pages53–64, New York, NY, USA, 2006. ACM.

[18] W. P. Maddison and D. R. Maddison. Macclade: analysis of phylogenyand character evolution. Evolution (PMBD, 185908476), 1992.

[19] F. Mercaldo, V. Nardone, A. Santone, and C. A. Visaggio. Downloadmalware? no, thanks. how formal methods can block update attacks. InFormal Methods in Software Engineering (FormaliSE), 2016 IEEE/ACM4th FME Workshop on, pages 22–28. IEEE, 2016.

[20] F. Mercaldo, V. Nardone, A. Santone, and C. A. Visaggio. Ransomwaresteals your phone. formal methods rescue it. In International Confer-ence on Formal Techniques for Distributed Objects, Components, andSystems, pages 212–221. Springer, 2016.

[21] R. Milner. Communication and concurrency. PHI Series in computerscience. Prentice Hall, 1989.

[22] S. Ramu. Mobile malware evolution, detection and defense. EECE571B, Term Survey Paper, 2012.

[23] V. Rastogi, Y. Chen, and X. Jiang. Catch me if you can: Evaluatingandroid anti-malware against transformation attacks. IEEE Transactionson Information Forensics and Security, 9(1):99–108, 2014.

[24] G. D. Ruvo, G. Lettieri, D. Martino, A. Santone, and G. Vaglini. k-bisimulation: A bisimulation for measuring the dissimilarity betweenprocesses. In Formal Aspects of Component Software - 12th Interna-tional Conference, FACS 2015, Niteroi, Brazil, October 14-16, 2015,Revised Selected Papers, volume 9539 of Lecture Notes in ComputerScience, pages 181–198. Springer, 2015.

[25] A. Santone and G. Vaglini. Conformance checking using formalmethods. In Proceedings of the 11th International Joint Conference onSoftware Technologies (ICSOFT 2016) - Volume 1: ICSOFT-EA, Lisbon,Portugal, July 24 - 26, 2016., pages 258–263. SciTePress, 2016.

[26] A. Walenstein and A. Lakhotia. A transformation-based model ofmalware derivation. In Malicious and Unwanted Software (MALWARE),2012 7th International Conference on, pages 17–25, 2012.

[27] Y. Zhou and X. Jiang. Dissecting android malware: Characterizationand evolution. In Security and Privacy (SP), 2012 IEEE Symposium on,pages 95–109. IEEE, 2012.