
HaCS: Evaluating Hardware Computational Signatures for Mobile Malware Detection

Abstract

Malware families have distinct hardware computational signatures that can be used to recognize existing malware at runtime. However, systematically evaluating malware detectors when malware samples are hard to run correctly and can adapt their computational characteristics is a hard problem.

We introduce HaCS, a malware analysis platform that includes both extant mobile malware and a synthetic malware generator that can be configured to generate a computationally diverse set of malware samples. HaCS also includes a set of computationally diverse benign applications into which malware can be repackaged, along with a recorded trace of over one hour of real human usage for each app. HaCS thus enables an analyst to replay realistic malware and benign executions and to test malware detection schemes in a reproducible manner. Using HaCS, we demonstrate that malware can evade the best known malware-signature based detection by modifying its computation while retaining its behavior and efficiency. We then construct a novel anomaly detection technique that builds models of benign programs' computations and detects malware as computational anomalies.

We quantitatively demonstrate that benign applications' signatures can be learnt even with noise due to variations in human usage and system state, and that malware payloads that are relatively small at the system and user level (such as stealing tens of SMSs or photos) measurably alter a benign program's hardware signature. Interestingly, malware features such as Java reflection and encryption that thwart static analyses in turn make HaCS analyses work better: malware computations become more anomalous. We further show that false positives are a concern with computational analyses (∼6% for ∼80% true positives) and that more work is required to incorporate higher-level semantic signals to reduce false positives.

1 Introduction

Mobile malware is an important and growing problem [1, 2, 3, 4]. Of the 1M or so applications (or "apps") in the official Google Play store, 42,000 were classified as malware in 2013, up from 10,000 of 0.4M apps in 2011 [2]. Third-party app stores contain even more malware. Trend Micro has reported detecting 1.3M unique samples of malware across all third-party app stores in 2013 [4]. Users looking for free or discontinued versions of popular applications often end up with the application repackaged with a malicious payload. Although Google uses tools such as Bouncer [5] and Verify [6] to analyze apps, researchers have shown that these tools can be evaded relatively easily [7].

Dynamic analysis is a particularly effective technique to detect mobile malware [8]. One reason is that developer errors leave apps vulnerable at run-time. For example, web views used to embed webpages in mobile apps have bugs that can be exploited using malicious inputs [9]. Alternately, a poorly designed code update mechanism in a third-party library (e.g., the "AdVulna" library) can allow untrusted, unsigned code to be loaded at run-time and executed with the permissions of the underlying app [10]. Obfuscated code using reflection and encryption also hampers static analysis. In such cases, dynamic analyses complement static analyses to detect malware.

Hardware computational signals (the dynamic instruction stream of a program and its effect on micro-architectural structures) can potentially be used to identify malware. Hardware analyses have the advantage that even if operating system (OS) level analyses are compromised, malware execution should still be visible at the hardware level. A hardware monitor offers a smaller trusted code base compared to OS-level analyses, can be executed in a secure container [11], and can potentially be used in conjunction with OS-level analyses to improve detection.

For example, Demme et al. [12] use performance counters as hardware signals of mobile (Android) applications, both malicious and benign, and train classifiers on existing malware and known benign programs. Such malware-signature detection yields close to 80% true positives with around 20% false positives for thread-level detection of a set of existing malware. This is an encouraging start given the absence of OS- and app-level semantic information.

In this paper, we evaluate hardware-signal-based detection against the unique attributes of modern mobile malware. First, mobile malware samples today have a very short life span [13], and ensuring that a malware payload has executed correctly requires care. Its command-and-control (C&C) servers' functionality may have to be reverse-engineered and replayed; its checks for a specific geographical location, an Android platform version, or even an emulator-based execution have to be bypassed; and specific user actions that trigger the payload have to be executed. Further, the mobile device has to have real data (SMSs, photos, etc.) and network connectivity so that the payload can execute realistically (e.g., exfiltrate private data). Most malware samples we found online [14, 15, 16] did not execute their payload correctly.

Secondly, since mobile malware spreads primarily through repackaged apps, essentially tricking users into giving up sensitive data [17, 18], malware does not typically rely on root exploits to succeed. For mobile malware, detection thus has to focus on malware payloads.

Thirdly, malware can easily adapt to thwart proposed defenses. Malware can adapt its concrete computation in order to try and evade signature-based detectors while retaining its semantic behavior. Or, malware payloads can be repackaged into a favorable application which already uses sensitive permissions such as GPS, camera, or contacts for legitimate uses. Can computational signatures tell apart benign usage from malicious usage of the same baseline app?


Figure 1: Proposed system to construct computational signatures of benign apps and use them to test against signatures of malware. Malware is allowed to adapt to avoid detection. (Pipeline: 1. Record; 2. Replay; 3. Trace generation (syscalls etc. per app, performance counters); 4. Signature generation; 5. App-signature match between the benign app and the in-field app.)

And finally, current mobile applications include multiple languages, runtimes, sensors, and complex user and network interaction. Will there be a computational difference when a medical app that emails diagnostic data additionally pads the data with some SMSs? We wish to evaluate whether hardware computational signatures are robust under such real-world usage, when applications are used for several minutes at a time by real users driving actual app functionality.

Our approach to answering these questions is to construct a machine learning toolkit to analyze hardware computational signals (HaCS) and then evaluate it against a computationally diverse malware benchmark suite. We report our key contributions and findings below.

• We present a taxonomy of current mobile malware payloads based on prior surveys [19] and our own analysis of 175 malware samples (belonging to 72 malware families from 2012-13). We classify malware behaviors broadly into information stealers, network nodes, and computational nodes, and seed a malware benchmark suite with representative malware samples for each behavior. We reverse-engineer these samples, fix them to work correctly, and complete their C&C server functionality. Further, we present a malware generator that allows an analyst to choose behaviors, specify detailed parameters for each behavior, and select obfuscation techniques and a baseline application to repackage the payload into, and that automatically generates a computationally diverse set of malware samples.

• We prototype HaCS using two Android development boards to collect performance counter traces. We drive nine popular and diverse baseline applications using human user input for over 1 hour each; we repackage each baseline application with four extant malware and 66 generated malware samples; we construct computational signatures using two machine learning algorithms; and we then test all 70 malware samples with over 7.5 minutes each of real human usage. Beyond confirmed payload execution and a diverse set of baseline apps, HaCS's evaluation is based on an order of magnitude longer user traces than the state of the art [12].

• We demonstrate that a blacklisting approach, based on recognizing known malware signatures, can be evaded by repackaged malware that adapts its computation to achieve the same behavior using different concrete executions. Importantly, malware can do so without compromising its payload's performance (Section 2).

• We propose an alternate approach of constructing signatures of benign apps and detecting malware as computational anomalies. Our key insight is that benign apps have commonly executed behaviors that can be modeled through dynamic traces from running test inputs and replaying real user inputs. On the other hand, malware will deliberately obfuscate its payload executions and evade a detector that looks for its signature.¹

• We achieve good detection rates for several malware-app combinations. Surprisingly, even a small payload such as stealing tens of photos, SMSs, or contacts in a batch detectably alters the compute signature of a complex app like Angry Birds. User-driven apps like Sana-MIT or network-driven apps like TuneIn Radio are even more sensitive to perturbation. On average, the false-positive rate for 30 seconds to 1 minute of benign app usage is close to 10%, which indicates that hardware signals may have to be integrated with OS- or app-level semantics to be used in practice. Obfuscation techniques like reflection and encryption, which thwart static analyses, actually make malware's hardware traces more anomalous. Finally, since very small payloads such as stealing device IDs are not detected, we conclude that malware can still escape, but only by lowering its payload's performance.

In summary, our goal is to enable reproducible research into hardware-based malware detection by providing a platform for correct and consistent execution of representative malware samples.

1.1 System Overview

We consider a system that has two synergistic components. On the server side, a platform provider (e.g., Google) executes applications using test and real user inputs, measures performance counters, and creates a database of computational signatures. The platform provider then attaches the signature to each application in the app store. On client devices, a lightweight system service samples performance counters to create run-time traces from applications, and compares each run-time trace to database entries either on the device or on the server.

Note that the signature database can contain either a blacklist (of malware executions [12]) or a whitelist (of benign executions). In a blacklist analysis, the system has to compare each run-time trace with the entire database looking for a possible match. In a whitelist analysis, each run-time trace purports to belong to a specific app, hence only that app's signature needs to be matched. If malware is detected, the system raises a signal to alert the user and/or the system. The server further refines its compute-signature database using traces uploaded by clients. For prototyping purposes, existing hardware limits us to using 6 performance counters at a time and to using development boards instead of mobile phones.

¹Concurrent work (a preprint by Tang et al. [20]) has evaluated the potential of hardware-based anomaly detectors for desktop malware generated using the Metasploit tool. We focus on mobile malware payloads, which execute regular code but out of context, whereas their goal is to detect abnormal code typical of exploits.


Figure 2: (Issues with blacklist approach) Malware can adapt its computational signature to avoid malware-signature based detection without reducing its own efficiency. Panels: Angry Birds game + click fraud; Sana medical app + SMS stealer; Internet radio + password cracker. The plots show the probability of having seen similar execution phases in the malware training set, so a lower value implies a lower likelihood of being labeled as malware. The test malware (dark line) looks even more unlikely to match the malware training set than the benign programs (dashed line).

2 Motivation

The best known technique of using performance counters to detect malware [12] is similar to the blacklisting approaches used in anti-virus products today. The supervised learning technique trains a classifier on performance counter traces collected by executing instances of a malware family (along with execution traces of arbitrary benign programs). At deployment time, performance counter traces from an actual machine are fed to these classifiers (one per malware family) in order to identify malicious activity.

The Achilles' heel of the current state-of-the-art technique is that malware can adapt its signature to evade detection (Figure 2). A malware developer has the option to modify their code while retaining the malware's overall behavior. Hence, we experiment with training a Markov model on one malware sample's performance counter traces, and testing on performance counter traces from a different concrete execution of the same behavioral payload.

Specifically, the new malware sample splits the original malicious workload into multiple threads (compared to the single-threaded training set), and executes the payload with better performance. We use three benign applications, one that is user-driven, one that is compute-heavy, and a third that is network intensive, and repackage them with malware that (respectively) steals data, fetches web pages to optimize the pages' search engine ratings, and executes a password cracker. We train the Markov model on 3 traces for each app (where each trace is 96s to 240s long), and test it on the modified payload.

We defer the Markov model discussion to Section 4 and plot the results in Figure 2. Figure 2 shows the probability that the testing trace is similar to the training set over time. Our experiment shows that in all three repackaged apps, the malware can find an alternate execution profile that is far removed from its previous signature (over two standard deviations from the training set). In fact, we find that the new malware (dark line) is even more unlikely to be labeled as similar to the training set than the underlying benign program (the dashed line). This evasion shows that an alternative approach is required to detect new malware whose concrete executions can vary even if behavior remains the same.

Figure 3: Malware behaviors observed in a 72-family, 175-sample Android malware set from the Contagio minidump. Most malware steals data or carries out network fraud. However, samples that use phones as compute nodes, e.g., to crack passwords or mine bitcoins, have been reported in 2014.

Behavior              2012    2013
Information Stealers  68.8%   38.2%
Networked Nodes       69.5%   60.6%
Compute Nodes         0%      0%

Figure 4: Examples of malware behaviors and their contribution to the malware dataset.

Networked Node        '12   '13
  Click/apk fraud     50%   83%
  Scams: paid SMSs    35%    6%
  Denial of service   10%    6%

Info Stealers         '12   '13
  Apps info, email    30%   13%
  Browser info        19%   10%
  Other files         15%   13%
  Phone info          65%   66%
  SMS                 58%   46%
  Contacts            53%   33%
  GPS                 27%   18%

3 Computationally Diverse Malware

Our goal is to understand high-level malware behaviors and then generate malware samples that are computationally diverse while achieving their behaviors. Splitting into threads, launching Android services or activities, and inserting arbitrary delays and legitimate-looking API calls into the payload are all examples of computational diversity that do not alter the malware's behavior.

Figure 3 shows our manual classification of malware into high-level behaviors. We studied 53 malware families from 2012 and 19 from 2013 (a total of 175 malware samples in 72 families) downloaded from public malware repositories [15, 21, 16]. We refer the reader to the NCSU malware dataset [14] for a comprehensive study of over 1000 malware samples that traces their families, origins, and behaviors.

While substantially smaller in scale, our classification's goal is to identify knobs to change the computational behavior of malware and to determine concrete values for these knobs (e.g., the amount and rate of data stolen). Furthermore, we found that samples from the NCSU dataset do not reliably execute on current Android machines, so we chose to construct malware that we can run reliably and adapt precisely.


We also use our classification to choose a small but relevant set of existing malware samples.

To classify malware, we disassembled the binaries (APKs on Android) and executed them on both an Android development board and the Android emulator to monitor a) permissions requested by the application, b) middleware-level events (such as the launch of Intents and Services), c) system calls, d) network traffic, and e) descriptions of malware samples from the malware repositories. We describe our key inferences below.

Analyzing payloads instead of exploits. We found that most mobile malware families achieve their end goals by getting users to install overprivileged applications [17, 14] instead of through root exploits [22, 23, 24]. We observed root exploits in 10/143 samples in 2012 and 3/32 samples in 2013. Depending on the Android version, applications can request more than 150 permissions, over 60 of which are labeled dangerous (e.g., access to the SD card). Further, developers ask for more permissions than they need [17], making average users inured to permission requests from applications. This corroborates the NCSU dataset [14], which found 86% of malware to be repackaged.

Behavior-based classification. At a high level, we assigned every malicious payload to one or more of three behaviors: information stealers, networked nodes, and computational nodes (Figure 4).

Information stealers look for sensitive data and upload it to the server. User-specific sensitive data includes contacts, SMSs, emails, photos, videos, and application-specific data such as browser history and usernames, among others. Device-specific sensitive data includes identifiers (IMEI, IMSI, ISDN) and hardware and network information. The volume of data ranges from photos and videos at the high end (stolen either from the SD card or recorded via a surveillance app) to SMSs and device IDs at the low end.

The second category of malicious apps requires compromised devices to act as nodes in a network (e.g., a botnet). Networked nodes can send SMSs to premium numbers and block the owner of the phone from receiving a payment confirmation. Malware can also download files such as other applications in order to raise the ranking of a particular malicious app. Click fraud apps click on specific web links to optimize search engine results for a target.

Given the advances in mobile processors, we anticipated a new category of malware that would use mobile devices as compute nodes, specifically mobile counterparts of desktop malware that runs password crackers or bitcoin miners on compromised machines. This was confirmed by recent malware samples whose payload was to mine cryptocurrencies [25]. We did not observe such malware until early 2014, and hence used a password cracker as the compute-oriented malware payload.

Malware generation. Figure 5 shows the specifics of each malware type we currently include in HaCS. The concrete behavior and intensity of a malware sample is specified in a configuration file. The configuration file also determines how the malware is activated: triggered at boot time, when the repackaged app starts, in response to user activity, or based on commands from a remote (C&C) server. In all cases, malware communicates with the C&C server to transfer data or the results of its actions. The configuration file also specifies the network-level intensity of the malware payload in terms of data packet sizes and inter-packet delays, and the device-level intensity in terms of execution progress (measured in malware-specific atomic functions completed).

Figure 5: Synthetic malware payloads used in our experiments: 4 info stealers, 2 networked nodes, and 1 compute node. The settings represent a small but computationally diverse set of malware behaviors. We also repackage 3 existing malware samples (Roidsec, click fraud, and md5 cracker) into all baseline apps, and Geinimi comes repackaged into the MonkeyJump game. (*Most, but not all, of the workload requires only 100 ms.)

Synthetic Malware         Parameters (number of items)   Delay (ms)       # of RPKG Mal. APKs   Length per Action (sec)   Inst. Count per Action (Million)
Steal files (4.2MB each)  1, 15, 35, 50                  0, 1K, 5K        12                    2.86                      50.97
Steal contacts            25, 70, 150, 250               0, 10, 25        12                    0.36                      67.80
Steal SMSs                200, 400, 700, 1.7K            0, 15, 40        12                    0.12                      25.90
Steal IDs, GPS            data size fixed                0, 200           2                     4*                        39.65
Click fraud (pages)       20, 80, 150, 300               0, 1K, 3K        12                    0.40                      44.40
DDoS (slowloris)          500 connections                1, 40, 80, 200   4                     425                       49.70
SHA1 pass. cracker        10K, 0.5M, 1.5M, 2.5M          0, 20, 40        12                    2.8E-5                    1.9E-2

Specific parameters for the malicious payloads were chosen according to an empirical study of mobile malware [26]. The last two columns in Figure 5 show the average length of an atomic action in the malware payload (discounting delays due to being scheduled out, for example), and the instruction count per action (e.g., stealing 1 photo/contact/SMS, clicking on 1 webpage in click fraud, opening 500 connections and keeping them alive in a DDoS attack, or generating 1 string and computing its hash using SHA1).

The generated malware has a top-level dispatcher service that serves as the entry point to the malicious program; it parses the supplied configuration file, launches the remaining services at random times, and configures them. Malicious services can run simultaneously or sequentially depending on the value of the corresponding option in the configuration file. In some cases, the service that executes a particular malicious activity can itself serve as a dispatcher. For example, the service executing click fraud spawns a few Java threads to avoid blocking on network accesses. Every spawned thread is provided with a list of URLs that it must access. Besides Android services, we register a listener to intercept certain incoming SMS messages, forward them to the C&C server, and remove them from the phone if needed. This listener simulates banking Trojans. In most cases, the results of executing a particular malicious activity are reported to the C&C server via TCP. We found that most apps are obfuscated using the standard ProGuard tool; we also applied ProGuard to the source code of our malicious program when we did not use reflection and encryption.
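For concreteness, the sketch below shows what one generated sample's configuration could look like. The field names and values are purely illustrative; the actual schema used by the HaCS generator is not reproduced here.

```python
# Hypothetical configuration for one generated malware sample (illustrative only;
# not the actual schema used by the HaCS generator).
sample_config = {
    "behavior": "steal_sms",            # one of the behaviors listed in Figure 5
    "items_per_batch": 400,             # intensity knob: SMSs stolen per atomic action
    "delay_between_actions_ms": 15,     # malware-specific delay between actions
    "trigger": "app_start",             # or "boot", "user_activity", "c2_command"
    "run_services_concurrently": False, # simultaneous vs. sequential services
    "obfuscation": ["reflection", "encryption"],
    "c2": {                             # network-level intensity parameters
        "host": "10.0.0.2",
        "port": 4444,
        "packet_size_bytes": 1024,
        "interpacket_delay_ms": 200,
    },
}
```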

Malware repackaging. Repackaging malware into a baseline app involves disassembling the app (using apktool) and starting with the Manifest.xml file that describes the application components and their interfaces. We insert code into the Main activity to launch the top-level malware service (whose activation trigger can be configured) and add files to the APK that represent the malicious payload. We then reassemble the decompiled app using apktool. If code insertion has been done correctly, apktool produces a new Android app, which must be signed by jarsigner before deployment on a real device. Repackaging an Android app is not hard, which is why repackaging is such a popular way of distributing Android malware.
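A minimal sketch of this flow, assuming apktool and jarsigner are on the PATH, is shown below; the actual payload injection (editing Manifest.xml and the disassembled code to hook the Main activity) is elided.

```python
# Sketch of the repackaging flow described above. The payload-injection step is
# deliberately omitted; only the disassemble / reassemble / sign steps are shown.
import subprocess

def repackage(base_apk, workdir, out_apk, keystore, key_alias):
    # 1. Disassemble the baseline app.
    subprocess.run(["apktool", "d", base_apk, "-o", workdir], check=True)
    # 2. (Insert payload files and hook the Main activity in `workdir` here.)
    # 3. Reassemble the modified app.
    subprocess.run(["apktool", "b", workdir, "-o", out_apk], check=True)
    # 4. Sign the new APK so it can be installed on a real device.
    subprocess.run(["jarsigner", "-keystore", keystore, out_apk, key_alias],
                   check=True)
```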

4 Computational Signatures

The traces we collected from performance counters are multidimensional (one dimension per performance counter), high-frequency, bursty time-series data. It is a challenging task to efficiently and effectively analyze these data, model the normal behavior of the apps, and detect novel patterns resulting from unknown malware.

To capture high-level structural information of long time series that have repetitive but non-periodic patterns, we perform time-frequency analysis on the data and use a state-based Markov model [27] to capture the patterns in the execution traces and construct computational signatures for applications. We also experiment with a bag-of-words-based method similar to [28]. These two models capture the apps' behavior from different aspects: temporal patterns (Markov model) and statistical profile (BoW). For each model, we determine values for all numeric parameters through an extensive search of the parameter space, optimizing for the best detection results.

Our anomaly-detection-based approach for detecting novel malware consists of three major steps: data preprocessing, state-based data representation, and computational signature construction, which we discuss in detail below.

4.1 Data Preprocessing

We observe that the performance counter traces are usually very noisy and often contain large outliers that are orthogonal to our problem. Thus, before constructing the computational signatures for an application, we first remove large outliers from its trace data.

To remove outliers, we independently process each performance counter's time series in an application. For each time series that is available for constructing the model, we use the 99.99th percentile of all values in the trace as the cutoff value, and set all values that exceed the cutoff to the cutoff value. After removing large outliers, we smooth the data using a moving average with a tunable window size smooth_window. Our experiments show that smooth_window = 100 ms provides a good trade-off between removing useless noise and keeping useful detail in the data. After smoothing, we normalize all data to the range [0,1].
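A minimal sketch of this preprocessing step (in Python/NumPy) is shown below; the helper name and the assumption of 1 ms samples are ours, not part of the HaCS implementation.

```python
# Sketch of the per-counter preprocessing: clip outliers at the 99.99th percentile,
# smooth with a 100-sample (~100 ms at 1 ms sampling) moving average, normalize to [0,1].
import numpy as np

def preprocess(counter_trace, smooth_window=100):
    x = np.asarray(counter_trace, dtype=float)
    cutoff = np.percentile(x, 99.99)            # outlier cutoff value
    x = np.minimum(x, cutoff)                   # clip values above the cutoff
    kernel = np.ones(smooth_window) / smooth_window
    x = np.convolve(x, kernel, mode="same")     # moving-average smoothing
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)  # [0,1] normalization
```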

All parameters used in data preprocessing are saved, so that incoming new data can go through exactly the same transformation.

4.2 State Representation of the Data

Due to the noise and transient variations in the time series, we do not construct our models directly from individual raw values of the traces. Instead, we use an approach similar to [27, 28] to derive a state (word) representation of the traces based on the Discrete Wavelet Transform (DWT) and k-means clustering. The state representation of the traces forms the basis for the states used in our Markov model and for the codebook used in our bag-of-words model.

Feature extraction. We divide the long time series traces into short segments and extract a feature vector from each local segment using the Discrete Wavelet Transform (DWT). Specifically, we organize a multidimensional trace into a matrix, with rows denoting measurements over time and columns denoting measurements from different performance counters. We then continuously slide a time window of predefined length, specified by the parameter state_window (typically in the range of 50 ms to 150 ms), along the rows of the matrix to extract a set of local segments (i.e., submatrices), ordered by the starting time of the window. We then perform a 2D DWT on each local segment, and use the approximation wavelet coefficients of the DWT to form a feature vector representing that segment.

The wavelet transform can provide both accurate frequency information at low frequencies and time information at high frequencies, which are important for modeling the execution behavior of the applications. We use a three-level DWT with the order-3 Daubechies wavelet function (db3) to decompose a local segment. We also tried the Haar wavelet function, but did not observe much difference in the detection results. A similar approach has been shown to be very effective in modeling a variety of time series data [29, 28].
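The sketch below illustrates this feature extraction step using PyWavelets; the window and step sizes are illustrative, and with only a handful of counter columns the higher DWT levels will show boundary effects.

```python
# Sketch of sliding-window feature extraction: each local segment (rows = samples
# within a state_window, columns = performance counters) is decomposed with a
# three-level 2D db3 DWT; the approximation coefficients form the feature vector.
import numpy as np
import pywt

def extract_features(trace_matrix, window=100, step=100):
    features = []
    for start in range(0, trace_matrix.shape[0] - window + 1, step):
        segment = trace_matrix[start:start + window, :]
        coeffs = pywt.wavedec2(segment, wavelet="db3", level=3)
        features.append(coeffs[0].ravel())      # keep only approximation coefficients
    return np.vstack(features)                  # one row per local segment
```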

State formation. With the feature vector representation of traces, we use the k-means method to compute m quality states for the traces. Specifically, we perform k-means clustering with m clusters on the feature vectors derived from all traces of an application that are available for training, and use the centroids computed by k-means as the states for the traces. Suppose a group of local segments X = [x_1, x_2, ..., x_n], where x_i ∈ R^d, are the feature vectors extracted from the time series traces. Then we learn from the feature vectors a state codebook S ∈ R^{d×m} with m states, each of which is a d-length vector, the same length as the feature vectors. We only need to learn these states once from the training data, and the same codebook is used for both training and test data.

Note that the current algorithm used for state formation does not guarantee a direct mapping between human-level states (e.g., staying on the same webpage and reading news) and the states in the Markov model, because one high-level state may correspond to a chain of low-level states derived from the app's code and microarchitectural data.

State assignment. Once the state codebook is constructed, a segment of a trace is assigned the state whose vector has minimum distance to the segment's feature vector. Specifically, suppose that a codebook with m entries, S = {s_1, s_2, ..., s_m}, is learned from the training data. A local segment x_i is assigned the c-th codeword such that c* = argmin_j d(s_j, x_i), where d(·,·) is the Euclidean distance function.
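A minimal sketch of codebook learning and state assignment, using scikit-learn's k-means, is shown below; the helper names are ours, and m is simply a parameter here (small for the Markov model, large for the bag-of-words model, as described in the text).

```python
# Sketch of state (codebook) formation via k-means and nearest-centroid assignment.
import numpy as np
from sklearn.cluster import KMeans

def learn_codebook(feature_vectors, m):
    # feature_vectors: (n, d) array of DWT feature vectors from training traces
    km = KMeans(n_clusters=m, n_init=10, random_state=0).fit(feature_vectors)
    return km.cluster_centers_                    # (m, d): one centroid per state

def assign_states(feature_vectors, codebook):
    # c* = argmin_j d(s_j, x_i) with Euclidean distance d
    dists = np.linalg.norm(feature_vectors[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)                   # state index for each segment
```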

4.3 Signatures from Markov Model

Our first approach to constructing computational signatures is based on a first-order Markov model, assuming that the normal execution of an application (approximately) goes through a (limited) number of states, and that the current state depends only on the previous state.

Construction of the Markov model. To capture important state transitions in the application's execution, we train our Markov model with a small number of states, typically between 10 and 20, using application execution traces collected under normal conditions. This is achieved by specifying a small number of clusters when we use k-means to compute states from the traces. The number of states equals the number of clusters, which is determined by the Bayesian Information Criterion (BIC) score [30], similar to previous work [31] on program phase detection.

We then build a first-order Markov model on top of the derived states. For a model with a finite number of states m, the stationary Markov model is defined by an m×m transition probability matrix P = [p_ij] for i, j = 1...m, and an initial probability distribution Q = [q_1 q_2 ... q_m] [32].

Using a sequence of observed state transitions derived from the multidimensional trace data, we estimate the transition matrix P and the initial probability distribution Q empirically from training data using Maximum Likelihood Estimation (MLE) [32]. The transition probability p_ij between states i and j is defined as p_ij = Pr(s_{t+1} = j | s_t = i). The MLE estimate of p_ij is

p̂_ij = n_ij / Σ_{j=1}^{m} n_ij,

where n_ij is the number of observed transitions from state i to state j.

To robustly estimate all probabilities under limited observations, we use heuristics to set several transition probabilities (instead of estimating them from data). For example, we set the self-transition probabilities p_ii to a small value (e.g., 0.2). This setting intuitively captures the bursty usage pattern of a user on an app: the application is either idle for a while or used by a user for a while.
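A sketch of this estimation step is shown below. The additive smoothing term eps and the frequency-based estimate of Q are our own simplifying assumptions for illustration.

```python
# Sketch of MLE estimation of the transition matrix from an observed state sequence,
# with the heuristic self-transition probability described above.
import numpy as np

def estimate_markov_model(state_seq, m, self_prob=0.2, eps=1e-6):
    counts = np.zeros((m, m))
    for a, b in zip(state_seq[:-1], state_seq[1:]):
        counts[a, b] += 1                               # n_ij: observed i -> j transitions
    P = (counts + eps) / (counts + eps).sum(axis=1, keepdims=True)  # smoothed MLE
    # Heuristic from the text: fix p_ii to a small value and rescale each row's
    # off-diagonal mass so the row still sums to 1.
    np.fill_diagonal(P, 0.0)
    P = P / P.sum(axis=1, keepdims=True) * (1.0 - self_prob)
    P += self_prob * np.eye(m)
    Q = np.bincount(state_seq, minlength=m).astype(float)
    Q /= Q.sum()                                        # empirical initial distribution
    return P, Q
```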

Figure 6: Baseline apps that we repackaged with malware. Our goal was to choose complex apps that are used widely and include a mix of compute (games), user-driven (browsers, medical app), and network-centric (radio) apps.

App name          Description     # of Installs   User Actions                                                                 User Time (min)   CPU Time (min)   Inst. Count (Billion)
Amazon            internet store  10M-50M         searched for sporting goods; looked through 25 pages; clicked on 50 items   81.15             32.40            1,914.97
Angry Birds       game            1M-5M           played 9 rounds and completed 7 levels                                       76.97             63.76            1,047.73
CNN               news app        5M-10M          browsed several categories of news and a few articles of each type          58.04             11.60            254.85
Firefox           browser         50M-100M        browsed 20 webpages starting from google.finance                            93.96             45.51            1,464.52
Google Maps       map service     500M-1B         browsed maps of a few cities and opened street views                        56.09             35.38            768.31
Google Translate  translator      500M-1B         translated 30 words, searched history, tried handwriting recognition        59.72             12.12            203.61
Sana MIT Medical  medical app     U/A             completed 5-6 questionnaires                                                 111.41            11.37            145.94
TuneIn Radio      internet radio  50M-100M        switched amongst 6 channels and listened to radio                           78.10             26.17            407.99
Zombie WorldWar   game            1M-5M           played 5 rounds and completed 4 levels                                       91.62             88.40            2,261.99

Detection using the Markov model. With the trained Markov model, we can perform near real-time detection on newly collected traces in an online manner. When performance traces are collected continuously, we take segments of multidimensional trace data of size state_window in sequence. For each segment, we perform the data preprocessing and compute its state representation s. With a sequence of k trace segments, we obtain a sequence of execution states s_1, ..., s_k for the trace. By the Markov property, the probability that this sequence occurs under the given Markov model is computed as follows:

Pr(s_1, ..., s_k) = q_{s_1} · ∏_{t=2}^{k} p_{s_{t-1} s_t}.  (1)

Intuitively, malware usually performs certain executions that a normal application rarely does. In that case, the malware's execution may go through a set of states that the application rarely goes through, so we expect this probability to be low in the presence of malware.

When trace segments continuously arrive over time t, we update the probability over time and obtain a probability curve. To deal with noise in the data and to add a second-order property to the Markov model, we smooth the probability curve using a moving-average smoother. Our intuition is that, instead of using a single probability value as an indicator of suspicious activity, we use the average of neighboring probability values as the indicator of anomalous behavior. Using the smoothed probability, we raise an alarm when more than five consecutive probability values are below a given threshold.
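The following sketch illustrates the online detection logic. It uses a windowed average of per-step log-probabilities as a numerically stable proxy for the product in Eq. (1); the exact windowing, the threshold, and the helper names are our illustrative choices rather than the tuned values used in HaCS.

```python
# Sketch of online detection: score the observed state sequence under the trained
# Markov model, smooth the per-step log-probabilities, and raise an alarm after five
# consecutive values fall below a threshold.
import numpy as np

def detect(state_seq, P, Q, smooth=5, threshold=-8.0, consecutive=5):
    logp = [np.log(Q[state_seq[0]])]
    for prev, cur in zip(state_seq[:-1], state_seq[1:]):
        logp.append(np.log(P[prev, cur]))        # per-step log-probability (cf. Eq. (1))
    curve = np.convolve(logp, np.ones(smooth) / smooth, mode="valid")
    below = 0
    for t, v in enumerate(curve):
        below = below + 1 if v < threshold else 0
        if below >= consecutive:
            return t                             # index at which the alarm is raised
    return None                                  # no anomaly detected
```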

4.4 Signatures from Bag-of-Words Model

Our second approach to constructing computational signatures is based on a bag-of-words model similar to [28]. In a nutshell, this model is very similar to the word representation of a document: a time series trace is like a document that consists of a set of representative words (states). While simple, the bag-of-words model has been successfully applied to a variety of tasks in text mining [33], computer vision [34], and time series analysis [28].

The codebook size m is important for the bag-of-words representation. A compact codebook with too few entries has limited discriminative ability, while a large codebook is likely to introduce noise. Unlike the 10-20 states used in the Markov model, we use 1000 codewords in the bag-of-words model, which balances the trade-off between discrimination and noise well.

It is straightforward to construct the bag-of-words representation of a trace. We perform the same data preprocessing and state representation on the trace as in the Markov model, but with a much larger number of codewords. After each local segment of a trace is assigned a codeword, we ignore the temporal order of local segments and represent the trace by a vector of size m, the codebook size. The vector contains the histogram of codewords in the trace; each entry specifies the count of a codeword occurring in the trace.

Figure 7: (Left plot) PCA analysis of the baseline apps' instruction mix. (3 right plots) Difference among 3 representative baseline applications, obtained with the Markov model by training on one app and testing on the other two apps. Based on the PCA analysis, we can conclude that the baseline apps form a diverse set. The Markov model confirms the behavioral difference observed at the microarchitectural level.

With a set of vector representations of traces from an application's normal execution, we train a one-class SVM model [35] as the computational signature for the application. One-class SVM is a widely used novelty detection algorithm. Intuitively, it aims to find a minimum nonlinear region that encloses most of the normal observations, and classifies observations outside the region as anomalies.
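The sketch below shows the bag-of-words signature construction with scikit-learn's one-class SVM; the nu and gamma settings are illustrative defaults, not the parameters tuned in our experiments.

```python
# Sketch of the bag-of-words signature: a codeword-count histogram per trace, used to
# train a one-class SVM on benign traces only.
import numpy as np
from sklearn.svm import OneClassSVM

def bow_vector(state_seq, m=1000):
    # Raw codeword counts over the trace's state (codeword) sequence.
    return np.bincount(state_seq, minlength=m).astype(float)

def train_signature(benign_state_seqs, m=1000):
    X = np.vstack([bow_vector(s, m) for s in benign_state_seqs])
    return OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X)

# At test time, model.predict(bow_vector(new_seq)[None, :]) returns -1 for a trace
# classified as anomalous (potential malware) and +1 otherwise.
```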

5 Experimental Setup

Figure 6 shows the apps that we selected as targets into which to repackage malware samples, along with overall performance counter statistics for each app. Our main goal was to choose apps that represent popular usage, that require permissions to access the SD card and other information along with internet access, and that cover a mix of compute (games), user-driven (medical app, news), and network-oriented (radio) computational behaviors. Our chosen app set includes native, Android, as well as web-based (web-view) functionality.

Devices. Our experimental setup consists of an Android development board connected to a local machine via USB and a server used for fast data processing and construction of ML models. The local machine uses a wireless router to capture internet traffic generated by the development board. We use a Samsung Exynos 5250 board equipped with a touch screen and a TI OMAP5430 development board, rebooting the board between experiments. The development boards' software is unstable (a board usually worked reliably for 9-10 hours) but sufficed for one long experiment. We carried out all experiments on the Exynos 5250, because some common apps like NYTimes and CNN crashed on the OMAP5430 since it lacks a WiFi module, and we repeated the Angry Birds experiments on the OMAP5430.

Data collection. For each benign application, e.g., Firefox, we created a workload that represents common users' behavior according to statistics available online. For example, when exercising Firefox, we visited popular websites listed on alexa.com. Figure 6 summarizes example user-level sessions for each app.

For each benign app, we collect 6 user-level sessions (each 5-11 min long) and use Android RERAN [36] to record and replay 4 of these sessions with random delays added between recorded actions (while ensuring correct execution of the app). These 10 user-level traces generate 56-111 minutes of performance counter traces across all apps. Each benign app is repackaged with 66 different malware samples. To collect performance counter traces for each repackaged app, we replay one of the app's user-level traces to execute the app, producing a 5-11 minute performance counter trace for each of the 66 malware samples per app.

We treated entire traces produced by runs of the benign apps as benign. From malicious traces, we extracted only the parts within which malware was active and labeled them as malicious. We did not make any assumptions regarding the parts of the trace after the malware completes its activity, and did not use them in our experiments. It is not obvious how malware activity could perturb the rest of the trace; for example, it may cause an extra garbage collection.

Performance counter tracing. We used the ARM DS-5 v5.15 framework equipped with the Streamline profiler as a non-intrusive way of observing performance counters and matching them with OS-level activities, e.g., context switches, thread blocking events, etc. DS-5 is officially supported by ARM and works with multiple ARM-based development boards. DS-5 Streamline reads data every millisecond or on every context switch, and is thus able to ascribe performance events to individual processes and even threads. To reduce the time required for data processing within the DS-5 Streamline framework, we transferred data to the local server. One major shortcoming of DS-5 Streamline is that per-process data extraction can be done only through its GUI; we automated this process using the JitBit [37] UI automation tool.

Choice of performance counters. We used process-specific counters that record the dynamic instruction mix (loads, stores, integer instructions, immediate and indirect branches) in addition to the total number of mispredicted branches. We checked the validity of the performance counter readings obtained via DS-5 Streamline with specially crafted C programs, which we compiled and ran natively on the boards. We collected counter information on a per-process basis because matching programmer-visible threads to Linux-level threads requires instrumenting the Android middleware (i.e., is non-trivial), and per-application counters yielded reasonable detection rates. We leave exploring the optimal set of performance counters for future work.


Figure 8: (Markov model) The accuracy of detecting real malware repackaged into baseline apps, and of the Geinimi malware (which comes repackaged into MonkeyJump). Detection accuracy is reported as the number of true positives versus the number of false positives computed at the phase-window scale.

Figure 9: (Markov model) Angry Birds with click fraud operating at three (increasing) intensities. A low dark line indicates that the signature does not match the expected benign signature.

Training and testing data. For each baseline application, we train the Markov model on 80% of the benign traces and use the remaining 20% of benign traces and all malicious traces for testing. As opposed to supervised learning, using the bag-of-words model for anomaly/outlier detection requires that we set the approximate false positive rate and use all the benign traces for training. We then use all the malicious traces for testing and report the true positive rate.

Ensuring correct execution. Individual tests were initiated by the local computer directly connected to the board. We used both automatic and manual checks to ensure that the malicious payload was executed on the board for each trace. When synthetic malware started or stopped execution, it printed a message to a console that was accessed by the local computer. The local computer also wrote a log of all initiated experiments, their parameters, and the responses obtained from the board. Malware communicated with a C&C server that was running on a local computer and accepted incoming connections. In the case of the synthetic malware, we used the Hercules_3-2-6 TCP server, which automatically accepts TCP connections.

For real malware experiments, we developed our own HTTP server that supported the custom duplex protocols implemented by malware developers and used encryption where necessary. If we allowed malware to communicate with its original server, which was not under our control, we captured the network traffic going through the router. Both synthetic and real malware were instrumented to send messages to the local computer not only via the console, but also to DS-5 Streamline.

6 Experimental Results

We perform an extensive evaluation of both the Markov model and the Bag-of-Words (BoW) model. Due to space limitations, we mainly report our findings with the Markov model, briefly summarize the results of the BoW model, and at the end present a boosting algorithm that efficiently combines our classifiers to improve detection accuracy. We tune the parameters for the Markov model on a per-app basis in all experiments throughout the paper.

Figure 10: (Markov model) Phase-level true positives versus false positives: detection accuracy for synthetic malware.

6.1 Diversity of Baseline Apps

We first investigated whether the nine baseline applications that we chose are sufficiently diverse from each other. To quantitatively analyze the similarity of the apps, we applied Principal Component Analysis (PCA) to the instruction mix to visualize the distribution of the apps in a two-dimensional space, and we compared the apps' behavior in the feature space using the Markov model (Fig. 7). The PCA method projects the apps' instruction mix onto the first two principal components, which retain 90.77% of the variability in the data. We see that the apps are evenly scattered over the region and do not form clusters. This means that the programs we have chosen as underlying benign apps are diverse with respect to their low-level characteristics.

The second method compares apps at a higher level than the PCA analysis does. Due to space limitations, we present comparison results for only three apps, AngryBirds, Sana, and TuneInRadio (Fig. 7), because they represent the three broad categories, namely computationally intensive, event-driven, and network-intensive apps on the Android platform.

When comparing pairs of apps, we treated one of them as the baseline and the remaining two apps as novel ones. Thus, we used the model built for the baseline app to measure how much the dynamic behavior of the other apps diverges from that of the baseline app. Results are reported in Fig. 7 with standard axes: percentage of true positives versus percentage of false positives in terms of phase-level windows, which is the same as the state_window we defined in Section 4.2. The number of true positives characterizes how much the traces of the novel apps differ from the traces of the selected baseline app. The resulting step-like curves highlight the inherent dissimilarity between the apps and justify our choice of representative apps for the test suite. We also see that the dissimilarity measured in this way is not symmetric; this comes from the probabilistic nature of the Markov models and noise in the input data.


Figure 11: (Markov model) Detection accuracy of synthetic malware, reported in terms of discovered malicious APKs (TP axis) versus the number of false positives measured over 30-sec time windows (FP axis).

Figure 12: (Markov model) Time to detection and signature sizes for each baseline app.

App name        Time to detect (ms)           Signature size (bytes)
                Min     Max     Median
Amazon          1920    2400    1920          704
AngryBirds      1280    1840    1280          2680
CNN             2400    3150    2400          8968
Firefox         1440    1800    1440          5124
G. Maps         2400    3300    2400          1668
G. Translate    2560    4000    2560          8708
Sana            3200    4400    3200          8256
TuneInRadio     2560    3680    2560          2948
ZombieWW        1920    2880    1920          708

Interestingly, this experiment also shows that it is relatively easy to differentiate apps using their counter traces. Differentiating repackaged apps from their own baseline app, however, is a much more challenging proposition (as the false positive results for repackaged malware demonstrate).

6.2 Detecting Existing Malware

We evaluated HaCS using representative instances of real malware chosen from three malware categories: information stealers (Geinimi.a, Roidsec), network-intensive malware (click fraud), and computationally intensive malware (md5 cracker). We obtained an instance of Geinimi.a that was already repackaged into the game MonkeyJump2. We reverse-engineered all other malware samples and embedded them into Angry Birds, Sana, and TuneIn Radio. Md5 cracker was not originally positioned as malware; it is freely available through the Google Play market. In total, we conducted 10 experiments with real malware samples (Fig. 8), which demonstrate the capability of HaCS to detect different categories of malware embedded in inherently different types of apps.

Comparing click fraud's detectability to that of the other malware types, we see that click fraud has a higher chance of being detected within the network-light apps (Angry Birds and Sana). However, we notice that the detectability growth rate of click fraud is very low in the interval [3%, 10%]. Thus, we think that it can partially hide itself within the network-intensive app (TuneIn Radio).

The other malware, Roidsec, demonstrates the worst detectability in all three experiments because its activity bursts are separated by dormancy intervals that are longer than the activity intervals. In spite of stealing many types of sensitive data, it exhibits computationally intensive behavior in only two cases: stealing contacts and stealing text messages. In the other cases, Roidsec accesses private data without incurring noticeable overhead (e.g., Wi-Fi status, ROM/SD-card info, etc.) and uploads it to the C&C server. After averaging the detectability of the individual malicious operations carried out by Roidsec, we obtained the lowest detectability on average.

The detection rate of the third malware sample, md5 cracker, mostly lies between the detectability of click fraud and Roidsec when considering a reasonable false-positive interval [0%, 15%]. This is because it is more computationally intensive than Roidsec, but not as much as click fraud. Md5 cracker slowly brute-forces md5 hashes using just a single thread. Due to this inefficiency, it does not impact the traces of AngryBirds and Sana. However, md5 cracker demonstrates good detectability in the case of TuneInRadio, which can be explained by the regular structure of TuneInRadio's traces. TuneInRadio activity occurs within evenly spaced bursts that resemble the positive part of the sin(x) function, likely a result of buffering data and processing it in chunks. Between processing chunks of the incoming data, TuneInRadio's activity reaches very low values. When md5 cracker overlaps TuneInRadio, it significantly changes the structure of TuneInRadio's trace.

The most complex sample in our dataset is Geinimi.a; it functions mainly as an infostealer, but also includes other functionality (e.g., the ability to install third-party apps) and thus opens the door to other attacks. We conducted two series of experiments for the Geinimi.a sample that comes repackaged into the MonkeyJump2 game: MonkeyJump2 with its malicious payload enabled and with it disabled. To disable Geinimi.a, we reverse-engineered the APK, modified Geinimi.a's byte-code, and then recompiled the app. Due to the computationally intensive behavior of the MonkeyJump2 game, we did not achieve a 100% detection rate as we did with the other malware samples (Fig. 8). Our results for the Geinimi.a malware are similar to the results for AngryBirds with embedded Roidsec: neither allows a high level of detectability.



App Name            Undetected Malware Payload    Delay
AngryBirds          1 photo                       0ms
                    25, 70 contacts               0ms
                    200, 400 SMSs                 0ms
                    Phone ID and GPS              0ms, 200ms
                    SHA1 10K                      0ms, 20ms
                    SHA1 500k, 1.5M               0ms
Sana MIT Medical    1 photo                       0ms, 5000ms
                    25 contacts                   0ms, 10ms
                    Phone ID and GPS              0ms, 200ms
                    SHA1 10K                      0ms, 40ms
TuneIn Radio        SHA1 10K                      0ms

Figure 13: (Markov model) Malware samples repackaged in AngryBirds, Sana, and TuneInRadio that remain undetected by our pipeline. All remaining synthetic malware repackaged into these 3 apps can be detected reliably.

                    Amazon        AngryBirds    G. Translate  Sana          TuneInRadio
Malware             %     TTD     %     TTD     %     TTD     %     TTD     %     TTD
False Positives     18    3243    21    3075    29    2387    19    4453    38    3121
True Positives      78    6494    100   6790    90    7362    80    6869    90    6642
Pictures            79    930     100   997     87    1148    73    924     94    893
Contacts            93    716     100   748     100   897     100   860     97    825
SMS                 85    923     100   1107    100   1373    90    1200    88    1193
IDs+GPS             0     2       50    2       50    2       50    2       50    2
ClickFraud          81    1145    100   1077    91    1100    82    1022    82    1062
DDoS                58    1114    100   1118    57    1120    35    1122    83    1056
md5                 80    1665    100   1742    100   1723    95    1740    96    1612

Figure 14: (Bag-of-words) Detection accuracy, time to detection (TTD), and false positive rates for synthetic malware.

6.3 Detecting Synthetic Malware

We could not find existing malware samples on the Android platform that allow fine-tuning the intensity of their activities. To conduct experiments with malicious activities at different intensity levels, we generated 66 configuration files that specify different parameter settings for seven types of malware to control their intensity (Fig. 5). We then conducted 66 experiments with each of the 9 benign apps in our evaluation (Fig. 6).
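To make this parameterization concrete, the following Python sketch shows what one such intensity configuration and its enumeration might look like. The field names, values, and helper function are illustrative assumptions, not HaCS's actual configuration format.

    # Hypothetical intensity configuration for one synthetic malware sample.
    # Field names and values are illustrative; HaCS's real format may differ.
    sms_stealer_config = {
        "payload": "steal_sms",           # one of the seven synthetic payload types
        "items_per_burst": 25,            # e.g., number of SMSs exfiltrated per burst
        "inter_burst_delay_ms": 200,      # throttling delay between bursts (cf. Fig. 13)
        "repackaged_into": "AngryBirds",  # baseline app hosting the payload
    }

    def expand_configs(payload, item_counts, delays, host_apps):
        """Enumerate one configuration per (count, delay, host) combination."""
        return [
            {"payload": payload, "items_per_burst": n,
             "inter_burst_delay_ms": d, "repackaged_into": app}
            for n in item_counts for d in delays for app in host_apps
        ]

    # Example: sweep SMS-stealing intensity across two host apps.
    configs = expand_configs("steal_sms", [25, 200, 400], [0, 200],
                             ["AngryBirds", "Sana"])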

We summarize the hard-to-detect malware samples in Fig. 13. Only 20 samples out of 66 × 3 = 198 remain undetected, and the combination of payload and baseline app offers insight into what can and cannot be detected. For instance, Angry Birds is a computationally intensive game that allows malware to steal 400 SMSs at a time, whereas TuneIn Radio is much more sensitive: even stealing a single photo or 25 SMSs triggers a true positive.

Overall, we find that the current approach works best for apps with predictable (simple or regular) behavior. The top two curves correspond to Amazon and TuneInRadio. Unlike the other apps, Amazon does not exhibit complex dynamic behavior. TuneInRadio's behavior is complex, but it is highly regular and resembles the function f(x) = |sin(x)|, so even small perturbations can be detected very well. Although one might expect CNN to be conceptually similar to Amazon, we do not achieve detection as good for CNN as for Amazon, because CNN's dynamic behavior varies considerably: CNN can present text news, show pictures, play videos, and display static and animated ads. The worst results come from the game Zombie WorldWar; it is a computationally intensive application whose dynamic behavior is hard to learn.

Figure 15: (Markov model) Effect of obfuscation and encryption on the detection rate: interestingly, malware becomes more distinct from the baseline benign app.

The detection rates of the remaining apps are grouped together around Sana's curve. One might expect Sana to leave a highly predictable signature due to the simplicity of its user interface, but this is not the case: user events cause short, sharp spikes in the traces that the Markov model cannot reliably capture. For most applications, the Markov model can reliably detect from 10% to 50% of malicious phases while staying under 1% false positives. The two outliers are Amazon and Google Maps; the former has very high detectability, while the detectability of the latter grows slowly until the 4% false positive mark and then increases more quickly.
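The Markov model itself is defined earlier in the paper; purely as an illustration of Markov-style anomaly scoring over counter traces, the sketch below learns a transition matrix from quantized benign counter readings and scores each phase window by its average transition log-likelihood, flagging windows that fall below a threshold chosen for a target false positive rate. The quantization scheme, smoothing, and function names are assumptions for illustration only.

    import numpy as np

    def fit_quantizer(benign_traces, n_bins=16):
        """Learn bin edges from benign counter traces (illustrative quantization)."""
        all_vals = np.concatenate(benign_traces)
        return np.quantile(all_vals, np.linspace(0, 1, n_bins + 1)[1:-1])

    def quantize(trace, edges):
        """Map a 1-D counter trace to discrete states via the learned bin edges."""
        return np.digitize(trace, edges)

    def fit_transition_matrix(benign_traces, edges, alpha=1.0):
        """Laplace-smoothed state-transition probabilities estimated from benign runs."""
        n = len(edges) + 1
        counts = np.full((n, n), alpha)
        for trace in benign_traces:
            s = quantize(trace, edges)
            for a, b in zip(s[:-1], s[1:]):
                counts[a, b] += 1
        return counts / counts.sum(axis=1, keepdims=True)

    def window_score(window, edges, P):
        """Mean transition log-likelihood of one phase window under the benign model."""
        s = quantize(window, edges)
        return float(np.mean([np.log(P[a, b]) for a, b in zip(s[:-1], s[1:])]))

    def is_anomalous(window, edges, P, threshold):
        """Flag a window when its likelihood under the benign model is too low;
        the threshold is picked on held-out benign windows for a target FP rate."""
        return window_score(window, edges, P) < threshold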

We also experimented with the transition from near real-time detection to a coarser-grained time scale. Fig. 11 shows the apk-level true positive rate (the number of detected malicious apks over the total number of malicious apks) versus the false positive rate computed over 30-second time windows. The detection to false positive ratio is much better in this case than when we measure detection in terms of phase windows (Fig. 10), meaning that detection improves given a longer time scale. Furthermore, false positives tend to cluster together, so the false positive rate is also better with 30-second windows than at the phase level.
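A minimal sketch of this coarser-grained decision is shown below, assuming phase-level flags are grouped into 30-second windows and an apk run is declared malicious if any of its windows is flagged; the paper does not spell out its exact aggregation rule, so this is illustrative only.

    def coarsen(phase_flags, phases_per_window):
        """Collapse phase-level anomaly flags into 30-second windows: a window
        is flagged if any phase inside it is flagged (illustrative rule)."""
        return [any(phase_flags[i:i + phases_per_window])
                for i in range(0, len(phase_flags), phases_per_window)]

    def apk_true_positive_rate(per_apk_window_flags):
        """Fraction of malicious apk runs with at least one flagged window."""
        detected = sum(1 for flags in per_apk_window_flags if any(flags))
        return detected / len(per_apk_window_flags)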

6.4 Reflection

Reflection is a powerful technique for writing malicious code that is meant to escape static code analyzers. Java methods invoked via reflection are resolved at run-time, making it hard for static code analysis to understand the program's semantics. At the same time, reflection alone is not enough to protect against static analysis of the code: it must be accompanied by encryption of strings, otherwise the invoked method, or the set of possible methods, can be resolved statically. We augmented our synthetic malware with reflection and encryption in a way similar to Geinimi.a's implementation. Static analysis of our synthetic code does not reveal any API methods that might allow static tools to raise alarms.



We checked this using the virustotal.com online service, which ran 38 antivirus engines on our apk; none raised a warning. Fig. 15 shows the result of running our malware detector on the 66 synthetic malware samples augmented with reflection and encryption and embedded into AngryBirds, Sana, and TuneInRadio. For AngryBirds and Sana, the detection rate of malware that uses both reflection and encryption is significantly higher, because reflection and encryption are computationally intensive and disturb the trace of the benign app more than the same malware does without them. We do not see exactly the same trend for TuneInRadio because its baseline detection rate is already quite high, so the TuneInRadio results stay within the error range. We conclude that reflection and the accompanying encryption increase malware detectability in the average case.

6.5 Results for the Bag-of-Words Model

We performed an extensive evaluation of the Bag-of-Words model and report here its capability in detecting synthetic malware embedded in Android apps. Unlike the online detection of the Markov model, the Bag-of-Words model works with a predefined Time-to-Detection (TTD) window. We studied a variety of TTDs and report in Fig. 14 the results for TTD = 1500 ms.
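The feature pipeline is described earlier in the paper; as a rough illustration of the general bag-of-words idea, the sketch below clusters per-sample counter vectors from benign traces into a codebook of "words" and turns each TTD window into a normalized word histogram. The use of k-means and the number of words are illustrative assumptions.

    import numpy as np
    from sklearn.cluster import KMeans

    def fit_codebook(benign_windows, n_words=64, seed=0):
        """Cluster counter-sample vectors from benign traces into a 'word' codebook."""
        samples = np.vstack(benign_windows)  # rows: per-sample counter vectors
        return KMeans(n_clusters=n_words, random_state=seed, n_init=10).fit(samples)

    def bag_of_words(window, codebook):
        """Histogram of word occurrences within one TTD window (e.g., 1500 ms)."""
        words = codebook.predict(window)
        hist = np.bincount(words, minlength=codebook.n_clusters)
        return hist / max(hist.sum(), 1)     # normalize to word frequencies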

We use a Gaussian kernel with bandwidth g for the one-class SVM and set its ν parameter so that it yields around 20% false positives. We use a grid search to find the g and ν that give the best true positive rate on malicious activities while keeping false positives around 20%.
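A minimal sketch of this selection procedure using scikit-learn's OneClassSVM is shown below; the RBF kernel's gamma plays the role of the bandwidth g. The grid values, data splits, and helper name are illustrative assumptions.

    import numpy as np
    from sklearn.svm import OneClassSVM

    def select_ocsvm(benign_train, benign_holdout, malicious_holdout,
                     gammas=(1e-3, 1e-2, 1e-1, 1.0),
                     nus=(0.05, 0.1, 0.2, 0.3),
                     fp_target=0.20):
        """Grid-search gamma (kernel bandwidth) and nu, keeping the model with
        the best true positive rate among settings whose false positive rate
        stays at or below the ~20% target."""
        best, best_tpr = None, -1.0
        for g in gammas:
            for nu in nus:
                model = OneClassSVM(kernel="rbf", gamma=g, nu=nu).fit(benign_train)
                fpr = np.mean(model.predict(benign_holdout) == -1)   # -1 = flagged
                tpr = np.mean(model.predict(malicious_holdout) == -1)
                if fpr <= fp_target and tpr > best_tpr:
                    best, best_tpr = model, tpr
        return best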

From Fig. 14, we see that our method 1) achieves a surprisingly high true positive rate for both the AngryBirds (99.9%) and Google Translate (92.4%) apps at around 20% false positives, and 2) achieves around an 80% true positive rate for both Amazon and Sana at around 20% false positives. We could not control the error rate for TuneInRadio; nevertheless, our method achieves a 90% true positive rate at a 38% false positive rate for the TuneInRadio app.

7 Related Work

Android Security. Zhou et al. [14] present a comprehensive survey of Android malware. The authors analyze and classify more than 1,200 Android malware samples using a variety of criteria and estimate that 86% of the samples they analyzed were repackaged. A straightforward approach to detecting repackaged malware is to develop an algorithm that performs pairwise comparison of apps to find the original app. DNADroid [38] extracts a program dependency graph (PDG) from each Android app and compares apps in terms of their PDGs. DroidMOSS [39] applies fuzzy hashing to the app comparison problem. Juxtapp [40] uses k-grams of opcode sequences of Android apps together with a feature-hashing approach. One disadvantage of such algorithms is the lack of scalability. PiggyApp [41] proposes a decoupling technique to separate primary and piggybacked modules, feature fingerprints to extract semantic features, and a linear arithmetic search algorithm to deal with scalability issues.

Another set of security tools analyzes the semantics of Android apps through their dynamic behavior. DroidScope [42] reconstructs OS-level and Java-level semantics of Android apps and serves as a platform for developing high-level security tools. Other tools enforce security policies within Android apps at runtime, e.g., TaintDroid [8]. Aurasium [43] repackages existing apps by attaching user-level sandboxing and policy-enforcement code. AppFence [44] substitutes shadow data for user-sensitive data and blocks network transmissions that contain data intended for on-device use only.

Triceratops [45] proposes hardening Android apps by splitting an app into verified and unverified regions through conservative static analysis and then patching the unverified region with code derived from the supplied security policy; the inserted code dynamically checks whether the app complies with that policy. AppInk [46] uses dynamic graph-based watermarking that inserts a transparent watermark into an app and generates a manifest app that can recognize the watermark and detect code modification. The Stowaway [17] tool maps API calls to requested permissions and thus detects overprivileged apps.

Hardware computational signals. The paper most relevant to our project is the work by Demme et al. [12]. Their malware-signature detection technique is best suited to detecting known malware samples and complements our anomaly-based approach that models benign apps. Our evaluation methodology has a very different focus: instead of executing off-the-shelf malware samples, we stress-test our anomaly detector by a) using a diverse set of extant and synthetic malware whose payloads are confirmed to execute correctly, so that we know which payloads are detectable, and b) comparing repackaged malware executions with those of their baseline apps to evaluate false positives. Power traces are a complementary hardware computational signal [47, 48, 49] used to analyze the energy consumption of mobile devices and medical software and to detect malicious activity.

8 Conclusions

Mobile malware often comes repackaged inside popular baseline apps or arrives through compromises of vulnerable apps. In this paper, we evaluate the potential of hardware computational signals to detect new mobile malware (i.e., malware for which no signatures yet exist). We propose HaCS, a system that executes a diverse set of benign apps and repackaged malware and enables evaluation of new detection schemes. In particular, we evaluate an anomaly detector that learns baseline apps' computational signatures and detects malware as anomalies. We conduct a thorough study and find that even complex apps have learnable signatures, and that events that are relatively minor at the app level, like stealing a few tens of photos, create distinctive anomalies in the app's measured signature. HaCS's evaluation also shows that false positives are a concern, motivating future work on integrating OS- and middleware-based signals.



References

[1] Kaspersky security bulletin 2013. http://media.kaspersky.com/pdf/KSB_2013_EN.pdf.
[2] RiskIQ press release. http://www.riskiq.com/company/press-releases/riskiq-reports-malicious-mobile-apps-google-play-have-spiked-nearly-400.
[3] Cisco annual security report 2014. http://www.cisco.com/web/offers/lp/2014-annual-security-report/index.html.
[4] TrendLabs: Behind the Android menace. http://blog.trendmicro.com/trendlabs-security-intelligence/infographic-behind-the-android-menace-malicious-apps.
[5] TrendLabs: A look at Google Bouncer. http://blog.trendmicro.com/trendlabs-security-intelligence/a-look-at-google-bouncer.
[6] Android is almost impenetrable to malware. http://qz.com/131436/contrary-to-what-youve-heard-android-is-almost-impenetrable-to-malware.
[7] Dissecting Android's Bouncer. https://www.duosecurity.com/blog/dissecting-androids-bouncer.
[8] William Enck, Peter Gilbert, Byung-Gon Chun, Landon P. Cox, Jaeyeon Jung, Patrick McDaniel, and Anmol N. Sheth. TaintDroid: An information-flow tracking system for realtime privacy monitoring on smartphones. 2010.
[9] Erika Chin and David Wagner. Bifocals: Analyzing WebView vulnerabilities in Android applications.
[10] Vulnerable & aggressive adware. http://www.fireeye.com/blog/technical/2013/10/ad-vulna-a-vulnaggressive-vulnerable-aggressive-adware-threatening-millions.html.
[11] ARM TrustZone. http://www.arm.com/products/processors/technologies/trustzone/index.php.
[12] J. Demme, M. Maycock, J. Schmitz, A. Tang, A. Waksman, S. Sethumadhavan, and S. Stolfo. On the feasibility of online malware detection with performance counters. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13), pages 559–570, 2013.
[13] Malware life-span. http://www.fireeye.com/blog/corporate/2014/05/ghost-hunting-with-anti-virus.html.
[14] Y. Zhou and X. Jiang. Dissecting Android malware: Characterization and evolution. In Proceedings of the 2012 IEEE Symposium on Security and Privacy (SP '12), pages 95–109, 2012.
[15] Mobile malware database. http://contagiominidump.blogspot.com.
[16] Malware database. http://virusshare.com.
[17] Adrienne Porter Felt, Erika Chin, Steve Hanna, Dawn Song, and David Wagner. Android permissions demystified. pages 627–638, 2011.
[18] Franziska Roesner, Tadayoshi Kohno, Alexander Moshchuk, Bryan Parno, Helen J. Wang, and Crispin Cowan. User-driven access control: Rethinking permission granting in modern operating systems. In Proceedings of the 2012 IEEE Symposium on Security and Privacy (SP '12), pages 224–238, 2012.
[19] Android Malware Genome Project. http://www.malgenomeproject.org.
[20] Adrian Tang, Simha Sethumadhavan, and Salvatore J. Stolfo. Unsupervised anomaly-based malware detection using hardware features. 2014.
[21] Malware database. http://malware.lu.
[22] Universal Android rooting procedure (rage method). http://theunlockr.com/2010/10/26/universal-android-rooting-procedure-rage-method/.
[23] GingerBreak apk root. http://droidmodderx.com/gingerbreak-apk-root-your-gingerbread-device.
[24] Exploid. http://forum.xda-developers.com/showthread.php?t=739874.
[25] Mobile bitcoin miner. https://blog.lookout.com/blog/2014/04/24/badlepricon-bitcoin.
[26] Mikhail Kazdagli, Ling Huang, Vijay Reddi, and Mohit Tiwari. Morpheus: Benchmarking computational diversity in mobile malware. In Workshop on Hardware and Architectural Support for Security and Privacy, 2014.
[27] T. Huffmire and T. Sherwood. Wavelet-based phase classification. In Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2006.
[28] Jin Wang, Ping Liu, Mary F. H. She, and Saeid Nahavandi. Bag-of-words representation for biomedical time series classification. Biomedical Signal Processing and Control, 8(6), 2013.
[29] Inan Guler and Elif Derya Ubeyli. ECG beat classifier designed by combined neural network model. Pattern Recognition, 38(2), 2005.
[30] Dan Pelleg and Andrew W. Moore. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning, 2000.
[31] Timothy Sherwood, Erez Perelman, Greg Hamerly, and Brad Calder. Automatically characterizing large scale program behavior. SIGOPS Oper. Syst. Rev.
[32] Probability and Random Processes. Oxford University Press, 1992.
[33] D. Sculley and G. M. Wachman. Relaxed online SVMs for spam filtering. In SIGIR, 2007.
[34] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
[35] Pattern Recognition and Machine Learning. Springer, 2006.
[36] Record and replay for Android. http://www.androidreran.com.
[37] Jitbit macro recorder. http://www.jitbit.com/.
[38] Jonathan Crussell, Clint Gibler, and Hao Chen. Attack of the clones: Detecting cloned applications on Android markets. In Computer Security – ESORICS 2012, volume 7459 of Lecture Notes in Computer Science, pages 37–54. Springer, 2012.
[39] Wu Zhou, Yajin Zhou, Xuxian Jiang, and Peng Ning. Detecting repackaged smartphone applications in third-party Android marketplaces. In Proceedings of the Second ACM Conference on Data and Application Security and Privacy (CODASPY '12), pages 317–326, 2012.
[40] Steve Hanna, Ling Huang, Edward Wu, Saung Li, Charles Chen, and Dawn Song. Juxtapp: A scalable system for detecting code reuse among Android applications. In Detection of Intrusions and Malware, and Vulnerability Assessment, volume 7591 of Lecture Notes in Computer Science, pages 62–81. Springer, 2013.
[41] Wu Zhou, Yajin Zhou, Michael Grace, Xuxian Jiang, and Shihong Zou. Fast, scalable detection of "piggybacked" mobile applications. In Proceedings of the Third ACM Conference on Data and Application Security and Privacy (CODASPY '13), pages 185–196, 2013.
[42] Lok Kwong Yan and Heng Yin. DroidScope: Seamlessly reconstructing the OS and Dalvik semantic views for dynamic Android malware analysis. 2012.
[43] Rubin Xu, Hassen Saïdi, and Ross Anderson. Aurasium: Practical policy enforcement for Android applications. 2012.
[44] Peter Hornyack, Seungyeop Han, Jaeyeon Jung, Stuart Schechter, and David Wetherall. These aren't the droids you're looking for: Retrofitting Android to protect data from imperious applications. pages 639–652, 2011.
[45] Ravi Bhoraskar and Edward Xuejun Wu. Triceratops: Securing mobile apps.
[46] Wu Zhou, Xinwen Zhang, and Xuxian Jiang. AppInk: Watermarking Android apps for repackaging deterrence. In Proceedings of the 8th ACM SIGSAC Symposium on Information, Computer and Communications Security (ASIA CCS '13), pages 1–12, 2013.
[47] H. Kim, J. Smith, and K. Shin. Detecting energy-greedy anomalies and mobile malware variants. In Proceedings of the 6th International Conference on Mobile Systems, Applications, and Services (MobiSys '08), pages 239–252, 2008.
[48] L. Liu, G. Yan, X. Zhang, and S. Chen. VirusMeter: Preventing your cellphone from spies. In Proceedings of the 12th International Symposium on Recent Advances in Intrusion Detection (RAID '09), pages 244–264, 2009.
[49] S. Clark, B. Ransford, A. Rahmati, S. Guineau, J. Sorber, K. Fu, and Wenyuan Xu. WattsUpDoc: Power side channels to nonintrusively discover untargeted malware on embedded medical devices. In Proceedings of the USENIX Workshop on Health Information Technologies, 2013.
