
This paper is included in the Proceedings of the 22nd USENIX Security Symposium, August 14–16, 2013, Washington, D.C., USA. ISBN 978-1-931971-03-4. Open access to the Proceedings of the 22nd USENIX Security Symposium is sponsored by USENIX.


WHYPER: Towards Automating Risk Assessment of Mobile Applications

Rahul Pandita, Xusheng Xiao, Wei Yang, William Enck, and Tao Xie
North Carolina State University, Raleigh, NC, USA

{rpandit, xxiao2, wei.yang}@ncsu.edu {enck, xie}@csc.ncsu.edu

Abstract

Application markets such as Apple's App Store and Google's Play Store have played an important role in the popularity of smartphones and mobile devices. However, keeping malware out of application markets is an ongoing challenge. While recent work has developed various techniques to determine what applications do, no work has provided a technical approach to answer, what do users expect? In this paper, we present the first step in addressing this challenge. Specifically, we focus on permissions for a given application and examine whether the application description provides any indication for why the application needs a permission. We present WHYPER, a framework using Natural Language Processing (NLP) techniques to identify sentences that describe the need for a given permission in an application description. WHYPER achieves an average precision of 82.8% and an average recall of 81.5% for three permissions (address book, calendar, and record audio) that protect frequently-used security and privacy sensitive resources. These results demonstrate great promise in using NLP techniques to bridge the semantic gap between user expectations and application functionality, further aiding the risk assessment of mobile applications.

1 Introduction

Application markets such as Apple's App Store and Google's Play Store have become the de facto mechanism of delivering software to consumer smartphones and mobile devices. Markets have enabled a vibrant software ecosystem that benefits both consumers and developers. Markets provide a central location for users to discover, purchase, download, and install software with only a few clicks within on-device market interfaces. Simultaneously, they also provide a mechanism for developers to advertise, sell, and distribute their applications. Unfortunately, these characteristics also provide an easy distribution mechanism for developers with malicious intent to distribute malware.

To address market-security issues, the two predominant smartphone platforms (Apple and Google) use starkly contrasting approaches. On one hand, Apple forces all applications submitted to its App Store to undergo some level of manual inspection and analysis before they are published. This manual intervention allows an Apple employee to read an application's description and determine whether the different information and resources used by the application are appropriate. On the other hand, Google performs no such checking before publishing an application. While Bouncer [1] provides static and dynamic malware analysis of published applications, Google primarily relies on permissions for security. Application developers must request permissions to access security and privacy sensitive information and resources. This permission list is presented to the user at the time of installation with the implicit assumption that the user is able to determine whether the listed permissions are appropriate.

However, it is non-trivial to classify an application as malicious, privacy infringing, or benign. Previous work has looked at permissions [2–5], code [6–10], and runtime behavior [11–13]. However, underlying all of this work is a caveat: what does the user expect? Clearly, an application such as a GPS Tracker is expected to record and send the phone's geographic location to the network; an application such as a Phone-Call Recorder is expected to record audio during a phone call; and an application such as One-Click Root is expected to exploit a privilege-escalation vulnerability. Other cases are more subtle. The Apple and Google approaches fundamentally differ in who determines whether an application's permission, code, or runtime behavior is appropriate. For Apple, it is an employee; for Google, it is the end user.

We are motivated by the vision of bridging the semantic gap between what the user expects an application to do and what it actually does. This work is a first step in this direction. Specifically, we focus on permissions and ask the question, does the application description provide any indication for the application's use of a permission? Clearly, this hypothesis will work better for some permissions than others. For example, permissions that protect a user-understandable resource such as the address book, calendar, or microphone should be discussed in the application description. However, other low-level system permissions such as accessing network state and controlling vibration are not likely to be mentioned. We note that while this work primarily focuses on permissions in the Android platform and relieving the strain on end users, it is equally applicable to other platforms (e.g., Apple) by aiding the employee performing manual inspection.

With this vision, in this paper, we present WHYPER, a framework that uses Natural Language Processing (NLP) techniques to determine why an application uses a permission. WHYPER takes as input an application's description from the market and a semantic model of a permission, and determines which sentence (if any) in the description indicates the use of the permission. Furthermore, we show that for some permissions, the permission semantic model can be automatically generated from platform API documents. We evaluate WHYPER against three popularly-used permissions (address book, calendar, and record audio) and a dataset of 581 popular applications. These three frequently-used permissions protect security and privacy sensitive resources. Our results demonstrate that WHYPER effectively identifies the sentences that describe the needs for permissions with an average precision of 82.8% and an average recall of 81.5%. We further investigate the sources of inaccuracies and discuss techniques of improvement.

This paper makes the following main contributions:

• We propose the use of NLP techniques to help bridge the semantic gap between what mobile applications do and what users expect them to do. To the best of our knowledge, this work is the first attempt to automate this inference.

• We evaluate our framework on 581 popular Android application descriptions containing nearly 10,000 natural-language sentences. Our evaluation demonstrates substantial improvement over basic keyword-based searching.

• We provide a publicly available prototype implementation of our approach on the project website [14].

WHYPER is merely the first step in bridging the semantic gap of user expectations. There are many ways in which we see this work developing. Application descriptions are only one form of input. We foresee the possibility of also incorporating the application name, user reviews, and potentially even screenshots. Furthermore, permissions could potentially be replaced with specific API calls, or even the results of dynamic analysis. We also see great potential in developing automatic or partially manual techniques for creating and fine-tuning permission semantic models.

Finally, this work dovetails nicely with recent discourse concerning the appropriateness of Android permissions to protect user security [15–18]. The overwhelming evidence indicates that most users do not understand what permissions mean, even if they are inclined to look at the permission list [18]. On the other hand, permission lists provide a necessary foundation for security. Markets cannot simultaneously cater to the security and privacy requirements of all users [19], and permission lists allow researchers and expert users to become "whistle blowers" for security and privacy concerns [11]. In fact, a recent comparison [20] of the Android and iOS versions of applications showed that iOS applications use privacy-sensitive APIs overwhelmingly more often. Tools such as WHYPER can help raise awareness of security and privacy problems and lower the sophistication required for concerned users to take control of their devices.

The remainder of this paper proceeds as follows. Section 2 presents an overview of the WHYPER framework along with background on the NLP techniques used in this work. Section 3 presents our framework design and implementation. Section 4 presents the evaluation of our framework. Section 5 presents discussions and future work. Section 6 discusses related work. Finally, Section 7 concludes.

2 WHYPER Overview

We next present a brief overview of the WHYPER framework. The name WHYPER itself is a word-play on the phrase "why permissions". We envision WHYPER to operate between the application market and end users, either as a part of the application market or as a standalone system, as shown in Figure 1.

The primary goal of the WHYPER framework is to bridge the semantic gap of user expectations by determining why an application requires a permission. In particular, we use application descriptions to get this information. Thus, the WHYPER framework operates between the application market and end users. Furthermore, our framework could also serve to provide developers with feedback to improve their applications, as shown by the dotted arrows between developers and WHYPER in Figure 1.

Figure 1: Overview of WHYPER

A straightforward way of realizing the WHYPER framework is to perform a keyword-based search on application descriptions to annotate sentences describing sensitive operations pertinent to a permission. However, we demonstrate in our evaluation that such an approach is limited by producing many false positives. We propose Natural Language Processing (NLP) as a means to alleviate the shortcomings of keyword-based searching. In particular, we address the following limitations of keyword-based searching:

1. Confounding Effects. Certain keywords such as "contact" have a confounding meaning. For instance, '... displays user contacts, ...' vs. '... contact me at [email protected]'. The first sentence fragment refers to a sensitive operation while the second fragment does not. However, both fragments include the keyword "contact".

To address this limitation, we propose NLP as a means to infer semantics such as whether the word refers to a resource or a generic action.

2. Semantic Inference. Sentences often describe a sensitive operation such as reading contacts without actually referring to the keyword "contact". For instance, "share... with your friends via email, sms". The sentence fragment describes the need for reading contacts; however, the "contact" keyword is not used.

To address this limitation, we propose to use API documents as a source of semantic information for identifying actions and resources related to a sensitive operation.

To the best of our knowledge, ours is the first framework in this direction. We next present the key NLP techniques used in this work.

2.1 NLP Preliminaries

Although well suited for human communication, it is very difficult to convert natural language into unambiguous specifications that can be processed and understood by computers. With recent research advances in the area of NLP, existing NLP techniques have been shown to be fairly accurate in highlighting the grammatical structure of a natural-language sentence. We next briefly introduce the key NLP techniques used in this work.

Parts Of Speech (POS) Tagging [21, 22]. Also known as 'word tagging', 'grammatical tagging', and 'word-sense disambiguation', these techniques aim to identify the part of speech (such as nouns and verbs) a particular word in a sentence belongs to. Current state-of-the-art approaches have been shown to achieve 97% [23] accuracy in classifying POS tags for well-written news articles.

Phrase and Clause Parsing. Also known as chunking, this technique further enhances the syntax of a sentence by dividing it into a constituent set of words (or phrases) that logically belong together (such as a Noun Phrase and a Verb Phrase). Current state-of-the-art approaches can achieve around 90% [23] accuracy in classifying phrases and clauses over well-written news articles.

Typed Dependencies [24, 25]. The Stanford-typed dependencies representation provides a hierarchical semantic structure for a sentence using dependencies with precise definitions of what each dependency means.

Named Entity Recognition [26]. Also known as 'entity identification' and 'entity extraction', these techniques are a subtask of information extraction that aims to classify words in a sentence into predefined categories such as names, quantities, and expressions of time.

We next describe the threat model that we considered while designing our WHYPER framework.

2.2 Use Cases and Threat Model

WHYPER is an enabling technology for a number of use cases. In its simplest form, WHYPER could enable an enhanced user experience for installing applications. For example, the market interface could highlight the sentences that correspond to a specific permission, or raise warnings when it cannot find any sentence for a permission. WHYPER could also be used by market providers to help force developers to disclose functionality to users. In its primitive form, market providers could use WHYPER to ensure permission requests have justifications in the description. More advanced versions of WHYPER could also incorporate the results of static and dynamic application analysis to ensure more semantically appropriate justifications. Such requirements could be placed on all new applications, or iteratively applied to existing applications by automatically emailing developers of applications with unjustified permissions. Alternatively, market providers and security researchers could use WHYPER to help triage markets [5] for dangerous and privacy-infringing applications. Finally, WHYPER could be used in concert with existing crowd-sourcing techniques [27] designed to assess user expectations of application functionality. All of these use cases have unique threat models.

For the purposes of this paper, we consider the generic use scenario where a human is notified by WHYPER if specific permissions requested by an application are not justified by the application's textual description. WHYPER is primarily designed to help identify privacy infringements in relatively benign applications. However, WHYPER can also help highlight malware that attempts to sneak past consumers by adding additional permissions (e.g., to send premium-rate SMS messages). Clearly, a developer can lie when writing the application's description. WHYPER does not attempt to detect such lies. Instead, we assume that statements describing unexpected functionality will appear out-of-place for consumers reading an application description before installing it. We note that malware may still hide malicious functionality (e.g., eavesdropping) within an application designed to use the corresponding permission (e.g., an application to take voice notes). However, false claims in an application's description can provide justification for removal from the market or potentially even criminal prosecution. WHYPER does, however, provide defense against developers that simply include a list of keywords in the application description (e.g., to aid search rankings).

Fundamentally, WHYPER identifies whether there is a possible implementation need for a permission based on the application's description. In a platform such as Android, there are many ways to accomplish the same implementation goals. Some implementation options require permissions, while others do not. For example, an application can make a call that starts Android's address book application to allow the user to select a contact (requiring no permission), or it can access the address book directly (requiring a permission). From WHYPER's perspective, it only matters that privileged functionality may work as expected when using the application, and that the functionality is disclosed in the application's description. Other techniques (e.g., Stowaway [28]) can be used to determine if an application's code actually requires a permission.

3 WHYPER Design

Figure 2: Overview of WHYPER framework

We next present our framework for annotating the sentences that describe the needs for permissions in application descriptions. Figure 2 gives an overview of our framework. Our framework consists of five components: a preprocessor, an NLP parser, an intermediate-representation generator, a semantic engine (SE), and an analyzer.

The preprocessor accepts application descriptions and preprocesses the sentences in the descriptions, such as annotating sentence boundaries and reducing lexical tokens. The intermediate-representation generator accepts the preprocessed sentences and parses them using an NLP parser. The parsed sentences are then transformed into the first-order-logic (FOL) representation. SE accepts the FOL representation of a sentence and annotates the sentence based on the semantic graphs of permissions. Our semantic graphs are derived by analyzing Android API documents. We next describe each component in detail.

3.1 Preprocessor

The preprocessor accepts natural-language application descriptions and preprocesses the sentences to be further analyzed by the intermediate-representation generator. The preprocessor annotates sentence boundaries and reduces the number of lexical tokens using semantic information. The reduction of lexical tokens greatly increases the accuracy of the analysis in the subsequent components of our framework. In particular, the preprocessor performs the following preprocessing tasks:

1. Period Handling. In simplistic English, the period character ('.') marks the end of a sentence. However, there are other legal usages of the period, such as: (1) decimals (periods between numbers), (2) ellipses (three continuous periods '...'), and (3) shorthand notations ("Mr.", "Dr.", "e.g."). While these are legal usages, they hinder detection of sentence boundaries, thus forcing the subsequent components to return incorrect or imprecise results.

We preprocess the sentences by annotating these usages for accurate detection of sentence boundaries. We do so by looking up known shorthand words from WordNet [29] and by detecting decimals, which also use the period character, using regular expressions. From an implementation perspective, we maintain a static lookup table of shorthand words observed in WordNet.
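For illustration, the following is a minimal sketch of how such period handling could work; the shorthand set, function names, and placeholder tokens are ours and only approximate the WordNet-backed lookup table described above.

import re

# Illustrative shorthand set; WHYPER derives its lookup table from WordNet.
SHORTHANDS = {"mr.", "dr.", "e.g.", "i.e.", "etc."}

def mask_non_terminal_periods(text):
    # Protect decimals such as "2.5" by masking the period between digits.
    text = re.sub(r"(?<=\d)\.(?=\d)", "<DOT>", text)
    # Protect ellipses so "..." is not read as three sentence ends.
    text = text.replace("...", "<ELLIPSIS>")
    # Protect known shorthand notations ("Mr.", "Dr.", "e.g.", ...).
    def mask(match):
        token = match.group(0)
        return token.replace(".", "<DOT>") if token.lower() in SHORTHANDS else token
    return re.sub(r"\b\w+\.(?:\w+\.)*", mask, text)

def split_sentences(text):
    # After masking, the remaining periods are treated as sentence boundaries.
    parts = mask_non_terminal_periods(text).split(".")
    return [p.replace("<DOT>", ".").replace("<ELLIPSIS>", "...").strip()
            for p in parts if p.strip()]

print(split_sentences("Contact Dr. Smith, e.g. via the app. Version 2.5 records audio."))
# ['Contact Dr. Smith, e.g. via the app', 'Version 2.5 records audio']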

2. Sentence Boundaries. Furthermore, there are instances where an enumeration list is used to describe functionality, such as "The app provides the following functionality: a) abc..., b) xyz...". While it is easy for a human to understand the meaning, it is difficult for a machine to find appropriate sentence boundaries.

We leverage the structural (positional) information: (1) placements of tabs, (2) bullet points (numbers, characters, roman numerals, and symbols), and (3) delimiters such as ":" to detect appropriate boundaries. We further improve the boundary detection using the following patterns we observe in application descriptions:

• We remove the leading and trailing '*' and '-' characters in a sentence.

• We consider the following characters as sentence separators: '–', '- ', 'ø', '§', '†', '♦', '♣', '♥', '♠', and so on. A comprehensive list can be found on the project website [14].

• For an enumeration sentence that contains at least one enumeration phrase (longer than 5 words), we break down the sentence into short sentences, one for each enumerated item.

3. Named Entity Handling. Sometimes a sequence of words corresponds to the name of an entity that has a specific meaning collectively. For instance, consider the phrases "Pandora internet radio" and "Google maps", which are the names of applications. Further resolution of these phrases using grammatical syntax is unnecessary and would not bring forth any semantic value. Thus, we identify such phrases and annotate them as single lexical units. We achieve this by maintaining a static lookup table.

4. Abbreviation Handling. Natural-language sentences often contain abbreviations mixed with text. This can cause subsequent components to incorrectly parse a sentence. We find such instances and annotate them as a single entity. For example, text followed by an abbreviation, such as "Instant Message (IM)", is treated as a single lexical unit. Detecting such abbreviations is achieved by using the common structure of abbreviations and encoding such structures into regular expressions. Typically, regular expressions provide a reasonable approximation for handling abbreviations.
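As an illustration, a sketch of such a regular-expression-based abbreviation detector is shown below; the pattern and helper names are illustrative rather than the exact expressions used by WHYPER.

import re

# Matches a capitalized multi-word term followed by its abbreviation in
# parentheses, e.g. "Instant Message (IM)". The pattern is an approximation.
ABBREV_PATTERN = re.compile(r"((?:[A-Z][a-z]+\s+)+)\(([A-Z]{2,})\)")

def annotate_abbreviations(sentence, lookup_table):
    # Collapse each "Term (ABBR)" occurrence into a single lexical unit and
    # remember the pairing so later sentences can resolve the bare abbreviation.
    def collapse(match):
        term, abbreviation = match.group(1).strip(), match.group(2)
        lookup_table[abbreviation] = term
        return term.replace(" ", "_")          # e.g. "Instant_Message"
    return ABBREV_PATTERN.sub(collapse, sentence)

table = {}
print(annotate_abbreviations("Send an Instant Message (IM) to any contact.", table))
# "Send an Instant_Message to any contact."   table == {'IM': 'Instant Message'}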

Figure 3: Sentence annotated with Stanford dependencies

3.2 NLP Parser

The NLP parser accepts the preprocessed documents and annotates every sentence within each document using standard NLP techniques. From an implementation perspective, we chose the Stanford Parser [30]; however, this component can be implemented using any other existing NLP library or framework. The parser adds the following annotations:

1. Named Entity Recognition: The NLP parser identifies the named entities in the document and annotates them. Additionally, these entities are added to the lookup table so that the preprocessor uses them when processing subsequent sentences.

2. Stanford-Typed Dependencies [24, 25]: The NLP parser further annotates the sentences with Stanford-typed dependencies. Stanford-typed dependencies are a simple description of the grammatical relationships in a sentence, targeted towards the extraction of textual relationships. In particular, we use the Stanford-typed dependencies as input to our intermediate-representation generator.

Next, we use an example to illustrate the annotations added by the NLP parser. Consider the example sentence "Also you can share the yoga exercise to your friends via Email and SMS.", which indirectly refers to the READ CONTACTS permission. Figure 3 shows the sentence annotated with Stanford-typed dependencies. The words in red are the names of the dependencies connecting the actual words of the sentence (in black). Each word is followed by its Part-Of-Speech (POS) tag (in green). For more details on Stanford-typed dependencies and POS tags, please refer to [24, 25].

3.3 Intermediate-Representation Generator

Figure 4: First-order-logic representation of the annotated sentence in Figure 3

The intermediate-representation generator accepts the annotated documents and builds a relational representation of each document. We define our representation as a tree structure that is essentially a First-Order-Logic (FOL) expression. Recent research has shown the adequacy of using FOL for NLP-related analysis tasks [31–33]. In our representation, every node in the tree except for the leaf nodes is a predicate node. The leaf nodes represent the entities. The children of a predicate node are the participating entities in the relationship represented by the predicate. The first or only child of a predicate node is the governing entity and the second child is the dependent entity. Together, the governing-entity, predicate, and dependent-entity nodes form a tuple.

We implemented our intermediate-representation generator based on the principles of shallow parsing [34]. A typical shallow parser attempts to parse a sentence as a function of POS tags. However, we implemented our parser as a function of Stanford-typed dependencies [22, 24, 25, 30]. We chose Stanford-typed dependencies over POS tags for parsing because Stanford-typed dependencies annotate the grammatical relationships between words in a sentence, thus providing more semantic information than POS tags, which merely highlight the syntax of a sentence.

In particular, our intermediate-representation generator is implemented as a series of cascading finite state machines (FSMs). Earlier research [31, 33–36] has shown the effectiveness and efficiency of using FSMs in linguistic analysis such as morphological lookup, POS tagging, phrase parsing, and lexical lookup. We wrote semantic templates for each of the typed dependencies provided by the Stanford Parser.

Table 1 shows a few of these semantic templates. Column "Dependency" lists the name of the Stanford-typed dependency, Column "Example" lists an example sentence containing the dependency, and Column "Description" describes the formulation of a tuple from the dependency. All of these semantic templates are publicly available on our project website [14]. Figure 4 shows the FOL representation of the sentence in Figure 3. For ease of understanding, read the words in the order of the numbers following them. For instance, "you" ← "share" → "yoga exercise" forms one tuple. Notice that the additional predicate "owned", annotated 6 in Figure 4, does not appear in the actual sentence. The additional predicate "owned" is inserted when our intermediate-representation generator encounters the possession-modifier relation (annotated "poss" in Figure 3).

The FOL representation helps us effectively deal with the problem of confounding effects of keywords described in Section 2. In particular, the FOL representation assists in distinguishing between a resource, which would be a leaf node, and an action, which would be a predicate node in the intermediate representation of a sentence. The generated FOL representation of the sentence is then provided as input to the semantic engine for further processing.
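To make the representation concrete, the following sketch encodes, in Python, a toy version of the tree for the example sentence of Figure 3; the exact node layout of Figure 4 is not reproduced here, and the class is purely illustrative.

# Internal nodes are predicates; leaves are entities. The first child of a
# predicate is its governing entity and the second its dependent entity.
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []        # no children => entity leaf

    def is_predicate(self):
        return bool(self.children)

# Toy encoding for "Also you can share the yoga exercise to your friends via
# Email and SMS.": share(you, yoga exercise), the inserted "owned" predicate
# for the possession modifier "your friends", and the leaves "email" and
# "sms" reached from the predicate "share".
rep = Node("share", [
    Node("you"),                                    # governing entity
    Node("yoga exercise"),                          # dependent entity
    Node("owned", [Node("you"), Node("friends")]),  # inserted for "poss"
    Node("email"),
    Node("sms"),
])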

3.4 Semantic Engine (SE)

The semantic engine (SE) accepts the FOL representation of a sentence and, based on the semantic graphs of Android permissions, annotates the sentence if it matches the criteria. A semantic graph is essentially a semantic representation of the resources that are governed by a permission. For instance, the READ CONTACTS permission governs the resource "CONTACTS" in the Android system.

Figure 5 shows the semantic graph for the READ CONTACTS permission. A semantic graph primarily consists of the subordinate resources of a permission (represented in rectangular boxes) and a set of actions available on each resource (represented in curved boxes). Section 3.5 elaborates on how we build such graphs systematically.
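The following sketch shows how such a semantic graph could be encoded as plain data for the READ CONTACTS permission; only a subset of the resources and actions of Figure 5 is shown, and the action lists for the subordinate resources other than "email" are guesses included purely for illustration.

# Partial, illustrative encoding of the READ_CONTACTS semantic graph: the
# permission's resource, the actions applicable to it, and its subordinate
# resources (each with its own actions and, possibly, further subordinates).
READ_CONTACTS_GRAPH = {
    "resource": "contact",
    "actions": ["read", "insert", "update", "delete"],
    "subordinates": [
        {"resource": "email",       "actions": ["send"],         "subordinates": []},
        {"resource": "number",      "actions": ["call", "send"], "subordinates": []},
        {"resource": "location",    "actions": ["view"],         "subordinates": []},
        {"resource": "birthday",    "actions": ["view"],         "subordinates": []},
        {"resource": "anniversary", "actions": ["view"],         "subordinates": []},
    ],
}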

Our SE accepts the semantic graph pertaining to a permission and annotates a sentence based on the algorithm shown in Algorithm 1. The algorithm accepts the FOL representation of a sentence rep, the semantic graph g associated with the resource of a permission, and a boolean value recursion that governs the recursion. The algorithm outputs a boolean value isPStmt, which is true if the statement describes the permission associated with the semantic graph (g), and false otherwise.

Our algorithm systematically explores the FOL representation of the sentence to determine if the sentence describes the need for a permission. First, our algorithm attempts to locate an occurrence of the associated resource name within the leaf nodes of the FOL representation of the sentence (Line 3). The method findLeafContaining(name) explores the FOL representation to find a leaf node that contains the term name. Furthermore, we use WordNet and lemmatisation [37] to deal with synonyms of the word in question to find appropriate matches. Once a leaf node is found, we systematically traverse the tree from the leaf node to the root, matching all parent predicates as well as immediate child predicates [Lines 5-16].

Table 1: Semantic templates for Stanford-typed dependencies

S. No. | Dependency | Example | Description
1 | conj | "Send via SMS and email." conj_and(email, SMS) | Governor (SMS) and dependent (email) are connected by a relationship of conjunction type (and)
2 | iobj | "This App provides you with beautiful wallpapers." iobj(provides, you) | Governor (you) is treated as the dependent entity of the relationship, resolved by parsing the dependent's (provides) typed dependencies
3 | nsubj | "This is a scrollable widget." nsubj(widget, This) | Governor (This) is treated as the governing entity of the relationship, resolved by parsing the dependent's (widget) typed dependencies

Our algorithm matches each of the traversed predicates against the actions associated with the resource defined in the semantic graph. Similar to matching entities, we also employ WordNet and lemmatisation [37] to deal with synonyms to find appropriate matches. If a match is found, then the value isPStmt is set to true, indicating that the statement describes a permission.

If no match is found, our algorithm recursively searches all the associated subordinate resources in the semantic graph of the current resource. A subordinate resource may further have its own subordinate resources. Currently, our algorithm considers only the immediate subordinate resources of a resource to limit false positives.

In the context of the FOL representation shown in Figure 4, we invoke Algorithm 1 with the semantic graph shown in Figure 5. Our algorithm attempts to find a leaf node containing the term "CONTACT" or one of its synonyms. Since the FOL representation does not contain such a leaf node, the algorithm calls itself with the semantic graphs of the subordinate resources (Lines 17-25), namely 'NUMBER', 'EMAIL', 'LOCATION', 'BIRTHDAY', and 'ANNIVERSARY'.

The subsequent invocation finds the leaf node "email" (annotated 9 in Figure 4). Our algorithm then explores the preceding predicates and finds the predicate "share" (annotated 2 in Figure 4). The algorithm matches the word "share" with the action "send" (using lemmatisation and WordNet similarity), one of the actions available in the semantic graph of the resource 'EMAIL', and returns true. Thus, the sentence is appropriately identified as describing the need for the READ CONTACTS permission.

3.5 Semantic-Graph Generator

A key aspect of our proposed framework is the employment of a semantic graph of a permission to perform deep semantic inference of sentences. In particular, we propose to initially infer such graphs from API documents.

Algorithm 1 Sentence Annotator
Input: K_Graph g, FOL_rep rep, Boolean recursion
Output: Boolean isPStmt

1:  Boolean isPStmt = false
2:  String r_name = g.resourceName
3:  FOL_rep r' = rep.findLeafContaining(r_name)
4:  List actionList = g.actionList
5:  while (r'.hasParent) do
6:    if actionList.contains(r'.parent.predicate) then
7:      isPStmt = true
8:      break
9:    else
10:     if actionList.contains(r'.leftSibling.predicate) then
11:       isPStmt = true
12:       break
13:     end if
14:   end if
15:   r' = r'.parent
16: end while
17: if ((NOT isPStmt) AND recursion) then
18:   List resourceList = g.resourceList
19:   for all (Resource res in resourceList) do
20:     isPStmt = Sentence_Annotator(getKGraph(res), rep, false)
21:     if isPStmt then
22:       break
23:     end if
24:   end for
25: end if
26: return isPStmt
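The sketch below re-expresses the core of Algorithm 1 in Python, reusing the Node class and graph dictionaries sketched earlier. It simplifies the parent/left-sibling walk of Lines 5-16 into a check for an action predicate that dominates a matching leaf, and it reduces WordNet/lemmatisation-based synonym matching to a toy lookup table, so it is an approximation rather than a faithful port of the actual implementation.

SYNONYMS = {"share": {"send"}}                 # toy similarity table

def similar(word, target):
    return word == target or target in SYNONYMS.get(word, set())

def subtree_has_leaf(node, name):
    # True if some entity leaf below (or at) this node matches the resource name.
    if not node.children:
        return similar(node.label, name)
    return any(subtree_has_leaf(child, name) for child in node.children)

def annotate(graph, rep, recursion=True):
    # A predicate whose label matches one of the graph's actions and which
    # dominates a leaf matching the graph's resource marks the sentence as a
    # permission sentence; otherwise fall back to the immediate subordinate
    # resources, mirroring Lines 17-25 of Algorithm 1.
    def walk(node):
        if not node.is_predicate():
            return False
        if any(similar(node.label, action) for action in graph["actions"]) \
                and subtree_has_leaf(node, graph["resource"]):
            return True
        return any(walk(child) for child in node.children)

    if walk(rep):
        return True
    if recursion:
        return any(annotate(sub, rep, recursion=False)
                   for sub in graph["subordinates"])
    return False

# "contact" itself does not occur in the example tree, but the subordinate
# resource "email" is dominated by the predicate "share", which is similar
# to the action "send", so the sentence is flagged.
print(annotate(READ_CONTACTS_GRAPH, rep))      # True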

For third-party applications on mobile devices, the relatively limited resources (memory and computation power compared to desktops and servers) encourage the development of thin clients. The key insight behind leveraging API documents is that mobile applications are predominantly thin clients, and the actions and resources provided by API documents can cover most of the functionality performed by these thin clients.

Figure 5: Semantic graph for the READ CONTACTS permission

Manually creating a semantic graph is prohibitively time consuming and may be error prone. We thus came up with a systematic methodology to infer such semantic graphs from API documents that can potentially be automated. First, we leverage Au et al.'s work [38] to find the API document of the class/interface pertaining to a particular permission. Second, we identify the corresponding resource associated with the permission from the API class name. For instance, we identify 'CONTACTS' and 'ADDRESS BOOK' from the ContactsContract.Contacts class (http://developer.android.com/reference/android/provider/ContactsContract.Contacts.html), which is associated with the READ CONTACTS permission. Third, we systematically inspect the member variables and member methods to identify actions and subordinate resources.

From the names of member variables, we extract noun phrases and then investigate their types to decide whether these noun phrases describe resources. For instance, the ContactsContract.Contacts class has a member variable "email" (whose type is ContactsContract.CommonDataKinds.Email). From this variable, we extract the noun phrase "EMAIL" and classify the phrase as a resource.

From the name of an Android API public method (describing a possible action on a resource), we extract both noun phrases and their related verb phrases. The noun phrases are used as resources and the verb phrases are used as the associated actions. For instance, ContactsContract.Contacts defines the operations Insert, Update, Delete, and so on. We consider those operations as individual actions associated with the 'CONTACTS' resource.

The process is iteratively applied to the individual subordinate resources that are discovered for an API. For instance, "EMAIL" is identified as a subordinate resource of the "CONTACT" resource. Figure 5 shows a sub-graph of the graph for the READ CONTACTS permission.
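A naive approximation of this extraction step is sketched below: it splits Android API member names into words, treats a method name's leading verb as an action, and treats the noun phrases from field names as subordinate resources. The real methodology additionally uses the documented types of member variables and POS tagging on the extracted phrases; the member lists and helper names in the example are illustrative.

import re

def split_identifier(name):
    # "insertContact" -> ["insert", "contact"]; also handles ALL_CAPS fields.
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    return [p.lower() for p in parts]

def graph_from_api(class_resource, field_names, method_names):
    graph = {"resource": class_resource, "actions": set(), "subordinates": set()}
    for field in field_names:                  # e.g. "EMAIL", "PHONE_NUMBER"
        graph["subordinates"].add(" ".join(split_identifier(field)))
    for method in method_names:                # e.g. "insert", "update", "delete"
        graph["actions"].add(split_identifier(method)[0])   # leading verb as action
    return graph

print(graph_from_api("contact",
                     ["EMAIL", "PHONE_NUMBER"],
                     ["insert", "update", "delete"]))
# resource 'contact', actions {'insert', 'update', 'delete'},
# subordinates {'email', 'phone number'}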

4 Evaluation

We now present the evaluation of WHYPER. Given an application, the WHYPER framework bridges the semantic gap between user expectations and the permissions it requests. It does this by identifying in the application description the sentences that describe the need for a given permission. We refer to these sentences as permission sentences. To evaluate the effectiveness of WHYPER, we compare the permission sentences identified by WHYPER to a manual annotation of all sentences in the application descriptions. This comparison provides a quantitative assessment of the effectiveness of WHYPER. Specifically, we seek to answer the following research questions:

• RQ1: What are the precision, recall, and F-score of WHYPER in identifying permission sentences (i.e., sentences that describe the need for a permission)?

• RQ2: How effective is WHYPER in identifying permission sentences compared to keyword-based searching?

4.1 Subjects

We evaluated WHYPER using a snapshot of popular application descriptions. This snapshot was downloaded in January 2012 and contained the top 500 free applications in each category of the Google Play Store (16,001 total unique applications). We then identified the applications that contained specific permissions of interest.

For our evaluation, we consider the READ CONTACTS, READ CALENDAR, and RECORD AUDIO permissions. We chose these permissions because they protect tangible resources that users understand and have significant enough security and privacy implications that developers should provide justification in the application's description. We found that 2,327 applications had at least one of these three permissions. Since our evaluation requires manual effort to classify each sentence in the application description, we further reduced this set of applications by randomly choosing 200 applications for each permission. This resulted in a total of 600 unique applications for these permissions. The set was further reduced by only considering applications that had an English description. Overall, we analyzed 581 application descriptions, which contained 9,953 sentences (as parsed by WHYPER).

4.2 Evaluation Setup

We first manually annotated the sentences in the application descriptions. We had three authors independently annotate sentences in our corpus, ensuring that each sentence was annotated by at least two authors. The individual annotations were then discussed by all three authors to reach a consensus. In our evaluation, we annotated a sentence as a permission sentence if at least two authors agreed that the sentence described the need for a permission. Otherwise, we annotated the sentence as a permission-irrelevant sentence.

Table 2: Statistics of subject permissions

Permission      #N   #S    SP
READ CONTACTS   190  3379  235
READ CALENDAR   191  2752  283
RECORD AUDIO    200  3822  245
TOTAL           581  9953  763

#N: Number of applications that request the permission; #S: Total number of sentences in the application descriptions; SP: Number of sentences manually identified as permission sentences.

We applied WHYPER on the application descriptions and manually measured the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) produced by WHYPER as follows:

1. TP: A sentence that WHYPER correctly identifies as a permission sentence.

2. FP: A sentence that WHYPER incorrectly identifies as a permission sentence.

3. TN: A sentence that WHYPER correctly identifies as not a permission sentence.

4. FN: A sentence that WHYPER incorrectly identifies as not a permission sentence.

Table 2 shows the statistics of the subjects used in the evaluations of WHYPER. Column "Permission" lists the names of the permissions. Column "#N" lists the number of applications that request the permissions used in our evaluations. Column "#S" lists the total number of sentences in the application descriptions. Finally, Column "SP" lists the number of sentences that were manually identified as permission sentences by the authors. We used this manual identification (Column "SP") to quantify the effectiveness of WHYPER in identifying permission sentences by answering RQ1. The results in Column "SP" are also used to compare WHYPER with keyword-based searching to answer RQ2, described next.

For RQ2, we applied keyword-based searching on the same subjects. We consider a word as a keyword in the context of a permission if it is a synonym of a word in the permission. To minimize manual effort, we used words present in the Manifest.permission class from the Android API. Table 4 shows the keywords used in our evaluation. We then measured the number of true positives (TP′), false positives (FP′), true negatives (TN′), and false negatives (FN′) produced by keyword-based searching as follows:

1. TP′: A sentence that is a permission sentence and contains the keywords.

2. FP′: A sentence that is not a permission sentence but contains the keywords.

3. TN′: A sentence that is not a permission sentence and does not contain the keywords.

4. FN′: A sentence that is a permission sentence but does not contain the keywords.

In statistical classification [39], precision is defined as the ratio of the number of true positives to the total number of items reported to be true, and recall is defined as the ratio of the number of true positives to the total number of items that are true. F-score is defined as the weighted harmonic mean of precision and recall. Accuracy is defined as the ratio of the sum of true positives and true negatives to the total number of items. Higher values of precision, recall, F-score, and accuracy indicate higher quality of the permission sentences inferred using WHYPER. Based on the total numbers of TPs, FPs, TNs, and FNs, we computed the precision, recall, F-score, and accuracy of WHYPER in identifying permission sentences as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-score = (2 × Precision × Recall) / (Precision + Recall)
Accuracy = (TP + TN) / (TP + FP + TN + FN)
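As a quick sanity check, computing these metrics from the totals in Table 3 (TP = 622, FP = 129, FN = 141, TN = 9,061) yields values in line with the reported averages:

def metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_score, accuracy

p, r, f, a = metrics(tp=622, fp=129, fn=141, tn=9061)
print(f"P={p:.1%}  R={r:.1%}  FS={f:.1%}  Acc={a:.1%}")
# P=82.8%  R=81.5%  FS=82.2%  Acc=97.3%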

4.3 Results

We next describe our evaluation results to demonstrate the effectiveness of WHYPER in identifying permission sentences.

4.3.1 RQ1: Effectiveness in identifying permission sentences

In this section, we quantify the effectiveness of WHYPER in identifying permission sentences by answering RQ1. Table 3 shows the effectiveness of WHYPER in identifying permission sentences. Column "Permission" lists the names of the permissions. Column "SI" lists the number of sentences identified by WHYPER as permission sentences. Columns "TP", "FP", "FN", and "TN" represent the number of true positives, false positives, false negatives, and true negatives, respectively. Columns "P(%)", "R(%)", "FS(%)", and "Acc(%)" list the percentage values of precision, recall, F-score, and accuracy, respectively. Our results show that, out of 9,953 sentences, WHYPER effectively identifies permission sentences with an average precision, recall, F-score, and accuracy of 82.8%, 81.5%, 82.2%, and 97.3%, respectively.

Table 3: Evaluation results

Permission      SI   TP   FP   FN   TN    P (%)  R (%)  FS (%)  Acc (%)
READ CONTACTS   204  186  18   49   2930  91.2   79.1   84.7    97.9
READ CALENDAR   288  241  47   42   2422  83.7   85.1   84.4    96.8
RECORD AUDIO    259  195  64   50   3470  75.9   79.7   77.4    97.0
TOTAL           751  622  129  141  9061  82.8*  81.5*  82.2*   97.3*

* Column average; SI: Number of sentences identified by WHYPER as permission sentences; TP: Total number of True Positives; FP: Total number of False Positives; FN: Total number of False Negatives; TN: Total number of True Negatives; P: Precision; R: Recall; FS: F-Score; Acc: Accuracy.

We also observed that, out of the 581 applications whose descriptions we used in our experiments, only 86 applications contained at least one statement for which WHYPER produced a false negative. Similarly, only 109 applications contained at least one statement for which WHYPER produced a false positive.

We next present an example to illustrate how WHYPER incorrectly identifies a sentence as a permission sentence (producing false positives). False positives are particularly undesirable in the context of our problem domain, because they can mislead the users of WHYPER into believing that a description actually describes the need for a permission. Furthermore, an overwhelming number of false positives may result in user fatigue, and thus devalue the usefulness of WHYPER.

Consider the sentence "You can now turn recordings into ringtones.". The sentence describes application functionality that allows users to create ringtones from previously recorded sounds. However, the described functionality does not require the permission to record audio. WHYPER identifies this sentence as a permission sentence. WHYPER correctly identifies the word recordings as a resource associated with the record-audio permission. Furthermore, our intermediate representation also correctly captures that the action turn is performed on the resource recordings. However, our semantic engine incorrectly matches the action turn with the action start; the latter is a valid semantic action that is permitted by the API on the resource recording. In particular, the general-purpose WordNet-based similarity metric [37] reports a 93.3% similarity. We observed that a majority of false positives resulted from incorrect matching of semantic actions against a resource. Such instances can be addressed by using domain-specific dictionaries for synonym analysis.

Another major source of FPs is the incorrect parsing of sentences by the underlying NLP infrastructure. For instance, consider the sentence "MyLink Advanced provides full synchronization of all Microsoft Outlook emails (inbox, sent, outbox and drafts), contacts, calendar, tasks and notes with all Android phones via USB.". The sentence describes that the user's calendar will be synchronized. However, the underlying Stanford parser [30] is not able to accurately annotate the dependencies. Our intermediate-representation generator uses shallow parsing that is a function of these dependencies. An inaccurate dependency annotation causes an incorrect construction of the intermediate representation and eventually causes an incorrect classification. Such instances will be addressed as the underlying NLP infrastructure advances. Overall, a significant number of false positives will be eliminated as NLP research advances the underlying infrastructure.

We next present an example to illustrate how WHYPER fails to identify a valid permission sentence (a false negative). Consider the sentence "Blow into the mic to extinguish the flame like a real candle". The sentence describes a semantic action of blowing into the microphone. The noise created by blowing will be captured by the microphone, thus implying the need for the record-audio permission. WHYPER correctly identifies the word mic as a resource associated with the record-audio permission. Furthermore, our intermediate representation also correctly shows that the action blow into is performed on the resource mic. However, from the API documents, there is no way for the WHYPER framework to infer that blowing into the microphone semantically implies recording of sound. Thus, WHYPER fails to identify the sentence as a permission sentence. We can reduce a significant number of false negatives by constructing better semantic graphs.

Similar to the reasons for false positives, a major source of false negatives is the incorrect parsing of sentences by the underlying NLP infrastructure. For instance, consider the sentence "Pregnancy calendar is an application that,not only allows, after entering date of last period menstrual,to calculate the presumed (or estimated) date of birth; but, offering prospects to show,week to week,all appointments which must to undergo every mother,ad a rule,for a correct and healthy pregnancy." (Note that the incorrect grammar, punctuation, and spacing are a reproduction of the original description.) The sentence describes that the user's calendar will be used to display weekly appointments. However, the length and complexity in terms of the number of clauses cause the underlying Stanford parser [30] to inaccurately annotate the dependencies, which eventually results in an incorrect classification.

We also observed that in a few cases, the process followed to identify sentence boundaries resulted in extremely long and possibly incorrect sentences. Consider a sentence, produced by our preprocessor, that was not identified as a permission sentence for the READ CALENDAR permission:

Daily Brief "How does my day look like today" "Any meetings today" "My reminders" "Add reminder" Essentials Email, Text, Voice dial, Maps, Directions, Web Search "Email John Subject Hello Message Looking forward to meeting with you tomorrow" "Text Lisa Message I will be home in an hour" "Map of Chicago downtown" "Navigate to Millenium Park" "Web Search Green Bean Casserole" "Open Calculator" "Opean Alarm Clock" "Launch Phone book" Personal Health Planner ... "How many days until Christmas" Travel Planner "Show airline directory" "Find hotels" "Rent a car" "Check flight status" "Currency converter" Cluzee Car Mode Access Daily Brief, Personal Radio, Search, Maps, Directions etc..

A few fragments (sub-sentences) of this incorrectly marked sentence describe the need for the read-calendar permission ("...My reminders ... Add reminder ..."). However, inaccurate identification of sentence boundaries causes the underlying NLP infrastructure to produce an incorrect annotation. Such an incorrect annotation is propagated to subsequent phases of WHYPER, ultimately resulting in inaccurate identification of a permission sentence. Overall, as current research in the field of NLP advances the underlying NLP infrastructure, a significant number of false negatives will be eliminated.

4.3.2 RQ2: Comparison to keyword-based searching

In this section, we answer RQ2 by comparing WHYPER to a keyword-based searching approach in identifying permission sentences. As described in Section 4.2, we manually measured the number of permission sentences in the application descriptions. Furthermore, we also manually computed the precision (P), recall (R), F-score (FS), and accuracy (Acc) of WHYPER, as well as the precision (P′), recall (R′), F-score (FS′), and accuracy (Acc′) of keyword-based searching in identifying permission sentences. We then calculated the improvement in using WHYPER against keyword-based searching as ∆P = P − P′, ∆R = R − R′, ∆FS = FS − FS′, and ∆Acc = Acc − Acc′. Higher values of ∆P, ∆R, ∆FS, and ∆Acc are indicative of better performance of WHYPER against keyword-based search.

Table 4: Keywords for permissions

S. No  Permission      Keywords
1      READ CONTACTS   contact, data, number, name, email
2      READ CALENDAR   calendar, event, date, month, day, year
3      RECORD AUDIO    record, audio, voice, capture, microphone

Table 5: Comparison with keyword-based search

Permission      ∆P%   ∆R%   ∆FS%  ∆Acc%
READ CONTACTS   50.4  1.3   31.2  7.3
READ CALENDAR   39.3  1.5   26.4  9.2
RECORD AUDIO    36.9  -6.6  24.3  6.8
Average         41.6  -1.2  27.2  7.7

Table 5 shows the comparison of WHYPER in identifying permission sentences to the keyword-based searching approach. Columns "∆P", "∆R", "∆FS", and "∆Acc" list the percentage increases in precision, recall, F-score, and accuracy, respectively. Our results show that, in comparison to keyword-based searching, WHYPER effectively identifies permission sentences with average increases in precision, F-score, and accuracy of 41.6%, 27.2%, and 7.7%, respectively. We did observe a decrease in average recall of 1.2%, primarily due to the poorer performance of WHYPER for the RECORD AUDIO permission.

However, it is interesting to note that there is a substantial increase in precision (average 46.0%) in comparison to keyword-based searching. This increase is attributed to the large false-positive rate of keyword-based searching. In particular, for descriptions related to the READ CONTACTS permission, WHYPER resulted in only 18 false positives, compared to 265 false positives produced by keyword-based search. Similarly, for descriptions related to RECORD AUDIO, WHYPER resulted in 64 false positives while keyword-based searching produced 338 false positives.

We next present illustrative examples of how WHYPER performs better than keyword-based search in the context of false positives. One major source of false positives in keyword-based search is the confounding effect of certain keywords such as contact. Consider the sentence "contact me if there is a bad translation or you'd like your language added!". The sentence describes that the developer is open to feedback about the application. Keyword-based searching incorrectly identifies this sentence as a permission sentence for the READ CONTACTS permission. However, the word contact here refers to an action rather than a resource. In contrast, WHYPER correctly identifies the word contact as an action applicable to the pronoun me. Our framework thus correctly classifies the sentence as a permission-irrelevant sentence.


Consider another sentence, "To learn more, please visit our Checkmark Calendar web site: calendar.greenbeansoft.com", as an instance of the confounding effect of keywords. The sentence is incorrectly identified as a permission sentence for the READ CALENDAR permission because it contains the keyword calendar. In contrast, WHYPER correctly identifies "Checkmark Calendar" as a named entity rather than the resource calendar. Our framework thus correctly identifies the sentence as not a permission sentence.

Another common source of false positives in keyword-based searching is the lack of semantic context around a keyword. For instance, consider the sentence "That's what this app brings to you in addition to learning numbers!". A keyword-based search classifies this sentence as a permission sentence because it contains the keyword number, which is one of the keywords for the READ CONTACTS permission listed in Table 4. However, the sentence is actually describing generic numbers rather than phone numbers. Similar to keyword-based search, our framework also identifies the word number as a candidate match for the subordinate resource number of the READ CONTACTS permission (Figure 5). However, the identified semantic action on the candidate resource number in this sentence is learning. Since learning is not an applicable action on the phone number resource in our semantic graphs, WHYPER correctly classifies the sentence as not a permission sentence.

The final category where WHYPER performed better than keyword-based search is due to the use of synonyms. For instance, address book is a synonym for the contact resource. Similarly, mic is a synonym for the microphone resource. Our framework leverages this synonym information in identifying the resources in a sentence. Synonyms could potentially be used to augment the list of keywords in keyword-based search. However, given that keyword-based search already suffers from a very high false-positive rate, we believe synonym augmentation of keywords would further worsen the problem.

We next present discussions on why WHYPER caused a decline in recall in comparison to keyword-based search. We do observe a small increase in recall for READ CONTACTS (1.3%) and READ CALENDAR (1.5%) permission-related sentences, indicating that WHYPER performs slightly better than keyword-based search. However, WHYPER performs notably worse for RECORD AUDIO permission-related descriptions, which results in an overall decrease in recall compared to keyword-based search.

One of the reasons for such a decline in recall is the false negatives produced by our framework. As described in Section 4.3.1, incorrect identification of sentence boundaries and limitations of the underlying NLP infrastructure caused a significant number of false negatives in WHYPER. Thus, improvements in these areas will significantly decrease the false-negative rate of WHYPER and, in turn, make the existing gap negligible.

Another cause of false negatives in our approach is the inability to infer knowledge for some 'resource'-'semantic action' pairs, for instance, 'microphone'-'blow into'. We further observed that with a small manual effort in augmenting the semantic graphs for a permission, we could significantly bring down the false negatives of our approach. For instance, after a cursory manual inspection of false-negative sentences for the RECORD AUDIO permission, we augmented the semantic graphs with just two resource-action pairs (1. microphone-blow into and 2. call-record). We then applied WHYPER with the augmented semantic graph on RECORD AUDIO permission sentences.
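In terms of the graph encoding sketched in Section 3.4, this augmentation amounts to adding two resource-action pairs; the base RECORD_AUDIO graph content below is illustrative.

# Base RECORD_AUDIO graph (content illustrative), using the same structure as
# the READ_CONTACTS example in Section 3.4.
RECORD_AUDIO_GRAPH = {
    "resource": "audio",
    "actions": ["record", "capture"],
    "subordinates": [
        {"resource": "microphone", "actions": ["record"], "subordinates": []},
    ],
}

# The two manually added resource-action pairs described above:
for sub in RECORD_AUDIO_GRAPH["subordinates"]:
    if sub["resource"] == "microphone":
        sub["actions"].append("blow into")          # 1. microphone - "blow into"
RECORD_AUDIO_GRAPH["subordinates"].append(
    {"resource": "call", "actions": ["record"], "subordinates": []})  # 2. call - "record"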

This augmentation increased the ∆R value from -6.6% to 0.6% for the RECORD AUDIO permission and yielded an average increase of 1.1% in ∆R across all three permissions, without affecting the values of ∆P. We refrained from including such modifications when reporting the results in Table 5 to stay true to our proposed framework. In the future, we plan to investigate techniques to construct better semantic graphs for permissions, such as mining user comments and forums.

4.4 Summary

In summary, we demonstrate that WHYPER effectively identifies permission sentences with an average precision, recall, F-score, and accuracy of 80.1%, 78.6%, 79.3%, and 97.3%, respectively. Furthermore, we also show that WHYPER performs better than keyword-based search, with an average increase in precision of 40% and a relatively small decrease in average recall (1.2%). We also discuss how this gap in recall can be alleviated by improving the underlying NLP infrastructure and with a little manual effort. We next present discussions on threats to validity.

4.5 Threats to Validity

Threats to external validity primarily include the degree to which the subject permissions used in our evaluations were representative. To minimize this threat, we used permissions that guard resources that can be subjected to privacy- and security-sensitive operations. The threat can be further reduced by evaluating WHYPER on more permissions. Another threat to external validity is the representativeness of the description sentences we used in our experiments. To minimize this threat, we randomly selected actual Android application descriptions from a snapshot of the meta-data of 16,001 applications from the Google Play Store (dated January 2012).

Threats to internal validity include the correctness of our implementation in inferring the mapping between natural-language description sentences and application permissions. To reduce this threat, we manually inspected all the sentences annotated by our system. Furthermore, we ensured that the results were individually verified and agreed upon by at least two authors. The results of our experiments are publicly available on the project website [14].

5 Discussions and Future Work

Our framework currently takes into account only application descriptions and Android API documents to highlight permission sentences. Thus, our framework can semi-formally enumerate the uses of a permission. This information can be leveraged in the future to enhance searching for desired applications.

Furthermore, the outputs from our framework could be used in conjunction with program analysis techniques to facilitate effective code reuse. For instance, our framework outputs the reasons why a permission is needed by an application. These reasons are usually the functionalities provided by the application. In future work, we plan to locate the code fragments that implement the described functionalities. Such a mapping can be used as an index for existing code-searching approaches [40, 41] to facilitate effective reuse.

Modular Applications: In our evaluations, we encountered cases where a description refers to another application in which the use of the permission is described. For instance, consider the following description sentence: “Navigation2GO is an application to easily launch Google Maps Navigation.”

The description states that the current application will launch another application, “Google Maps Navigation”, and thus requires the permissions required by that application. Currently, our framework does not handle such cases. We plan to implement deeper semantic analysis of description sentences to identify such permission sentences.

Generalization to Other Permissions: WHYPER is designed to identify the textual justification for permissions that protect “user understandable” information and resources. That is, the permission must protect an information source or resource that is in the domain of knowledge of general smartphone users, as opposed to a low-level API known only to developers. The permissions studied in this paper (i.e., address book, calendar, microphone) fall within this domain. Based on our studies, we expect that similar permissions will see similar success with WHYPER, such as those that protect SMS interfaces and data stores, the ability to make and receive phone calls, read call logs and browser history, operate and administer Bluetooth and NFC, and access and use phone accounts.

Due to current developer trends and practices, there is a class of permissions that we expect will raise alarms for many applications when evaluated with WHYPER. Recent work [7, 11] has shown that many applications leak geographic location and phone identifiers without the user's knowledge. We recommend that deployments of WHYPER first focus on other permissions to better gauge and account for developer response. Once general deployment experience with WHYPER has been gained, these more contentious permissions should be tackled. We believe that adding justification for access to geographic location and phone identifiers in the application's textual description will benefit users. For example, if an application uses location for ads or analytics, the developer should state this in the application description.

Finally, there are some permissions that are implicitly used by applications and therefore will have poor results with WHYPER. In particular, we do not expect the Internet permission to work well with WHYPER. Nearly all smartphone applications access the Internet, and we expect that attempts to build a semantic graph for the Internet permission will be largely ineffective.

Results Presentation: A potential after-effect of using WHYPER on existing application descriptions might be more verbose application descriptions. One can argue that this would place an additional burden on end users to read a lengthy description. However, such additional description provides an opportunity for users to make informed decisions instead of making assumptions. In future work, we plan to implement and evaluate interfaces that present users with information in a more efficient way, countering user fatigue in the case of lengthy descriptions. For example, we may consider using icons in different colors to represent permissions with and without explanations.

6 Related Work

Our proposed framework touches quite a few research areas, such as mining mobile application stores, NLP on software engineering artifacts, program comprehension, and software verification. We next discuss relevant work pertinent to our proposed framework in these areas.

Harman et al. [42] first used mining approaches on application descriptions, pricing, and customer ratings in the BlackBerry App Store. In particular, they use lightweight text analytics to extract feature information, which is then combined with pricing and customer ratings to analyze applications' business aspects. In contrast, WHYPER uses deep semantic analysis of sentences in application descriptions, which can be used as a complementary approach to their feature extraction.

The use of NLP techniques is not new in the software engineering domain. NLP has been used previously in requirements engineering [31, 32, 43]. NLP has even been used to assess the usability of API docs [44].

There are existing approaches [33, 45–47] that apply either NLP or a mix of NLP and machine learning algorithms to infer specifications from natural-language elements such as code comments, API descriptions, and formal requirements documents. The semi-formal structure of the natural-language documents used in these approaches facilitates the application of the NLP techniques required for these problem areas. Furthermore, these approaches often rely on some meta-information from source code as well. In contrast, our proposed framework targets relatively more unconstrained natural-language text and is independent of the source code of the application under analysis.

With respect to program comprehension, there are existing techniques that assist in building domain-specific ontologies [48]. Furthermore, there are existing approaches [49, 50] that automatically infer natural-language documentation from source code. These approaches would immensely help in comprehending the functionality of a mobile application. However, the inherent dependency on source code to generate such documents poses a problem in cases where source code is not available. In contrast, WHYPER relies on application descriptions and API documents, which are readily available artifacts for mobile applications.

In the realm of mobile software verification, there is existing work on permissions [2–5, 15], code [6–10], and runtime behavior [11–13] to detect malicious applications. In particular, Zhou et al. [3] propose an approach that leverages the permission information in application manifests as a criterion to filter malicious applications. They further employ static analysis to identify malicious applications by forming behavioral patterns in terms of sequences of API calls of known malicious applications. They also propose the use of heuristics-based dynamic analysis to detect previously unknown malicious applications. Furthermore, Enck et al. [11] also use dynamic analysis techniques (dynamic taint analysis) to detect misuse of users' private information.

These previously described techniques are primarily targeted towards finding malicious mobile applications. However, these approaches do little to bridge the semantic gap between what the user expects an application to do and what it actually does. In contrast, WHYPER is the first step targeted towards bridging this gap. Furthermore, WHYPER can be used in conjunction with these approaches for an improved experience while interacting with the mobile ecosystem.

In addition, Felt et al. [28] apply automated testing techniques to find the permissions required to invoke each method in the Android 2.2 API. They use this information to detect over-privilege in Android applications by generating the maximum set of permissions required by an application and comparing it to the permissions requested by the application. Although it is important from a developer perspective to minimize the set of permissions requested, this information does not empower an end user to decide what the requested permissions are being used for. In contrast, WHYPER leverages some of the results provided by Felt et al. [28] to highlight the sentences describing the need for a permission, in turn enabling end users to make informed decisions while installing an application.
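For illustration, the over-privilege check described above amounts to a set difference; the following minimal sketch (with hypothetical permission sets, not Felt et al.'s implementation) shows the idea:

```python
# Hypothetical example of the over-privilege check: permissions requested in
# the manifest but not required by any API method the application invokes.
required_by_api_calls = {"android.permission.RECORD_AUDIO"}
requested_in_manifest = {"android.permission.RECORD_AUDIO",
                         "android.permission.READ_CONTACTS"}

over_privileged = requested_in_manifest - required_by_api_calls
print(over_privileged)  # {'android.permission.READ_CONTACTS'}
```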

Finally, Lin et al. [27] introduce a new model for privacy, namely privacy as expectations, for mobile applications. In particular, they use crowd-sourcing as a means to capture users' expectations of the sensitive resources used by mobile applications. We believe the sentences highlighted by WHYPER can be used as supporting evidence to formulate such user expectations in the first place.

7 Conclusion

In this paper, we have presented WHYPER, a framework that uses Natural Language Processing (NLP) techniques to determine why an application uses a permission. We evaluated our prototype implementation of WHYPER on real-world application descriptions that involve three permissions (address book, calendar, and record audio). These are frequently-used permissions that protect privacy- and security-sensitive resources. Our evaluation results show that WHYPER achieves an average precision of 82.8% and an average recall of 81.5% for the three permissions. In summary, our results demonstrate great promise in using NLP techniques to bridge the semantic gap between user expectations and application behavior, aiding the risk assessment of mobile applications.

8 Acknowledgments

This work was supported in part by an NSA Science of Security Lablet grant at North Carolina State University, and NSF grants CCF-0845272, CCF-0915400, CNS-0958235, CNS-1160603, CNS-1222680, and CNS-1253346. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies. We would also like to thank the conference reviewers and shepherds for their feedback in finalizing this paper.

References

[1] H. Lockheimer, “Android and security,” Google Mobile Blog, Feb. 2012, http://googlemobile.blogspot.com/2012/02/android-and-security.html.

[2] A. P. Felt, M. Finifter, E. Chin, S. Hanna, and D. Wagner, “A survey of mobile malware in the wild,” in Proc. 1st ACM SPSM Workshop, 2011, pp. 3–14.

[3] Y. Zhou, Z. Wang, W. Zhou, and X. Jiang, “Hey, you, get off of my market: Detecting malicious Apps in official and alternative Android markets,” in Proc. of 19th NDSS, 2012.

[4] H. Peng, C. Gates, B. Sarma, N. Li, Y. Qi, R. Potharaju, C. Nita-Rotaru, and I. Molloy, “Using probabilistic generative models for ranking risks of Android Apps,” in Proc. of 19th ACM CCS, 2012, pp. 241–252.

[5] S. Chakradeo, B. Reaves, P. Traynor, and W. Enck, “MAST: Triage for market-scale mobile malware analysis,” in Proc. of 6th ACM WiSec, 2013, pp. 13–24.

[6] M. Egele, C. Kruegel, E. Kirda, and G. Vigna, “PiOS: Detecting privacy leaks in iOS applications,” in Proc. of 18th NDSS, 2011.

[7] W. Enck, D. Octeau, P. McDaniel, and S. Chaudhuri, “A study of Android application security,” in Proc. 20th USENIX Security Symposium, 2011, p. 21.

[8] Y. Zhou and X. Jiang, “Dissecting Android malware: Characterization and evolution,” in Proc. of IEEE Symposium on Security and Privacy, 2012, pp. 95–109.

[9] M. Grace, Y. Zhou, Q. Zhang, S. Zou, and X. Jiang, “RiskRanker: Scalable and accurate zero-day Android malware detection,” in Proc. of 10th MobiSys, 2012, pp. 281–294.

[10] C. Gibler, J. Crussell, J. Erickson, and H. Chen, “AndroidLeaks: Automatically detecting potential privacy leaks in Android applications on a large scale,” in Proc. of 5th TRUST, 2012, pp. 291–307.

[11] W. Enck, P. Gilbert, B. Chun, L. Cox, J. Jung, P. McDaniel, and A. Sheth, “TaintDroid: an information-flow tracking system for realtime privacy monitoring on smartphones,” in Proc. of 9th USENIX OSDI, 2010, pp. 1–6.

[12] P. Hornyack, S. Han, J. Jung, S. Schechter, and D. Wetherall, “These aren’t the droids you’re looking for: Retrofitting Android to protect data from imperious applications,” in Proc. 18th ACM CCS, 2011, pp. 639–652.

[13] L. K. Yan and H. Yin, “DroidScope: Seamlessly reconstructing the OS and Dalvik semantic views for dynamic Android malware analysis,” in Proc. of 21st USENIX Security Symposium, 2012, p. 29.

[14] “Whyper,” https://sites.google.com/site/whypermission/.

[15] W. Enck, M. Ongtang, and P. McDaniel, “On lightweight mobile phone application certification,” in Proc. of 16th ACM CCS, 2009, pp. 235–245.

[16] D. Barrera, H. G. Kayacik, P. C. van Oorschot, and A. Somayaji, “A methodology for empirical analysis of permission-based security models and its application to Android,” in Proc. of 17th ACM CCS, 2010, pp. 73–84.

[17] A. P. Felt, K. Greenwood, and D. Wagner, “The effectiveness of application permissions,” in Proc. of 2nd USENIX WebApps, 2011, pp. 7–7.

[18] A. P. Felt, E. Ha, S. Egelman, A. Haney, E. Chin, and D. Wagner, “Android permissions: User attention, comprehension and behavior,” in Proc. of 8th SOUPS, 2012, p. 3.

[19] P. McDaniel and W. Enck, “Not so great expectations: Why application markets haven’t failed security,” IEEE Security & Privacy Magazine, vol. 8, no. 5, pp. 76–78, 2010.

[20] J. Han, Q. Yan, D. Gao, J. Zhou, and R. Deng, “Comparing mobile privacy protection through cross-platform applications,” in Proc. of 20th NDSS, 2013.

[21] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer, “Feature-rich part-of-speech tagging with a cyclic dependency network,” in Proc. HLT-NAACL, 2003, pp. 252–259.

[22] D. Klein and C. D. Manning, “Fast exact inference with a factored model for natural language parsing,” in Proc. 15th NIPS, 2003, pp. 3–10.

[23] “The Stanford Natural Language ProcessingGroup,” 1999, http://nlp.stanford.edu/.

[24] M. C. de Marneffe, B. MacCartney, and C. D. Manning, “Generating typed dependency parses from phrase structure parses,” in Proc. 5th LREC, 2006, pp. 449–454.

[25] M. C. de Marneffe and C. D. Manning, “The Stanford typed dependencies representation,” in Proc. COLING Workshop, 2008, pp. 1–8.

[26] J. R. Finkel, T. Grenager, and C. Manning, “Incorporating non-local information into information extraction systems by Gibbs sampling,” in Proc. 43rd ACL, 2005, pp. 363–370.

[27] J. Lin, N. Sadeh, S. Amini, J. Lindqvist, J. I. Hong, and J. Zhang, “Expectation and purpose: understanding users’ mental models of mobile App privacy through crowdsourcing,” in Proc. 14th ACM Ubicomp, 2012, pp. 501–510.

[28] A. Felt, E. Chin, S. Hanna, D. Song, and D. Wagner, “Android permissions demystified,” in Proc. of 18th ACM CCS, 2011, pp. 627–638.

[29] C. Fellbaum et al., WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press, 1998.

[30] D. Klein and C. D. Manning, “Accurate unlexicalized parsing,” in Proc. 41st ACL, 2003, pp. 423–430.

[31] A. Sinha, A. M. Paradkar, P. Kumanan, and B. Boguraev, “A linguistic analysis engine for natural language use case description and its application to dependability analysis in industrial use cases,” in Proc. 39th DSN, 2009, pp. 327–336.

[32] A. Sinha, S. M. Sutton Jr., and A. Paradkar, “Text2Test: Automated inspection of natural language use cases,” in Proc. 3rd ICST, 2010, pp. 155–164.

[33] R. Pandita, X. Xiao, H. Zhong, T. Xie, S. Oney, and A. Paradkar, “Inferring method specifications from natural language API descriptions,” in Proc. 34th ICSE, 2012, pp. 815–825.

[34] B. K. Boguraev, “Towards finite-state analysis of lexical cohesion,” in Proc. 3rd FSMNLP, 2000.

[35] M. Stickel and M. Tyson, FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text. MIT Press, 1997.

[36] G. Grefenstette, Light Parsing as Finite State Filtering. Cambridge University Press, 1999.

[37] Q. Do, D. Roth, M. Sammons, Y. Tu, and V. Vydiswaran, “Robust, light-weight approaches to compute lexical similarity,” University of Illinois, Computer Science Research and Technical Reports, 2009.

[38] K. W. Y. Au, Y. F. Zhou, Z. Huang, and D. Lie, “PScout: analyzing the Android permission specification,” in Proc. 19th CCS, 2012, pp. 217–228.

[39] D. Olson, Advanced Data Mining Techniques. Springer Verlag, 2008.

[40] S. Thummalapenta and T. Xie, “PARSEWeb: A programmer assistant for reusing open source code on the web,” in Proc. 22nd ASE, 2007, pp. 204–213.

[41] S. P. Reiss, “Semantics-based code search,” in Proc. 31st ICSE, 2009, pp. 243–253.

[42] M. Harman, Y. Jia, and Y. Zhang, “App store mining and analysis: MSR for app stores,” in Proc. 9th IEEE MSR, 2012, pp. 108–111.

[43] V. Gervasi and D. Zowghi, “Reasoning about inconsistencies in natural language requirements,” ACM Transactions on Software Engineering and Methodology, vol. 14, pp. 277–330, 2005.

[44] U. Dekel and J. D. Herbsleb, “Improving API documentation usability with knowledge pushing,” in Proc. 31st ICSE, 2009, pp. 320–330.

[45] L. Tan, D. Yuan, G. Krishna, and Y. Zhou, “/*iComment: Bugs or bad comments?*/,” in Proc. 21st SOSP, 2007, pp. 145–158.

[46] H. Zhong, L. Zhang, T. Xie, and H. Mei, “Inferring resource specifications from natural language API documentation,” in Proc. 24th ASE, November 2009, pp. 307–318.

[47] X. Xiao, A. Paradkar, S. Thummalapenta, and T. Xie, “Automated extraction of security policies from natural-language software documents,” in Proc. 20th FSE, 2012, pp. 12:1–12:11.

[48] H. Zhou, F. Chen, and H. Yang, “Developing application specific ontology for program comprehension by combining domain ontology with code ontology,” in Proc. 8th QSIC, 2008, pp. 225–234.

[49] G. Sridhara, L. Pollock, and K. Vijay-Shanker, “Generating parameter comments and integrating with method summaries,” in Proc. 19th ICPC, 2011, pp. 71–80.

[50] P. Robillard, “Schematic pseudocode for program constructs and its computer automation by schemacode,” Comm. of the ACM, vol. 29, no. 11, pp. 1072–1089, 1986.
