hiding in plain site: detecting javascript obfuscation ...of javascript (js), source code is...

Hiding in Plain Site Detecting JavaScript Obfuscationthrough Concealed Browser API Usage

Shaown SarkerNorth Carolina State University

ssarkerncsuedu

Jordan JueckstockNorth Carolina State University

jjuecksncsuedu

Alexandros KapravelosNorth Carolina State University

akapravncsuedu

ABSTRACTIn this paper we perform a large-scale measurement study ofJavaScript obfuscation of browser APIs in the wild We rely ona simple but powerful observation if dynamic analysis of a scriptrsquosbehavior (specifically how it interacts with browser APIs) revealsbrowser API feature usage that cannot be reconciled with staticanalysis of the scriptrsquos source code then that behavior is obfus-cated To quantify and test this observation we create a hybridanalysis platform using instrumented Chromium to log all browserAPI accesses by the scripts executed when a user visits a page Wefilter the API access traces from our dynamic analysis through astatic analysis tool that we developed in order to quantify howmuch and what kind of functionality is hidden on the web Whenapplying this methodology across the Alexa top 100k domains wediscover that 9590 of the domains we successfully visited containat least one script which invokes APIs that cannot be resolved fromstatic analysis We observe that eval is no longer the prominentobfuscation method on the web and we uncover families of novelobfuscation techniques that no longer rely on the use of eval

ACM Reference formatShaown Sarker Jordan Jueckstock and Alexandros Kapravelos 2020 Hidingin Plain Site Detecting JavaScript Obfuscation through Concealed BrowserAPI Usage In Proceedings of ACM Internet Measurement Conference VirtualEvent USA October 27ndash29 2020 (IMC rsquo20) 14 pageshttpsdoiorg10114534193943423616

1 INTRODUCTIONSource code obfuscation can be defined as making a program un-intelligible while maintaining its functionality intact [5] In caseof JavaScript (JS) source code is considered obfuscated if the logicand meaning is transformed in a way intended to make it difficultfor a human to understand or reverse-engineer [43] ObfuscatedJS is used not only for malicious purposes like to deliver zero-daybrowser exploits [35] but also to hide the logic of legitimate webapplications as part of software protection [11]

As the web becomes ever ubiquitous there has been a consid-erable amount of increase in both client-side [27 50] and server-side [49] JS-based attacks along with vast amounts of intellectual

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page Copyrights for components of this work owned by others than ACMmust be honored Abstracting with credit is permitted To copy otherwise or republishto post on servers or to redistribute to lists requires prior specific permission andor afee Request permissions from permissionsacmorgIMC rsquo20 October 27ndash29 2020 Virtual Event USAcopy 2020 Association for Computing MachineryACM ISBN 978-1-4503-8138-32010 $1500httpsdoiorg10114534193943423616

property moving to the browserrsquos client-side Both of these in-creasingly involve JS obfuscation as a method to either evade andobstruct detection [27 38] or to protect the intellectual propertyfrom being pirated [11] Unfortunately we currently lack the toolsto automatically analyze the web and identify new obfuscationtechniques and this affects our ability to accurately measure anddetect web threats

Despite the prevalence of JS obfuscation it is addressed briefly asa side-effect in a number of prior research work investigating mali-cious JS-based attacks [9 13 17 34] A number of studies exploredJS obfuscation in combination with identifying JS maliciousnessthe underlying implication being malicious JS should possess ob-fuscation properties [2 16 36 40 58] There also exist studies thatdiscriminate between JS obfuscation in benign and malicious sourcecodes [14 33]

Unfortunately little research has been conducted into the topicof identifying JS obfuscation as a technique itself without the intentof it [3 10 30 31 33 51 53] However all of these prior studiesalmost exclusively depended on creating a model first (either withstatic or dynamic analysis) and used a small set of JS dataset forthe purpose of training and validation - thus limiting their resultfor being used in a real world scenario

It is notable that prior research into JS obfuscation as a transportmechanism explored dynamic code execution mainly through evalto the point that obfuscation of browser client-side code has becomesynonymous with eval and the built-in function has garnerednotoriety However with the advancement of both the web andalong with it JS we increasingly see obfuscated scripts that manglethe source code to conceal the browser API calls to hide the truemotivation of the source code Intuitively this holds true as any JSsource attempting to interact with the browser andor the userrsquosecosystem has to pass through the browser API

In this paper we investigate the nature of JS obfuscation throughits concealing effect of JS browser API features We specifically fo-cus on quantifying obfuscation that is tied to hiding the interactionsof the script with the browser Our approach does not cover all typesof obfuscation as there can exist obfuscated JS scripts in the wildthat do not use any browser APIs We argue that these scripts wouldbe extremely limited as any script with malicious intend or com-plex behavior such as drive-by downloads fingerprintingtrackingscripts crypto-currency mining scripts malicious advertisementshave to use several browser APIs to perform its tasks and achievethe goal of the attacker Additionally JS scripts that do not use anybrowser features are by definition not interesting due to their verylimited capabilities

Our proposed analysis system combines the runtime footprintof JS with the static analysis of its source code to identify JS obfus-cation which requires no ground truth for training purposes We

IMC rsquo20 October 27ndash29 2020 Virtual Event USA Sarker et al

apply our system on pages collected over the Alexa top 100k do-mains to conduct a large-scale measurement study of JS obfuscationon the web ecosystem - including measuring traces of JS featureconcealing obfuscation on the web the origin of such obfuscatedscripts throughout the web most frequently obfuscated JS featuresthrough concealing and state of the art obfuscation techniquesused by real-world JS scripts that do not rely on eval In summaryour main contributions include

bull We propose a novel approach of identifying JS obfuscationthrough feature concealing based on the intuition that theruntime behavior of code should be evident by static analysison source code

bull We present results from a large-scale analysis of the JS fea-ture obfuscation effect in the wild We crawled the Alexa top100k domains and quantified this obfuscation pinpointedthe sources of obfuscation and measured the hidden func-tionality that lies behind such obfuscated API function callsand JS property accesses

bull We identify prominent new obfuscation techniques througha data-driven approach and present case studies that nolonger rely on the notorious eval function

2 DEFINING OBFUSCATIONBarak et al defined program obfuscation as The goal of programobfuscation is to make a program unintelligible while preserving itsfunctionality [5] Extrapolating from this definition JS obfuscationin its simplest form can be defined as when the intentional behaviorof a script cannot be fully realized until execution

Our hypothesis is that if the interactions of code with its underlyingsystem cannot be deduced from static analysis of its source code thenthe code is concealing these interactions and thus can be consideredobfuscated

Our goal with this new definition of obfuscation is to expressthe level of difficulty that static analysis tools and human analystswill have when analyzing code Notice that our definition of obfus-cation is more relaxed and can include unintended concealing offunctionality for example through heavy modification of sourcecode via minification (ie rewriting the source code to be morecompact without changing its functionality) However dynamicanalysis is dependent on the execution flow and thus is not exhaus-tive when it comes to reveal all the capabilities of the code Thiscan be alleviated through techniques such as force execution [37]which is out of scope for this paper

In this work we focus on obfuscation on the web where the codeis JavaScript and the system is the browser Interactions betweenthe code (JS) and the system (browser) are defined as invocationsof browser API features (property accesses or function calls) Forthe rest of the paper when we refer to JS obfuscation it is in termsof this hypothesis and this behavior of concealing browser APIfeatures

We describe our data collection system architecture and the sub-sequent analysis to detect this obfuscation in sect3 and sect4 respectivelyfollowed by a validation of our detection system in sect5 We then ap-ply our detection system to collect and identify obfuscation tracesfrom the Alexa top 100K in sect6 and present our findings in sect7 andsect8

3 DATA COLLECTION ARCHITECTURETo detect JS obfuscation traces in accordance with our definitionwe implemented an automated web crawling system capable ofcollecting JS script execution traces as depicted in Figure 1 and asubsequent post-processing system for the collected traces

31 Web CrawlerOur data collection worker from Figure 1 is a Dockerized containerconsisting of a web crawler and a log consumer The web crawlerdescribed in this section uses a custom build of Chromium pro-ducing JS execution traces in the form of log files (sect32) The logconsumer is a Go-based tool to compress the trace logs and archivethem after a page visit is completed (sect33)

The web crawler is based on the NodeJS library Puppeteer [23]which automates the Chromium browser and communicates withthe browser over the Chrome DevTools protocol [26] The datacollection worker pulls a job with a domain from a Redis queue andproceeds to visit the domainrsquos webpage For each domain usingPuppeteer our crawler launches an instance of our custom buildof Chromium in headless mode opens a tab and navigates to theURL concocted by prepending the domain with http whilemonitoring the progress of the visit We set a fixed 15 second timelimit for navigating to the page and upon finishing navigation weremain on the page for an additional amount of time to let the pageload any resources as needed In our system both the navigationand the subsequent loitering on the page are also limited to a total of30 seconds beginning from the start of the navigation after whichwe tear down the page and the tab followed by the closing of thebrowser instance

During the visit we collect a number of auxiliary information foreach visit - including all network requests made the headers andbodies of all HTTP resources downloaded and all script sourcesparsed through the debugger using standardDevTools protocol com-mands and event handlers over Puppeteer All of this informationare stored into a MongoDB document storage in an asynchronousmanner (as encountered) during the visit of the page

32 Instrumented BrowserBased on our hypothesis to identify the presence of obfuscationconcealing browser API features in JS source code we need toperform dynamic analysis to trace the runtime behavior of a JSscript and compare that behavior to a static analysis of the sourcetext We use VisibleV8 [32] a custom open-source variant of theV8 JS engine powering the Chromium browser to capture browserAPI accesses Much like the Linux strace tool logs Linux systemcalls made by native applications VisibleV8 (VV8) logs browserAPI function calls and property accesses made by JS applicationsVV8rsquos in-browser architecture provides significant coverage androbustness advantages over alternative approaches to JS instru-mentation that rely on JS script injection prototype patching orbrowser extensions Furthermore the fact that VV8rsquos instrumenta-tion is embedded invisibly inside the JS engine itself provides moreintrinsic stealth than generic prototype patching JS performanceunder VV8 can be significantly lower than that of stock Chromium

Hiding in Plain Site Detecting JavaScript Obfuscationthrough Concealed Browser API Usage IMC rsquo20 October 27ndash29 2020 Virtual Event USA

Instrumented V8

Chrome DevToolsProtocol Remote

Session

Custom build of Chromium

UncompressedTrace Log

Files

Crawler Worker

Log Consumer

Data CollectionWork Queue

Data Collection Worker

MongoDBArchives

CompressedLog Data

CandidateDomains

MetaInformation

Web

Figure 1 Data collection pipeline for detecting JS obfuscation

especially on micro-benchmarks[32] but general browsing perfor-mance of JS-intensive applications remained comfortably usableand scalable for our experiments

VV8rsquos instrumentation of browserAPIs (eg Window and Document)to the exclusion of builtin APIs [44] (eg Math and Date) fits wellwith our hypothesis Browser APIs constitute the interface betweenuntrusted JS code and native browser functionality in much thesame way that operating system (OS) system calls constitute theinterface between untrusted user applications and the trusted OSkernel Since JS code cannot do anything ldquointerestingrdquo (eg anyinput or output of private user data or any other interaction withthe user) without invoking browser APIs they furthermore pro-vide a ldquolayer of truthrdquo at which at least fragments of the scriptrsquosintent becomes apparent regardless of how contorted its internallogic may be To get the exact number of available browser API fea-tures we analyzed and processed the WebIDL specification of theChromium browser identifying 6997 unique API features for ouranalysis We refer to the browser API features from this point on inthe paper simply as ldquoAPI featuresrdquo or ldquoJS featuresrdquo interchangeablyfor brevity

While VV8 has traditionally been used in stock Chromium weembedded it into a custom build of the Chromium-based Brave [7]browser to take advantage of Braversquos promising PageGraph tech-nology Still under heavy development PageGraph is an opensource [8] derivative of the technology behind Braversquos AdGraph [28]which involves pervasive instrumentation of both V8 and the Blinkrendering engine at the heart of Chromium In our system Page-Graph complements VV8rsquos low-level tracing of all browser API ac-cesses with high level tracking of script provenance (sect7) includingvia complicated DOM interactions and dynamic script injections

Since both the VV8 and PageGraph instrumentation is built intoa production browser (ie the Chromium-based Brave) data collec-tion can be performed via standard browser automation Automated

browser visits to selected websites results in VV8 trace logs con-taining context information such as effective security origin URLsource code location of API accesses names of the API functionor property and any data being passed into or written to loggedfunctions or properties PageGraph data can be extracted throughan API exposed via the Chrome DevTools interface in Brave

33 Log ConsumerThe log consumer is a Go-based tool which has two responsibilitiesin our system The first as shown in Figure 1 is to compress andarchive the trace logs produced by VV8 during our page visit intoMongoDB The log consumer is independent of the crawler so thatthe crawler can tear down the browser instance and the tab withoutit interfering

Secondly the log consumer is invoked once again as part of thepost-processing of the VV8 trace logs VV8 records the completeJS source code of every script processed through it via the tracelogs and this record is kept exactly once per log for brevity Thepost-processing extracts and archives all such scripts with a uniqueidentifier of script hash into a PostgreSQL system for further anal-ysis The script hash serves as an identifier for the script in thedatabase and it is derived by computing the SHA256 hash of theentire textual source of the script

Additionally the post-processing also extracts and records anAPI feature usage tuple consisting of a distinct combination of thefollowing information

bull Visit Domain the domain being visited by the page itselfbull Security Origin the runtime evaluation of windoworiginwhich could differ from the visited domain based on whetherthe execution context was from an iframe

bull Active Script identified by its script hashbull Feature Offset the character offset within the source forthe API feature usage


FilteringPass

ASTAnalysis

Feature SitesIndirect Sites Unresolved

Indirect Sites

Direct Sites

ResolvedIndirect Sites

Unobfuscated Sites

Obfuscated Sites

Figure 2 Static analysis steps for detecting JS obfuscation

bull Feature UsageMode how the feature was used (ie a prop-erty getset or a function call)

bull Feature Name a combination of the type of the native JSobject and the accessed member (the property or the methodname) of that object (eg DocumentcreateElement)

We commonly refer to the combination of feature name featureoffset and feature usage mode on a particular script as a featuresite of the script

4 DETECTING OBFUSCATIONWith our web crawling and subsequent post-processing of the VV8trace logs providing us with feature usage information we proceedto analyze the API feature sites as described in sect33 We want toexamine these feature sites within the scriptsrsquo source code andcheck whether these can be identified with a static analysis of thesource If any of the feature sites cannot be identified through thisstatic analysis we mark them as a trace of obfuscation concealingan API feature usage since the usage cannot be deduced from thestatic analysis by itself without the trace information

For our analysis we conduct a two-step static analysis over thefeature sites data as shown in Figure 2 We describe each step andthe corresponding analysis in detail in the following sections

41 Filtering PassThe initial filtering pass over the feature site data is based on theintuition that the majority of the feature usages are not obfuscatedand thus can be resolved simply through character offset exami-nation in the textual script source The intention of this pass is tofilter out obvious non-obfuscated feature sites very quickly andmark the feature sites suspected of containing traces of obfuscationfor further static analysis

To achieve this for each feature site (feature name characteroffset and usage) of a script we extract the token at the characteroffset with the same length of the accessed member part of thefeature name from the scriptrsquos source and then compare this tokenwith the accessed member part For example if our feature site tuplehas a feature name of Documentwritewith character offset of 100we extract a token of length 5 starting from the textual source atcharacter offset 100 and ending at offset 104 If this extracted tokenmatches with the accessed member (write in case of our example)part of the feature name we mark this feature site as a direct siteThe underlying assumption here is that to use a native JS APIfeature (via call or property access) without any obfuscation of thesource one would have been likely to refer to the function call or

property access by the accessed member part of the usage tuplein the source In the case of a mismatch we mark the feature siteas an indirect site and as a candidate that requires more involvedstatic analysis

42 Abstract Syntax Tree AnalysisWe subject the indirect feature sites from the filtering pass to acustom heuristic-based static analysis tool This tool makes a best-effort attempt to resolve these sites through the analysis of AbstractSyntax Tree (AST) of the scriptsrsquo textual sources combined withthe variable scope information The aim of this analysis tool isto identify certain basic human intelligible patterns in the scriptsource through which these indirect feature usages could have beenmade A successful resolve attempt indicates either no obfuscationat all or minor indirection that provides weak obfuscation at bestwhich can be understood through manual inspection by a humanexaminer Failure to resolve indicates a certain level of obfuscationwhether deliberate (eg packed or mangled code) or accidental(heavy indirection and dynamism in the scriptrsquos source) Eitherpossibility is covered under our definition of obfuscation in termsof concealed behavior

Human Identifiable Patterns Consider the case of an indirectfeature site involving a function call There are two distinct simpleways to invoke a function in JS given the function name identifiermdashwe can either append a pair of parentheses with the arguments tobe passed to the function enclosed within or we can invoke anyof the call apply bind methods defined on the prototype of theFunction object Since functions are first class objects in JS theycan be treated as any other variables For feature sites making anindirect API function call the feature offset should either contain aregular call on a source token that does not resemble the accessedproperty of the feature name or an invocation of any of the callapply bind methods also on an non-resembling token In eithercase we need to trace back the token to the accessed property ofthe feature name in question

In the case of property accesses due to the dynamic nature of JSthere are myriad possible ways a property access can be performedWe focus on a limited subset of patterns a human examiner canresolve through the inspection of the scriptrsquos source code Thissubset included property accesses as part of a logical expression(eg var a = false || name window[a] = value)an assignment redirection (eg var p = name q = p window[q] = value) or a member access expression on anobject (obj[p] = name window[objp] = value)In the examples the ellipsis designates other lines of source codewithin the script and both the expression and the following refer-ence shares a scope In all these cases again we need to trace backa non-resembling token to the feature namersquos accessed property

The Resolving Algorithm To perform our analysis we parsethe script source into an AST using the Esprima[4] JavaScript parserand identify the variable scopes references and binding expres-sions with the complementary EScope[21] analysis library EScopeprovides all the variable scopes statically derived through the ASTin nested form and can provide the current scope for a given ASTnode with a reference to both the parent scope and the childrenscopes


For each indirect feature site we identify the originating ASTnode by first finding the AST leaf node that contains the targetoffset of the site Next we reach that leaf nodersquos nearest parentnode of the appropriate type member access expression nodes forfeature site involving property retrieval assignment expressionnodes for feature site involving property setting and function callexpression nodes for feature site involving function calls At thispoint our algorithm performs a recursive walk of the AST untilwe can resolve the expression under examination to the accessedproperty part of the feature name or a certain recursion level isreached (in our case this level was 50)

At each recursive AST walk we expand the expression intoAST nodes and kept reducing it by focusing on the particular nodeof interest in the expression - based on the expression subset wedescribed previously until we pinpoint either to a literal node or anexpression node If we reduce to a literal value we check it againstour accessed property of the feature site and report if there is amatch or not

However if we reduce to an expression rather than a literal weinvoke an evaluation routine for the expression to resolve it to aliteral value for checking with our accessed property This evalu-ation routine is a JS interpreter for a subset of the AST structurewhich can potentially be resolved by a human examiner throughinspection This subset includes references to bound identifier vari-ables string concatenations object member accesses array literalsand method calls for which the receiver and all arguments can beevaluated statically

When the evaluation routine reduce the expression to an iden-tifier we search for the variable corresponding to that identifierwithin the nearest enclosing scope as given by EScope Upon find-ing the variable we iterate over the variablersquos references withinthe current scope If the variable has a write expression1 of a literalvalue we check the literal value with the accessed property Oth-erwise we invoke the evaluation routine recursively on the writeexpression

We mark the indirect feature site as resolved when we could finda match to the resolved literal value for the accessed property If weencounter an expression outside of our subset during the recursivewalk or expression evaluation or the recursion depth reaches themaximum level or the resolved literal value does not match wemark the indirect feature site as unresolved These unresolved fea-ture sites remaining after the AST-based analysis on our indirectsites contain feature usages that cannot not be deduced from staticanalysis of the source and thus we identify the script to containfeature concealing obfuscation1 var global = window2 var prop = Left Rightsplit( )[0]3 global[client + prop]

Listing 1 Example for expression evaluation routine

Listing 1 shows an example scenario to explain how our evalua-tion routine works Given we are aware of the property access atline 3 (windowclientLeft) through our trace logs we invoke ourroutine to evaluate the member access expression Since this is astring concatenation expression with a literal and an identifier we

1In EScope terminology write expressions are assignments to a bound variable withina scope

recursively invoke the evaluation routine on the identifier Fromour variable scope analysis we see that this identifier has a writeexpression of an array indexing expression over a method callexpression on a literal value We recursively invoke our evaluationroutine on these expressions until we reach the literal value andon our way back on the recursion we evaluate each expressionuntil we finally evaluate the prop variable to the literal value ofLeft We finally evaluate our string concatenation expressionwith the value of prop to be clientLeft and since this matchesour accessed property of the feature name we mark the feature siteas resolved

However the script excerpt in Listing 1 can be argued to bepartially obfuscatedmdashdepending on the subjective judgment of theexaminer Due to the aggressiveness of our AST-based resolvingalgorithm we can confidently claim that any unresolved featuresite after passing though our analysis should be obfuscated andour detection system provides a conservative bound of obfuscationtraces

5 VALIDATING THE HYPOTHESISIf we assume our hypothesis (sect2) to be true then the following twosub-hypotheses should also hold 1 for a script completely devoid ofany obfuscation we should be able to identify all executing browserAPI calls and property accesses through static analysis of the source2 for an obfuscated script we should be able to observe executingbrowser API calls and property accesses that we cannot resolve throughstatic analysis of the code

To test these two sub-hypotheses we selected a set of candidatescripts and a set of web domains on which they are statically in-cluded set up a specialized web-crawling system through which wevisited our selected domains and performed the analysis describedin sect4 In the following sections we describe the candidate selectionprocess and the candidates themselves followed by the analysisresults of our detection system

51 Candidate SelectionTo test our first sub-hypothesis we required scripts without any ob-fuscation and also preferably without any compression or minifica-tion as some compression orminification tools such as UglifyJS[24]can perform a certain degree of optimization during the compres-sion phase that can introduce obfuscation by altering the logicalstructure of the script We refrained from using the minified ver-sions of the scripts available as there was no way to determine thetool and the level of minification used for these scripts The bestinitial candidates we found were the developer versions of popularthird-party JS libraries These scripts are typically provided withoutany minification and are used for debugging purposes

To identify our candidates we focused on one of the biggestcontent delivery network (CDN) system for popular JS third-partylibraries Cloudflarersquos cdnjs [19] which serves 19 of the top one-million websites [54] We picked the most-downloaded uniquelibraries through cdnjs based on the download statics for the monthof September 2019 [20] and filtered any libraries that are not JS (egfont-awesome) or libraries that do not provide developer versionsof their source code (eg gsap) We came up with 15 libraries afterthis filtering listed in Table 7 in appendix A


However in most real-world websites developers opt to use theminified versions of third party libraries to improve network per-formance and thus the developer versions of the script are barelyif at all used in real-world websites To select the domains usingany of the third-party libraries of our selection we retrieved boththe developer and minified source code for all semantic versionsavailable through cdnjs and computed the SHA-256 hash value forthe minified source - giving us 545 distinct hash pairs We then per-formed a search for these 545 hash values in our previously crawleddataset for the Alexa top 100K domains (crawled within the firstweek of October 2019 - we describe this crawl in sect6) which con-tains the DOM content and the scripts present on the root landingpage of each of the domains along with their corresponding SHA-256 hashes The matched domains constituted the set of candidatedomains where we could use the developer versions of the scriptsand have the page behave similarly We found a total of 41055 do-mains with hash value matches for 207 distinct semantic versions(obtained from cdnjs) of our 15 libraries (Table 8 in appendix B)

For each of the libraries we took the 10 highest ranked domains(irrespective of the specific included semantic versions) as candi-dates except for lodash which had eight matches only all of whichwere taken as candidates for our validation system We excludedpopperjs as we found only a single domain with a hash valuematch in our dataset thus giving us 138 domains covering 64 se-mantic versions of these 14 libraries After de-duplication we endedup with 116 distinct domains as our candidate set for the validationsystem to visit with the developer version of the libraries

For the second sub-hypothesis we decided to use deliberatelyobfuscated versions of our developer version JS scripts This waywe can also demonstrate that obfuscation does result in concealedAPI calls or property accesses To obtain the deliberately obfuscatedversions of our developer version scripts we used the widely popu-lar JS obfuscation tool JavaScript Obfuscator [22] which has both aweb-interface [48] and a command line tool through npm packagemanager with weekly downloads averaging closer to 20k [45] Webased our tool selection by the comparative obfuscation tool studyconducted by Skolka et al in their work [53] where JavaScriptObfuscator was the top tool with the majority of highly used ob-fuscation features

52 Record amp Replay WebpagesFor the validation system we needed a way to visit the candidatedomains twice - once with the developer versions of the candidatescripts and again with their tool-assisted obfuscated counterpartsHowever the domains as we have seen used the compressed ver-sions of the candidate scripts To circumvent this issue we reliedupon the Web Page Replay (WPR) tool [25] part of the catapultproject [18] from Google WPR is a Go-based tool that when usedin record mode positions itself between the browser and the web asa proxy The Chromium browser instance can connect to the WPRserver over a proxy connection and any subsequent web page visitduring the connected session is recorded in a compressed archivewhich contains all requests and the corresponding responses be-tween the browser and the web for a specific session Similarlywhen WPR is run in replay mode with a prior recorded archivethe browser can reenact all request responses exactly as performed

during the record session given the requests are present in thearchivemdashthus reconstructing the exact same web page in effectduring the recording We used the data collection system describedin sect3 in combination with theWPR server for our page visits duringvalidation

In our system we visited each of our candidate domains threetimes once in record mode and twice in replay mode The visit inrecord mode was to create the archive for the candidate domainsrsquowebpages with the candidate scriptsrsquo requests later to be used forreplaying the page Our crawler launched the WPR server andadded the proxy details to the Chromium browserrsquos launch optionsAfter the page visit was completed the crawler closed the WPRserver triggering it to write the recorded archive in a file on disk

During our recording visit2 for the 116 candidate domains wefound 57 distinct semantic versions of our 14 candidate libraries outof the initial 64 distinct versions matched in our Alexa crawl data -this is most likely due to the websites updating their third-partyscripts during the time between our Alexa crawl and the validationcrawl

We visited each of the candidate domains twicewith this recordedarchive data - once for the developer versions of the scripts andagain for the tool-assisted obfuscated versions of the scripts Wewrote a Go-based tool named wprmod to alter the WPR recordarchivesrsquo request-responses given the SHA-256 hash of the responsebody to replace and the actual content in textual format to replaceit with With this altered archive WPR replays provided the re-placed content for the same request performed during recordingOur replay runs replaced the used versions of the scripts with ourdeveloper and obfuscated versions respectively and visited thecandidate pages with this altered archive

However we observed that in our recorded archives some ofthe requestsrsquo responses had compression encoding scheme mis-matches For example in the request header it mentioned gzip ascompression encoding and then provided a response body withencoding utf-8 These were clearly server configuration errorsfrom the site developers and caused a few decompression errorin our wprmod tool where we did not rewrite these the requestrsquosresponse body in such cases So during our replay for the developerversions of the scripts we replaced 51 distinct semantic versions ofthe 57 compressed versions

To generate our deliberately obfuscated scripts we passed the de-veloper versions of the scripts through the command-line version ofthe JavaScript Obfuscator tool The authors of the tool provide threepreset configurations for the tool and we used the most popularconfiguration with medium obfuscation and optimal performancefor generating our deliberately obfuscated versions of the candidatescripts We refrained from using the maximum obfuscation settingto keep script breakage to a minimum - only 34 out of 51 versionsof developer scripts did not result in either a time-out or execeptionwith maximum settings We replaced 50 distinct semantic versionsout of the 51 versions of developer scripts - only a single library(json3 version 332) failed to parse with the Javascript Obfuscatortool

2Done in October 2019


Developer ObfuscatedDirect 3050 250Indirect - Resolved 15 757Indirect - Unresolved 20 2009Total 3085 3012

Table 1 Breakdown of feature sites after the two-step anal-ysis on candidate scripts

53 Validation ResultsIn our post-processed data we had 3085 and 3012 distinct APIfeature sites from the 51 developer version scripts and the 50 ob-fuscated scripts respectively We applied our two-step detectionsystem as described in sect4 on these features sites Table 1 shows thebreakdown of the feature sites over our developer and obfuscatedversions of the candidate scripts after our analysis

We manually checked the 20 indirect feature sites on the de-veloper versions of the candidate scripts that were marked unre-solved by our analysis We found that all of these feature sites wereproperty accesses through a wrapper function of the form f =function (recv prop) recv[prop] Using thisthe property accesses were made by invoking this function egf(window location) where the function invocations werenot necessarily part of the script itself This was an indirectionmechanism which could only be resolved by a human examinerif the examiner had access to the entire call stack However staticanalysis of variable scope is incapable of evaluating callee argumentvalues through the call expressions Based on this we classifiedthese as legitimate unresolved feature sites

Given the sheer number of unresolved feature sites present inobfuscated candidate scripts (2009mdash6670 of total sites) comparedto the developer versions (20mdash064 of total sites) after our anal-ysis system we concluded that both of our sub-hypotheses holdand establish that comparing dynamic trace information with staticanalysis for API features is a viable way to reveal feature concealingobfuscations in JS scripts

6 ALEXA TOP 100K DATA COLLECTIONTo measure the effect of JS obfuscation through concealed APIusage on the web we reused our crawling system from sect3 to collectdata from the Alexa top 100K domains3 We deployed this crawlover a Kubernetes cluster with 80 CPU cores and 512GB of memoryconnecting to the internet from our university network

We queued theAlexa top 100k domains excluding the 37 Punycode-encoded [12] domain names that our queuing logic did not processto our web crawler as described in sect31 Of the queued domains85470 completed successfully ie the crawler completed the visitwithout error Table 2 shows the major broad categories of the14493 page visit failures Most of the network failures occurreddue to the domain name not being resolved indicating presence ofstale domains on the Alexa list followed by a variety of ephemeralnetwork errors including DNS lookup failures TLSSSL errors andtransport level errors (connection resetrefused) The vast majorityof PageGraph issues were triggered by PageGraphrsquos conservative

3https3amazonawscomalexa-statictop-1mcsvzip - as retrieved on Sep 24 2019

Page Abort Category Category CountNetwork Failures 5431PageGraph Issues 4051Page Navigation (15s) Timeout 3706Page Visitation (30s) Timeout 1305Total 14493

Table 2 Categories for Alexa top 100K domains visit errorsbreakdown

Category Distinct ScriptsNo IDL API Usage 177305Direct Only 787599Direct amp Resolved Only 43048Unresolved 75851Total 1083803

Table 3 Breakdown of all unique scripts from Alexa top100K crawl after the analysis

internal correctness assertions aborting the page load with a fewbreakages encountered due to the page not having the necessarycontent to generate a PageGraph The page navigation and visita-tion timeouts happened when our preset time limits during pagevisits exceeded as described in sect31

After the post-processing run on the collected trace logs weextracted and archived 11120829 JS scripts encountered in ourinstrumented Chromium browser from 84260 distinct domains ofthe 85470 domains successfully visited out of which 3222053 hada unique script hash Our use of dynamic analysis and instrumen-tation focused tightly on browser APIs limited the population ofscripts for which we had feature site data for 1083803 distinctscripts from 77423 distinct domains

7 RESULTSTable 3 summarizes the volume of scripts at each level of our analy-sis workflow In this case ldquoNo IDL API Usagerdquo means our instrumen-tation detected native function or property access (eg accessingthe global object) but that no standard IDL-defined browser fea-tures were invoked Direct only includes scripts in which all featuresites were cleared through the filtering pass of our analysis Directamp resolved only scripts include both direct and some indirect featuresites - all of which were successfully resolved through our AST-based analysis The remaining unresolved scripts show at least oneunresolved indirect feature site constituting to our set of obfuscatedscripts Note that in the following discussion when we mentionresolved scripts we refer to the set of scripts that do not containany unresolved indirect sites after our analysis Similarly whenwe refer to obfuscated scripts we refer to the set of scripts withunresolved feature sites

71 Obfuscation PrevalenceTo approximate the extent of obfuscated JS scripts through featureconcealing in the wild we consider the two values from the anal-ysis of our collected data the number of scripts with unresolved


AlexaRank Domain Unresolved Total

79633 11alivecom 55 22057593 sportunefr 49 25064969 racingjunkcom 49 29675354 kron4com 48 22347511 ovaciondigitalcomuy 47 254

Table 4 Top 5 domains by number of obfuscated scripts vstotal scripts loaded on that site

indirect feature sites out of the total analyzed script population andthe popularity of such scripts across all visited domains We findthat the majority of the scripts are without obfuscation traces inaccordance with the intuition and a relatively small population ofscripts containing obfuscated feature sites is encountered on almostall top websites

We found that of the 77423 domains for which we had scriptdata out of the Alexa top 100k only 3178 (410 ) did not loadobfuscated scripts The vast majority of these 74245 domains(9590 ) contained at least one obfuscated script In Table 4 weshow the top five web domains loading the most obfuscated scriptsalong with the total number of scripts loaded by that site orderedby the sitersquos Alexa rank We note that four out of the five sites arenews media sites (local news sporting events) Online news sitesare of course notorious for heavy use of aggressive advertising(and associated tracking) content - that such sites are the mostprolific users of observed obfuscated scripts is unsurprising

72 Context and Origin of ScriptsTo understand how and from where obfuscated scripts are loadedwe leverage metadata from our instrumentation to compare theexecution contexts source origins and loading mechanisms of bothobfuscated and resolved scripts We find that obfuscated scriptstypically come directly from 3rd-party sources despite executing ina 1st-party security context at nearly the same proportional rate asresolved scripts

Script Loading Mechanisms PageGraph[28] provides scripttype annotations which indicate how a script was loaded - viaexternal URL inline inclusion in static HTML or dynamic inlineinjection via various DOMAPIs From these annotations we discov-ered contrasting differences in how resolved and obfuscated scriptswere loaded during our experiment We saw obfuscated scriptsloaded overwhelmingly (98) via external URL (ie script tags withexplicit http(s) URL src attributes) Conversely resolved scriptsshowed more diversity of loading mechanism 59 from externalURLs 26 from inline code in HTML documents 7 generatedinline via documentwrite 5 generated inline via DOM API callsetc Obfuscated feature sites are thus heavily concentrated in exter-nal scripts (eg advertising and tracking frameworks compressedlibraries from CDNs loaded from other 3rd parties) compared toapplication specific or bootstrapping script source code directlyembedded in (or injected into) HTML documents

1st vs 3rd Party DefinedWe consider two axes of distinctionwhen analyzing how obfuscated scripts are loaded and executedMost obviously scripts may be loaded from the 1st-party domain

(ie the domain we are visiting) in contrast to some other 3rd partydomain This 1st vs 3rd party categorization of a scriptrsquos originURL (the URL the script was loaded from - if any exists) is distinctfrom the scriptrsquos security execution context which is the securityorigin as extracted from the dynamic traces mentioned in sect33

The browser enforces a Same Origin Policy (SOP) to preventdocuments and sub-documents (such as iframes) loaded from dif-ferent origins (ie URL scheme host name and port number) cannotaccess each othersrsquo DOM trees In our case we compare only theeTLD+1 (public suffix plus one subdomain eg ldquoexamplecomrdquo forldquosubexamplecomrdquo) of domain names not the full origin for 1st vs3rd party association This approach differs from the browserrsquos offi-cial SOP but it has the advantage of revealing close relationshipsbetween related subdomains

Execution ContextWe consider a script to have 1st party exe-cution context if the eTLD+1 of the runtime evaluation of Windoworiginmatches that of the visiting domain Based on this among resolvedscripts 4911 were loaded in 1st party context compared to 5075in 3rd party context Obfuscated scripts showed nearly the samebreakdown 4847 1st party to 5127 3rd party loading That ob-fuscated scripts are loaded and executed with 1st party privileges atnearly the same proportional rate as more readable code raises ques-tions about the amount of trust afforded to JS scripts deliberatelyconcealing its range of potential activity

Source Origin For our scripts we categorized scripts withsource origin URL that have the same eTLD+1 as the visiting do-main to be 1st party source origin scripts In the case of the scriptnot having a source origin URL we recursively look for the par-ent script responsible for this script either though dynamic DOMmanipulation or eval and use the source origin URL of the parentscript During this ancestral recursive walk if we encounter thatthe parent is not a script rather a document or sub-document (egiframe) - indicating the the child script was included in the docu-ment in an inline manner we simply fall back to the security originURL of the document

We found obfuscated scripts to have 3rd party source originsmore frequently than the resolved scripts In our dataset 7855of obfuscated scripts had 3rd party source origins compared to6177 for resolved scripts This disparity in favor of 3rd partysource origins again corroborates that obfuscated JS scripts typicallyoriginates from 3rd party domains rather than created and hostedlocally

73 Feature Site Obfuscation and evalGiven the long association of eval with obfuscated and maliciouscode [10 57 58] we wanted to explore the relationship betweenfeature site obfuscation and use of eval in the wild Note that ourmethodology does not attempt to classify eval use as obfuscated orunobfuscated at all so we are not attempting a direct comparisonbetween obfuscation mechanisms But we report on the overallvolume of eval activity observed to compare it with the volumeof unresolved feature sites and obfuscated scripts

In this context if a script uses eval to load another script werefer to the script performing the eval as the parent and the scriptloaded via eval as the child Out of the 1083803 distinct scriptsanalyzed we found 69163 total distinct eval children scripts from


Feature Name ObfuscatedPerc Rank

DirectPerc Rank

Elementscroll 9611 4590HTMLSelectElementremove 8715 5076Responsetext 8996 5572HTMLInputElementselect 8823 5626ServiceWorkerRegistrationupdate 8790 5767Windowscroll 9212 6436PerformanceResourceTimingtoJSON 8261 5508HTMLElementblur 9654 6976Iteratornext 8499 5864NavigatorregisterProtocolHandler 9104 6501Table 5 Top 10 API functions accessed via obfuscation

21380 total distinct eval parents When we considered only theset of obfuscated scripts we found 1901 distinct obfuscated scriptsresulting from eval (275 of all distinct children) and 5028 dis-tinct obfuscated scripts performing eval (2352 of all distinctparents) Strikingly while in the general population of analyzedscripts eval children outnumber parents more than 3 to 1 amongobfuscated scripts the relationship is reversed with eval parentsoutnumbering children more than 2 to 1 In other words obfuscatedscripts are more likely to use eval to load scripts than they are to beloaded by eval in the first place

Recall that these statistics reflect not necessarily obfuscated evalusage but all eval usage observed in our dataset While we makeno effort to classify eval usage as obfuscation or not the num-bers provide a comparative upper bound on how much eval-basedobfuscation could have existed in our dataset Even if every evalparent observed were assumed to be a case of obfuscation (which isalmost certainly not true)we still observed significantly more distinctinstances of feature site obfuscation (75851 vs 21380 ) This is in ac-cordance with the intuitive observation that eval is a well-knownfootprint of obfuscation even among static detection systems pos-sibly resulting in this shift away from eval by the obfuscation toolsand actors

74 Frequently Obfuscated APIsTo gain insight into what browser API interactions tend to be ob-fuscated in the wild we compare API function popularity acrossresolved and unresolved (ie obfuscated) feature sites

Popularity Comparison Mechanism To identify distinct ob-fuscated features (ie both function calls and property accessesmore likely to be accessed from unresolved than resolved featuresites) we first computed each distinct API feature namersquos percentilerank (ie popularity) among resolved (Pr ) and unresolved (Pu )feature sites We then compute the percentile ranks difference(|Pu minus Pr |) for each feature name giving a score that is higherwhen the feature is more popular among unresolved than resolvedfeature sites APIs showing the highest percentile rank differenceswere more likely to be used in an obfuscated way in our data How-ever since low frequency outliers in either list can radically skewthe scores we further filtered out the feature names with highestscores by removing all API features with total global access countbelow 100


DirectPerc Rank

UnderlyingSourceBasetype 9845 3089HTMLInputElementrequired 9491 5689NavigatoruserActivation 8877 5242StyleSheetdisabled 9200 5695CanvasRenderingContext2DimageSmoothingEnabled 8468 5000Documentdir 8976 5676HTMLElementtranslate 8679 5465HTMLTextAreaElementdisabled 8666 5465DocumentfullscreenEnabled 9597 6520BatteryManagerchargingTime 9107 6073

Table 6 Top 10 API properties accessed via obfuscation

Distinctly Obfuscated APIs We initially identified 923 dis-tinct API functions and 1608 distinct API properties accessed viaresolved feature sites while 320 distinct functions and 639 distinctproperties were accessed in an obfuscated manner In Table 5 wepresent the top 10 functions by gain in percentile rank ordered bydescending rank gain Out of these 10 functions 5 are associatedwith simulating user-interaction or manipulating user input formsAmong the rest are APIs associated with performance profilingJS-initiated network requests ServiceWorkers and registration ofcustom URL scheme handlers (which requires user consent) InTable 6 we present the top 10 properties by gain in percentile rankalso ordered by descending rank gain Of these 4 are associatedwith user interaction detection or other user interface manipulation4 with obscure DOM metadata and 1 with media streaming Thelast is part of the infamous BatteryManager API that was hastilydeprecated for privacy reasons [42 47]

8 OBFUSCATION TECHNIQUES IN THEWILDHaving identified the scripts with obfuscation traces we focused onthe obfuscation techniques used for concealing the feature usagesTo extract and identify the most prominent obfuscation techniquesused in real-world obfuscated scripts we build an automated systemthat processes unresolved feature sites and automatically identifiesgroups of similar feature sites Our system is capable of assigningeach cluster a score which highlights clusters that are indicative ofrepresenting a specific obfuscation technique

81 Feature Site ClusteringClustering Process To apply clustering on our unresolved featuresites we needed to extract feature vectors from our 491909 un-resolved feature sites over 75851 scripts from our analysis of theAlexa top 100K crawl data Since we were clustering the featuresites instead of the entire scripts we wanted to extract feature vec-tors from only the relevant portion of the script contiguous to thefeature site which we termed a hotspot For each unresolved featuresite of a script we parsed the script into tokens using Esprima[4]and identified the token containing the character offset of the fea-ture site The hotspot of the feature site consisted of r tokens beforeand after the containing token including the containing token itselfwhere r was the radius of the hotspot We then proceeded to createa vector from the 2r + 1 tokens of the hotspot in terms of tokentype frequencies resulting in a vector of 82 dimensions for each ofthe unresolved feature site


5 15 25 50 100Hotspot Radii

055

060

065

070

075

080

085

090

Mea

n Si

lhou

ette

Sco

re

50

75

100

125

150

175

200

225

Noise

Figure 3 Mean silhouette score vs noise percentage of DB-SCAN runs over different radii of feature site hotspots

We then applied off-the-self DBSCAN (epsilon = 05 min sam-ples = 5 distance metric = euclidean) [52] a density based partialclustering algorithm on our hotspot vectors to generate the clus-ters Since our vectors depended upon the value of the radius r we ran the clustering with different radius values Figure 3 showsthe clustering results in terms of noise percentages (percent of fea-ture sites not belonging to any cluster lower is better) and meansilhouette scores (average measure of all cluster cohesiveness andisolation out of 100 higher is better) of different radii As can beseen smaller radius values performed better as they were likelyto exclude tokens non-relevant to the obfuscated feature site intothe hotspot We selected the clustering labels for radius value 5resulting in 5741 clusters with a noise percentage of 433 and amean silhouette score of 09212

Ranking Clusters After the initial clustering we wanted tomanually inspect clusters containing prominent obfuscation tech-niques To select the candidate clusters we assigned each clustera diversity score and ranked the clusters based on this score Theidea behind the diversity score was that the popular obfuscationtechniques were applied to a large number of scripts concealingalso a large variety of API features The diversity score of a clus-ter was the harmonic mean[56] of the distinct number of scriptsand the distinct number of feature names within the cluster forall feature sites belonging to the cluster thus the value of diver-sity score for a cluster would be higher when both constitutingvalues were larger The top 20 clusters by diversity score contained65595 unique scripts with unresolved feature sites (8648 of totalunique scripts with unresolved sites) With our clusters ranked wemanually inspected 20 randomly sampled scripts from each of thetop 20 clusters having the highest diversity scores with no falsepositives We describe our findings in the following section

82 Observed Obfuscation TechniquesIn this sectionwe describe the prominent obfuscation techniquesweidentified through our inspection of the top-ranked clusters of theunresolved feature sites All of these techniques described here relyon complex logical structures and convoluted string manipulationsto hide the API functionalities being invoked We named eachtechnique based on the termwe gave the core logical structure of the

technique None of these techniquesmake use of the notorious evalthe function almost synonymous with JS obfuscation marking ashift in how JS code is obfuscated that went unnoticed The excerptsshown here are from our datasetmdashwe removed the white-spaceminification for clarity and some excerpts were curtailed for brevity(designated by the use of the ellipsis)Technique 1 FunctionalityMap This was by far the most preva-lent obfuscation technique we observed through our inspectionThis obfuscation technique begins with an array containing all pos-sible invocations throughout the script in string literal formmdashthefunctionality mapmdashfollowed by a rotation routine that manipulatesthe order of the array so that the actual indexes are only knownduring runtime This rotated mapping is then leveraged by the ac-cessor function which performs the actual lookup into the map toretrieve the particular functionality to invoke (Listing 2 in appendixC) Using the combination of these two the API functionalities areinvoked throughout the script For example this is appending aDOM node to the document body using the Documentappend APIfunctiondocument[_0x5a0e(0x3a)][_0x5a0e(0x17)]()

We found a number of variations of this technique during ourmanual inspection 1) no use of a rotating routine 2) the accessorfunction was a simple index lookup into the rotated functionalitymap and 3) no use of an accessor function the functionality mapwas accessed using direct indices in octal form We identified fouroff-the shelf JS obfuscation tools capable of one or more variationsof this technique [15 29 46 59] which coincides with the findingsof Skolka et al [53] which they termed String Array feature of thetools According to our clustering 36996 scripts contained at leastone variation of this techniqueTechnique 2 Table of Accessors This technique relies upon astring manipulation decoder function that can create the functionnames or property accesses used in the script in string literal formfrom an encoded string and an adjustment offset (Listing 3 in ap-pendix D)

Using this decoder function b(nslcLe 15) gets translated tocharAt Then using this function a table is created which consistsentirely of calls to this accessor function with the specific argu-ments to concoct the functionalities to be used in the script as a =[ b(nslcLe 15) b(msvvy 19) b(enaqbz 13)b(Qejp32Wnnwu 4) ] The rest of the script invokes APIcalls or accesses properties based on the table indices For examplethis is retrieving the document cookies through the global windowvariable window[a[130]][a[868]] We identified 22752 uniquescripts with this technique in our manually inspected clustersTechnique 3 Coordinate Munging This core of this techniqueis a decoder function and a table of numerals (coordinates) to feedinto it (Listing 4 in appendix E)

The technique then creates multiple instances of the wrapperfunctions to use throughout the script var f = (new N)d c= (new N)d Using these wrapper functions all the APIinvocations and property accesses are performed through the restof the script For example (we are using ellipsis to avoid using verylong identifiers in code) f(dR5) translates to setTimeout andis invoked using window[f(dR5)] There were 1452 uniquescripts with this technique in our inspected clusters


Technique 4 Switch-blade Function This technique is centeredaround a function consisting of a switch-case function that is re-turned using logical obfuscation that performs the duty of a decoderfunction (Listing 5 in appendix F) The technique then involves cre-ating multiple wrapper functions that act as executor for the switch-blade function (Listing 6 in appendix F) Using these executor func-tions invocations are made through the rest of the script For exam-ple windowdocument can be accessed as window[Z4EEx7K(28)]The number of unique scripts with this technique in our inspectedclusters was 1123Technique 5 Classic String Constructor This is one of the clas-sical string manipulation techniques - which relies on a decoderfunction to translate numerical values to concoct a string literalThe variations of this we observed included an offset argumentpassed in for the numerical values to translate Here are two vari-ations of the function that we observed in our clustering (both ofthem do the exact same thing with minute variations) as shown inListing 7 in appendix G

With these decoder functions setTimeout can be concoctedthrough the call z(36 151 137 152 120 141 145 137147 153 152) and subsequently the function can be invokedusing window[z(36 151 )] We found 3272 unique scriptswith both variations of this technique in our inspected clusters

9 LIMITATIONS amp DISCUSSIONOur definition of JS obfuscation and the subsequent detection sys-tem were intrinsically dependent on dynamic analysis which itselfis limited by the particular execution flow taken and thus does notreveal all the capabilities of the code In our collection methodologywe did not generate inputs or simulate human browsing behaviorso the script execution through the trace logs was not exhaustiveHowever JS code that executes on page load would always be ob-served in our system and it is this code that performs interestingand security-relevant behaviors like loading third-party widgetsand ads Exhaustive JS code coverage through execution [37] is atangential problem and out of scope of this paper

The VV8 instrumentation system explicitly traces only browserAPI interactions and global object manipulations This restrictionsuits our hypothesis and the presented obfuscation definition andanalysis well but it imposes the limitation that we cannot readilycompare the obfuscation of browser API feature sites with featuresites for built-in JS APIs (eg StringfromCharCode) Furthermoreto the best of our knowledge there does not exist a tool that can giveus full stack trace of execution for the JS API triggered which deniesus context information that limits our static analysis Howeverthe former limitation has no impact on the validation of our corehypothesis and the latter affected only a negligible fraction offeature sites in our validation experiment We do not considereither of these external limitations to impact our methodologysignificantly

Due to the hybrid nature of the analysis used in this paper it ishard to establish any concrete comparison to prior works in thefield of obfuscation detection However because of our systemrsquosnot relying on ground truth data or specific trained models oursystem does not suffer from the limitations of the large body ofprior work as we are able to detect obfuscation without having

knowledge of it in a ground truth set or even requiring retrainingof the model(s) involved

10 RELATEDWORKObfuscation Identification Prior work in this area falls intotwo major categories 1) identifying JS obfuscation as part of themalicious JS footprint and 2) identifying JS obfuscation by itselfas a transport In the first category a number of studies usedtrained classifier(s) on statically extracted features from the JSsource [16 30 31 40] But there also exist systems using non-statictechniques casual relations finding [2] string pattern based analy-sis [36] A few studies combined both static or dynamic features toidentify obfuscated malicious activity trained classifier on staticand dynamic features [14] ensemble of classifiers as a filter formalicious URLs [9] anomaly detection to identify drive-by down-load attacks [13] Our system in comparison did not rely on anyclassifier training thus avoiding the requirement for ground truthset of already labeled JS source This enables us to detect unknownobfuscations while prior systems could only identify obfuscationtraces seen in the labeled set

There are some studies which attempted to identify JS obfusca-tion by itself Choi et al performed static string pattern analysis toidentify JS obfuscation automatically on a web page level [10] TheNOFUS [33] system was built on the ideas similar to ZOZZLE [14] toclassify JS obfuscation based on static hierarchical features from theAST Similarly JSOD [3] used a Bayesian classifier on context basedfeatures from the AST to classify readably obfuscated JS scripts andSkolka et al [53] used neural network based approach on enrichedASTs to identify obfuscation and minification footprints by specifictools in JS sources Our system did not solely rely on static analysisand our detection process did not use any trained classifier butinstead we relied on the fundamental property that API usage thatwe see dynamically from a script should be evident in its sourcecode This also removes any requirement of retraining the modelswith newly available data unlike a large portion of the existingwork

Hybrid Analysis There are a few prior studies utilizing thecombination of static and dynamic analysis JStill [58] combinedruntime bytecode analysis with static information to identify obfus-cated malicious functions in JS source In contrast we used tracefootprints of JS API function calls and property access to identifyobfuscation Li et al combined both static and dynamic analysis todetect malicious JS contained in pdf files [39] Xu et al conducted ameasurement study [57] of obfuscation techniques over 1039 mali-cious samples from VirusTotal[55] In our measurement study weperformed on a much larger scale (Alexa top 100k domains) andwe did not consider the intent of the scripts we processed in termsof maliciousness

Deobfuscation A modest amount of research exists on remov-ing JS obfuscation from scripts This includes Maude a static rule-based JS rewriter to deobfuscate JS source[6] and the dynamicsemantic-based JS deobfuscator by Lu et al[41] However JSDESattempted to perform automated deobfuscation on only maliciousJS by first identifying obfuscation set manually followed by tracingthe functions capable of generating code dynamically within theobfuscated set and simulating their execution to have deobfuscated


code [1] In contrast our system used both static and dynamic anal-ysis to identify obfuscation without any detection of its intent andwas not concerned with any levels of obfuscation removal

11 CONCLUSIONIn this paper we have presented a novel definition of JS obfusca-tion in terms of browser API features concealing We centered ourdefinition on the fundamental hypothesis that if we observe someruntime functionality in a script then we should be able to stati-cally identify the responsible code that triggered that functionalityotherwise the script is hiding its runtime behavior We presenteda system to detect JS obfuscation based on our definition and vali-dated our hypothesis using this system Additionally we crawledthe Alexa top 100k domains to measure the prevalence of featureconcealing JS obfuscation in the concurrent web ecosystem Ourresults showed that a significant number of domains (9590 of thedomains we visited) contained at least one script which hides func-tionality We observed that this feature concealing effect is morewidespread than eval we demonstrated a data-driven system todiscover several previously unseen obfuscation techniques

Our work indicates a shift in how JS code conceals functionalityon the web which significantly affects current security analysistools and the effort needed from human analysts to study the webObfuscation that is based on code generation through eval is fad-ing away as more sophisticated obfuscation techniques are beingemployed Future web security measurements and tools would needto take this shift into consideration in order to deal with evolvingweb threats

ACKNOWLEDGMENTSWe thank the anonymous reviewers and our shepherd Zubair Shafiqfor their helpful feedback This work was supported by the Officeof Naval Research (ONR) under grant N00014-17-1-2541 and by theNational Science Foundation (NSF) under grant CNS-1703375

REFERENCES[1] Moataz AbdelKhalek and Ahmed Shosha JSDES An Automated De-Obfuscation

System forMalicious JavaScript In Proceedings of the 12th International Conferenceon Availability Reliability and Security - ARES 2017

[2] IA Al-Taharwa CH Mao HK Pao KP Wu C Faloutsos HM Lee SM Chenand AB Jeng Obfuscated malicious javascript detection by causal relationsfinding In Advanced Communication Technology (ICACT) 2011

[3] Ismail Adel AL-Taharwa Hahn-Ming Lee Albert B Jeng Kuo-Ping Wu Cheng-Seen Ho and Shyi-Ming Chen JSOD JavaScript obfuscation detector In Securityand Communication Networks 2015

[4] Ariya Hidayat ECMAScript parsing infrastructure for multipurpose analysishttpsesprimaorg Accessed 11-12-2019

[5] Boaz Barak Oded Goldreich Rusell Impagliazzo Steven Rudich Amit Sahai SalilVadhan and Ke Yang On the (Im)possibility of Obfuscating Programs In AnnualInternational Cryptology Conference 2001

[6] G Blanc R Ando and Y Kadobayashi Term-Rewriting Deobfuscation for StaticClient-Side Scripting Malware Detection In Mobility and Security (NTMS) 2011

[7] Brave Software Features | Brave Browser httpsbravecomfeatures Accessed11-15-2019

[8] Brave Software PageGraph Acirců bravebrave-browser Wiki httpsgithubcombravebrave-browserwikiPageGraph Accessed 11-15-2019

[9] Davide Canali Marco Cova Giovanni Vigna and Christopher Kruegel Prophiler A Fast Filter for the Large-Scale Detection of Malicious Web Pages Categoriesand Subject Descriptors In Proceedings of the International World Wide WebConference (WWW) 2011

[10] YoungHan Choi TaeGhyoon Kim SeokJin Choi and Cheolwon Lee Automaticdetection for javascript obfuscation attacks in web pages through string pattern

analysis In International Conference on Future Generation Information Technology2009

[11] Christian S Collberg and Clark Thomborson Watermarking tamper-proofingand obfuscation-tools for software protection IEEE Transactions on SoftwareEngineering 2002

[12] A Costello Punycode A Bootstring encoding of Unicode for InternationalizedDomain Names in Applications (IDNA) RFC 3492 RFC Editor March 2003

[13] Marco Cova Christopher Kruegel and Giovanni Vigna Detection and analysisof drive-by-download attacks and malicious JavaScript code In Proceedings ofthe International World Wide Web Conference (WWW) 2010

[14] Charlie Curtsinger Benjamin Livshits Benjamin G Zorn and Christian SeifertZozzle Fast and precise in-browser javascript malware detection In Proceedingsof the USENIX Security Symposium 2011

[15] DaftLogic daftlogic httpswwwdaftlogiccomprojects-online-javascript-obfuscatorhtm Accessed 05-30-2020

[16] Aurore Fass Robert P Krawczyk Michael Backes and Ben Stock Jast Fully syn-tactic detection of malicious (obfuscated) javascript In International Conferenceon Detection of Intrusions and Malware and Vulnerability Assessment 2018

[17] Ben Feinstein and Daniel Peck Caffeine monkey Automated collection detectionand analysis of malicious javascript In Black Hat USA 2007

[18] Github Catapult Project httpsgithubcomcatapult-project Accessed 11-12-2019

[19] Github CDNJS - the best front-end resource CDN for free httpscdnjscomAccessed 11-12-2019

[20] Github cdnjs September 2019 Usage Stats httpsgithubcomcdnjscf-statsblobmaster2019cdnjs_September_2019md Accessed 11-12-2019

[21] Github EScope httpsgithubcomestoolsescope Accessed 11-12-2019[22] Github JavaScript Obfuscator httpsgithubcomjavascript-obfuscator

javascript-obfuscator Accessed 11-12-2019[23] Github Puppeteer httpsgithubcomGoogleChromepuppeteer Accessed

11-12-2019[24] Github UglifyJS acircĂŞ a JavaScript parsercompressorbeautifier httpsgithub

commishooUglifyJS Accessed 11-12-2019[25] Github Web Page Replay httpsgithubcomcatapult-projectcatapulttree

masterweb_page_replay_go Accessed 11-12-2019[26] githubio Chrome DevTools Protocol Viewer httpschromedevtoolsgithubio

devtools-protocol Accessed 11-12-2019[27] F Howard Malware with your Mocha Obfuscation and antiemulation tricks in

malicious JavaScript In Sophos Technical Papers (2010) 2010[28] Umar Iqbal Peter Snyder Shitong Zhu Benjamin Livshits Zhiyun Qian and

Zubair Shafiq AdGraph A Graph-Based Approach to Ad and Tracker BlockingIn Proceedings of the IEEE Symposium on Security and Privacy 2020

[29] Javascript Obfuscator Javascript Obfuscator httpsjavascriptobfuscatorcomdefaultaspx Accessed 05-30-2020

[30] W Jiang H Wang and K Wu Method for Detecting Javascript Code Obfuscationbased onConvolutional Neural Network In International Journal of PerformabilityEngineering 2018

[31] Mehran Jodavi Mahdi Abadi and Elham Parhizkar Jsobfusdetector A binarypso-based one-class classifier ensemble to detect obfuscated javascript code InInternational Symposium on Artificial Intelligence and Signal Processing 2015

[32] Jordan Jueckstock and Alexandros Kapravelos VisibleV8 In-browser Monitor-ing of JavaScript in the Wild In Proceedings of the ACM SIGCOMM InternetMeasurement Conference (IMC) 2019

[33] Scott Kaplan Benjamin Livshits Benjamin Zorn Christian Siefert and CharlieCurtsinger NOFUS Automatically Detecting+ String fromCharCode (32)+ObFuSCateD toLowerCase ()+ JavaScript Code In Technical report TechnicalReport MSR-TR 2011ndash57 Microsoft Research 2011

[34] Alexandros Kapravelos Yan Shoshitaishvili Marco Cova Christopher Kruegeland Giovanni Vigna Revolver An automated approach to the detection ofevasive web-based malware In Proceedings of the USENIX Security Symposium2013

[35] Kaspersky Chrome 0-day exploit cve-2019-13720used in operation wizardopium httpssecurelistcomchrome-0-day-exploit-cve-2019-13720-used-in-operation-wizardopium94866 Accessed 11-11-2019

[36] Byung-Ik Kim Chae-Tae Im and Hyun-Chul Jung Suspicious malicious website detection with strength analysis of a javascript obfuscation In InternationalJournal of Advanced Science and Technology 2011

[37] Kyungtae Kim I Luk Kim Chung Hwan Kim Yonghwi Kwon Yunhui ZhengXiangyu Zhang and Dongyan Xu J-force Forced execution on javascript InProceedings of the International World Wide Web Conference (WWW) 2017

[38] Clemens Kolbitsch Benjamin Livshits Benjamin Zorn and Christian SeifertRozzle De-cloaking internet malware In Proceedings of the IEEE Symposium onSecurity and Privacy 2012

[39] Min Li Ying Zhou Min Yu and Chao Liu Combining static and dynamic analysisfor the detection of malicious JavaScript-bearing PDF documents In ComputerScience Technology and Application 2016


[40] Peter Likarish Eunjin Jung and Insoon Jo Obfuscated malicious javascriptdetection using classification techniques In 2009 4th International Conference onMalicious and Unwanted Software (MALWARE) 2009

[41] Gen Lu and Saumya Debray Automatic Simplification of Obfuscated JavaScriptCode A Semantics-Based Approach In 2012 IEEE Sixth International Conferenceon Software Security and Reliability 2012

[42] Mozilla Battery Status API httpsdevelopermozillaorgen-USdocsWebAPIBattery_Status_API Accessed 11-14-2019

[43] Mozilla Web Docs Source code submission httpsdevelopermozillaorgen-USdocsMozillaAdd-onsSource_Code_Submission Accessed 11-09-2019

[44] Mozilla Web Docs Standard Built-in Objects httpsdevelopermozillaorgen-USdocsWebJavaScriptReferenceGlobal_Objects Accessed 11-09-2019

[45] NPM JavaScript Obfuscator httpswwwnpmjscompackagejavascript-obfuscator Accessed 11-12-2019

[46] Obfuscatorio Obfuscatorio httpsobfuscatorio Accessed 05-30-2020[47] Łukasz Olejnik Gunes Acar Claude Castelluccia and Claudia Diaz The leaking

battery - A privacy analysis of the HTML5 Battery Status API Lecture Notes inComputer Science 2015

[48] Online Version JavaScript Obfuscator httpsobfuscatorio Accessed 11-12-2019

[49] Brian Pfretzschner and Lotfi ben Othmane Identification of dependency-based at-tacks on node js In Proceedings of the 12th International Conference on AvailabilityReliability and Security 2017

[50] Niels Provos Panayiotis Mavrommatis Moheeb Rajab and Fabian Monrose Allyour iframes point to us In Proceedings of the USENIX Security Symposium 2008

[51] Paruj Ratanaworabhan V Benjamin Livshits and Benjamin G Zorn NozzleA defense against heap-spraying code injection attacks In Proceedings of theUSENIX Security Symposium 2009

[52] Scitki Learn Documentation Scikit Learn DBSCAN httpsscikit-learnorgstablemodulesgeneratedsklearnclusterDBSCANhtmlsklearnclusterDBSCAN Accessed 11-12-2019

[53] Philippe Skolka Cristian-Alexandru Staicu and Michael Pradel Anything tohide studying minified and obfuscated code in the web In Proceedings of theInternational World Wide Web Conference (WWW) 2019

[54] Technology Lookup CDN Usage Distribution in the Top 1 Million Sites httpstrendsbuiltwithcomcdn Accessed 11-12-2019

[55] VirusTotal Analyze suspicious files and URLs to detect types of malware auto-matically share them with the security community httpmathworldwolframcomHarmonicMeanhtml Accessed 11-13-2019

[56] Wolfram MathWorld Harmonic Mean httpmathworldwolframcomHarmonicMeanhtml Accessed 11-12-2019

[57] Wei Xu Fangfang Zhang and Sencun Zhu The power of obfuscation techniquesin malicious JavaScript code A measurement study In International Conferenceon Malicious and Unwanted Software (MALWARE) 2012

[58] Wei Xu Fangfang Zhang and Sencun Zhu Jstill Mostly static detection ofobfuscated malicious javascript code In Proceedings of the third ACM conferenceon Data and application security and privacy - CODASPY 2013

[59] ZS Wang jfogs httpsgithubcomzswangjfogs Accessed 05-30-2020

APPENDIXA cdnjs Libraries after Filtering

Library Name SemanticVersion Specific File Downloads

jQuery 331 jqueryminjs 43749305jQuery Mousewheel 3113 jqueymousewheelminjs 36966724lodashjs 41711 lodashcoreminjs 28930715jQuery Cookie 141 jquerycookieminjs 13208301json3 332 json3minjs 8570063modernizr 283 modernizrminjs 8404457popperjs 1129 popperminjs 6781952underscorejs 183 underscore-minjs 6714896twitter-bootstrap 337 bootstrapminjs 4960813mobile-detect 143 mobile-detectminjs 4638880jquery-ui 311 jquery-uiminjs 4321998postscribe 208 postscribeminjs 4240441swiper 450 swiperminjs 4202031jquerylazyload 191 jquerylazyloadminjs 4190760clipboardjs 200 clipboardminjs 4131558Table 7 Top 15 cdnjs libraries by download after filtering

B Library Hash Matches in Alexa Top 100K Data

Library Matching Domainsjquery 27366twitter-bootstrap 8077jqueryui 1253Swiper 1094jquery-mousewheel 599modernizr 581clipboardjs 513jquery-cookie 477underscorejs 452mobile-detect 338postscribe 148jquerylazyload 112json3 36lodashjs 8popperjs 1Total 41055

Table 8 Libraries with their corresponding SHA-256 hashmatches for all semantic versions in our Alexa top 100Kdataset


C Code Listing for Obfuscation Technique 11 var _0x3866 = [object date forEach]2 The rotaion routine3 (function(_0x1d538b _0x59d6af) 4 var _0xf0ddbf = function(_0x6dddcd) 5 while (--_0x6dddcd) 6 _0x1d538b[push](_0x1d538b[shift]())7 8 9 _0xf0ddbf(++_0x59d6af)10 (_0x3866 0xf4))1112 The accessor function13 var _0x5a0e = function(_0x31af49 _0x3a42ac) 14 _0x31af49 = _0x31af49 - 0x015 var _0x526b8b = _0x3866[_0x31af49]16 return _0x526b8b17

Listing 2 The functionality map and following rotationroutine and the accessor function for technique 1

D Code Listing for Obfuscation Technique 21 var Ha = [a-z]2 Ia = [A-Z]3 b = function(a b) 4 if (null == b ampamp (b = 13) b = Number(b) a = String(a)

rarr 0 == b) return a5 0 gt b ampamp (b += 26)6 for (var c g f h = alength e = -1 d = ++e lt h)

rarr c = acharAt(e) Hatest(c) 7 (g = abcdefghijklmnopqrstuvwxyzindexOf(c)8 f = (g + b) 26 d += abcdefghijklmnopqrstuvwxyz

rarr charAt(f)) Iatest(c) 9 (g = ABCDEFGHIJKLMNOPQRSTUVWXYZindexOf(c)10 f = (g + b) 26 d +=

rarr ABCDEFGHIJKLMNOPQRSTUVWXYZcharAt(f)) d += c11 return d12

Listing 3 The decoder function for technique 2

E Code Listing for Obfuscation Technique 31 var a = [17 95 94 86 24 63 2 0 1423857449 ]2 function() 3 function N() 4 var c = WI1R2D5dv3Xzw8Cssplit()5 thisd = function(d) 6 if (null == d || void 0 == d) return d7 if (dlength a[6] = a[7]) throw Error(1100)8 for (var e = [] f = a[7] f lt dlength f++) 9 f a[6] == a[7] ampamp epush()10 for (var g = c K = a[7] K lt glength K++)11 if (dcharAt(f) == g[K]) 12 epush(KtoString(a[50]))13 break14 15 16 return decodeURIComponent(ejoin())17 18


F Code Listing for Obfuscation Technique 41 Z4EEm7K = function() 2 var u7K = 23 for ( u7K == 1) 4 switch (u7K) 5 case 26 return 7 B6Q function(h6Q) 8 var C7K = 29 for ( C7K == 10) 10 switch (C7K) 11 case 512 var t6Q = 013 c6Q = 014 C7K = 415 break16 17 ()

Listing 5 The switch-blade function for technique 4

1 Z4EEO7K = function() 2 return typeof Z4EEm7KB6Q === function Z4EEm7KB6Q

rarr apply(Z4EEm7K arguments) Z4EEm7KB6Q3 4 5 Z4EEx7K = function() 6 return typeof Z4EEm7KB6Q === function Z4EEm7KB6Q

rarr apply(Z4EEm7K arguments) Z4EEm7KB6Q7

Listing 6 The executor functions

F Code Listing for Obfuscation Technique 51 function Z(I) 2 var l = argumentslength3 O = []4 S = 15 while (S lt l) O[S - 1] = arguments[S++] - I6 return StringfromCharCodeapply(String O)7 8 function z(I) 9 var l = argumentslength10 O = []11 for (var S = 1 S lt l ++S) Opush(arguments[S] - I)12 return StringfromCharCodeapply(String O)13

Listing 7 The string decoder function variations

Abstract

1 Introduction

2 Defining Obfuscation

3 Data Collection Architecture

31 Web Crawler

32 Instrumented Browser

33 Log Consumer

4 Detecting Obfuscation

41 Filtering Pass

42 Abstract Syntax Tree Analysis

5 Validating the Hypothesis

51 Candidate Selection

52 Record amp Replay Webpages

53 Validation Results

6 Alexa Top 100K Data Collection

7 Results

71 Obfuscation Prevalence

72 Context and Origin of Scripts

73 Feature Site Obfuscation and eval

74 Frequently Obfuscated APIs

8 Obfuscation Techniques in the Wild

81 Feature Site Clustering

82 Observed Obfuscation Techniques

9 Limitations amp Discussion

10 Related Work

11 Conclusion

Acknowledgments

References


apply our system on pages collected over the Alexa top 100k do-mains to conduct a large-scale measurement study of JS obfuscationon the web ecosystem - including measuring traces of JS featureconcealing obfuscation on the web the origin of such obfuscatedscripts throughout the web most frequently obfuscated JS featuresthrough concealing and state of the art obfuscation techniquesused by real-world JS scripts that do not rely on eval In summaryour main contributions include

bull We propose a novel approach of identifying JS obfuscationthrough feature concealing based on the intuition that theruntime behavior of code should be evident by static analysison source code

bull We present results from a large-scale analysis of the JS fea-ture obfuscation effect in the wild We crawled the Alexa top100k domains and quantified this obfuscation pinpointedthe sources of obfuscation and measured the hidden func-tionality that lies behind such obfuscated API function callsand JS property accesses

bull We identify prominent new obfuscation techniques througha data-driven approach and present case studies that nolonger rely on the notorious eval function

2 DEFINING OBFUSCATIONBarak et al defined program obfuscation as The goal of programobfuscation is to make a program unintelligible while preserving itsfunctionality [5] Extrapolating from this definition JS obfuscationin its simplest form can be defined as when the intentional behaviorof a script cannot be fully realized until execution

Our hypothesis is that if the interactions of code with its underlyingsystem cannot be deduced from static analysis of its source code thenthe code is concealing these interactions and thus can be consideredobfuscated

Our goal with this new definition of obfuscation is to expressthe level of difficulty that static analysis tools and human analystswill have when analyzing code Notice that our definition of obfus-cation is more relaxed and can include unintended concealing offunctionality for example through heavy modification of sourcecode via minification (ie rewriting the source code to be morecompact without changing its functionality) However dynamicanalysis is dependent on the execution flow and thus is not exhaus-tive when it comes to reveal all the capabilities of the code Thiscan be alleviated through techniques such as force execution [37]which is out of scope for this paper

In this work we focus on obfuscation on the web where the codeis JavaScript and the system is the browser Interactions betweenthe code (JS) and the system (browser) are defined as invocationsof browser API features (property accesses or function calls) Forthe rest of the paper when we refer to JS obfuscation it is in termsof this hypothesis and this behavior of concealing browser APIfeatures

We describe our data collection system architecture and the sub-sequent analysis to detect this obfuscation in sect3 and sect4 respectivelyfollowed by a validation of our detection system in sect5 We then ap-ply our detection system to collect and identify obfuscation tracesfrom the Alexa top 100K in sect6 and present our findings in sect7 andsect8

3 DATA COLLECTION ARCHITECTURETo detect JS obfuscation traces in accordance with our definitionwe implemented an automated web crawling system capable ofcollecting JS script execution traces as depicted in Figure 1 and asubsequent post-processing system for the collected traces

31 Web CrawlerOur data collection worker from Figure 1 is a Dockerized containerconsisting of a web crawler and a log consumer The web crawlerdescribed in this section uses a custom build of Chromium pro-ducing JS execution traces in the form of log files (sect32) The logconsumer is a Go-based tool to compress the trace logs and archivethem after a page visit is completed (sect33)

The web crawler is based on the NodeJS library Puppeteer [23]which automates the Chromium browser and communicates withthe browser over the Chrome DevTools protocol [26] The datacollection worker pulls a job with a domain from a Redis queue andproceeds to visit the domainrsquos webpage For each domain usingPuppeteer our crawler launches an instance of our custom buildof Chromium in headless mode opens a tab and navigates to theURL concocted by prepending the domain with http whilemonitoring the progress of the visit We set a fixed 15 second timelimit for navigating to the page and upon finishing navigation weremain on the page for an additional amount of time to let the pageload any resources as needed In our system both the navigationand the subsequent loitering on the page are also limited to a total of30 seconds beginning from the start of the navigation after whichwe tear down the page and the tab followed by the closing of thebrowser instance

During the visit we collect a number of auxiliary information foreach visit - including all network requests made the headers andbodies of all HTTP resources downloaded and all script sourcesparsed through the debugger using standardDevTools protocol com-mands and event handlers over Puppeteer All of this informationare stored into a MongoDB document storage in an asynchronousmanner (as encountered) during the visit of the page

32 Instrumented BrowserBased on our hypothesis to identify the presence of obfuscationconcealing browser API features in JS source code we need toperform dynamic analysis to trace the runtime behavior of a JSscript and compare that behavior to a static analysis of the sourcetext We use VisibleV8 [32] a custom open-source variant of theV8 JS engine powering the Chromium browser to capture browserAPI accesses Much like the Linux strace tool logs Linux systemcalls made by native applications VisibleV8 (VV8) logs browserAPI function calls and property accesses made by JS applicationsVV8rsquos in-browser architecture provides significant coverage androbustness advantages over alternative approaches to JS instru-mentation that rely on JS script injection prototype patching orbrowser extensions Furthermore the fact that VV8rsquos instrumenta-tion is embedded invisibly inside the JS engine itself provides moreintrinsic stealth than generic prototype patching JS performanceunder VV8 can be significantly lower than that of stock Chromium


Instrumented V8


Session



Files

Crawler Worker

Log Consumer



MongoDBArchives

CompressedLog Data

CandidateDomains

MetaInformation

Web













FilteringPass

ASTAnalysis


Indirect Sites

Direct Sites


Unobfuscated Sites

Obfuscated Sites












































































DirectPerc Rank







DirectPerc Rank








055

060

065

070

075

080

085

090

Mea

n Si

lhou

ette

Sco

re

50

75

100

125

150

175

200

225

Noise
















































































































Abstract

1 Introduction



31 Web Crawler


33 Log Consumer


41 Filtering Pass







7 Results









10 Related Work

11 Conclusion

Acknowledgments

References


Instrumented V8


Session



Files

Crawler Worker

Log Consumer



MongoDBArchives

CompressedLog Data

CandidateDomains

MetaInformation

Web













FilteringPass

ASTAnalysis


Indirect Sites

Direct Sites


Unobfuscated Sites

Obfuscated Sites












































































DirectPerc Rank







DirectPerc Rank








055

060

065

070

075

080

085

090

Mea

n Si

lhou

ette

Sco

re

50

75

100

125

150

175

200

225

Noise
















































































































Abstract

1 Introduction



31 Web Crawler


33 Log Consumer


41 Filtering Pass







7 Results









10 Related Work

11 Conclusion

Acknowledgments

References


FilteringPass

ASTAnalysis


Indirect Sites

Direct Sites


Unobfuscated Sites

Obfuscated Sites












































































DirectPerc Rank







DirectPerc Rank








055

060

065

070

075

080

085

090

Mea

n Si

lhou

ette

Sco

re

50

75

100

125

150

175

200

225

Noise
















































































































Abstract

1 Introduction



31 Web Crawler


33 Log Consumer


41 Filtering Pass







7 Results









10 Related Work

11 Conclusion

Acknowledgments

References































































DirectPerc Rank







DirectPerc Rank








055

060

065

070

075

080

085

090

Mea

n Si

lhou

ette

Sco

re

50

75

100

125

150

175

200

225

Noise
















































































































Abstract

1 Introduction



31 Web Crawler


33 Log Consumer


41 Filtering Pass







7 Results









10 Related Work

11 Conclusion

Acknowledgments

References



055

060

065

070

075

080

085

090

Mea

n Si

lhou

ette

Sco

re

50

75

100

125

150

175

200

225

Noise
















































































































Abstract

1 Introduction



31 Web Crawler


33 Log Consumer


41 Filtering Pass







7 Results









10 Related Work

11 Conclusion

Acknowledgments

References








































































































Abstract

1 Introduction



31 Web Crawler


33 Log Consumer


41 Filtering Pass







7 Results









10 Related Work

11 Conclusion

Acknowledgments

References

hiding in plain site: detecting javascript obfuscation ...of javascript (js), source code is...

Documents