
Locality-Sensitive Hashing for Efficient Web Application Security Testing

Ilan Ben-Bassat1 and Erez Rokah2

1 School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
2 IBM, Herzliya, Israel

[email protected], [email protected]

Keywords: Security Testing, Automated Crawling, Rich Internet Applications, DOM Similarity, Locality-Sensitive Hashing, MinHash.

Abstract: Web application security has become a major concern in recent years, as more and more content and services are available online. A useful method for identifying security vulnerabilities is black-box testing, which relies on an automated crawling of web applications. However, crawling Rich Internet Applications (RIAs) is a very challenging task. One of the key obstacles crawlers face is the state similarity problem: how to determine if two client-side states are equivalent. As current methods do not completely solve this problem, a successful scan of many real-world RIAs is still not possible. We present a novel approach to detect redundant content for security testing purposes. The algorithm applies locality-sensitive hashing using MinHash sketches in order to analyze the Document Object Model (DOM) structure of web pages, and to efficiently estimate similarity between them. Our experimental results show that this approach allows a successful scan of RIAs that cannot be crawled otherwise.

1 INTRODUCTION

The information era has turned the Internet into a central part of modern life. Growing amounts of data and services are available today, making web application development an important skill for both enterprises and individuals. The heavier the reliance on web applications, the higher the motivation of attackers to exploit security vulnerabilities. Unfortunately, such vulnerabilities are very common (Gordeychik et al., 2010), leading to an increasing need to detect and remediate them.

Black-box security scanners address the problem of ensuring secured web applications. They simulate the behavior of a user crawling an application without access to the source code. When discovering new pages and new content, the scanner performs a series of automated security tests, based on a database of known vulnerabilities. This way, the application can be easily and thoroughly analyzed from a security point of view. See (Bau et al., 2010; Doupé et al., 2010) for surveys on available scanners.

It is clear that a comprehensive security scan requires, among other factors, both high coverage and efficiency. If the crawler cannot reach significant parts of the application in a reasonable time, then the security assessment of the application will be incomplete.

In particular, a web crawler should refrain from wasting time and memory resources on scanning pages that are similar to previously visited ones. The definition of redundant pages depends on the crawling purpose. For traditional web indexing purposes, different contents imply different pages. However, for security testing purposes, different sets of vulnerabilities imply different pages, regardless of the exact content of the pages.

While the problem of page similarity (or state similarity) is a fundamental challenge for all web crawlers (Pomikálek, 2011), it has become even more significant since the emergence of a new generation of web applications, often called Rich Internet Applications (RIAs).

RIAs make extensive use of technologies such as AJAX (Asynchronous JavaScript and XML) (Garrett et al., 2005), which enable client-side processing of data that is asynchronously sent to and from the server. Thus, the content of a web page changes dynamically without even reloading the page. New data from the server can be reflected to the user by modifying the DOM (Nicol et al., 2001) of the page through UI events, even though no change is made to the URL itself. Although such technologies increase the responsiveness of web applications and make them more user-friendly, they also pose a great difficulty for modern web crawlers. While traditional crawlers focused on visiting all possible URLs, current-day crawlers should examine every client state that can be generated via the execution of a series of HTML events.

The exponentially large number of possible states emphasizes the need to efficiently detect similarity between different application states. Failing to solve the state similarity problem results in poor scan quality. If the algorithm is too strict in deciding whether two states are similar, then too many redundant states will be crawled, yielding long or even endless scans. On the other hand, if the resemblance relation is too lax, the crawler might mistakenly deem new states as near-duplicates of previously-seen states, resulting in incomplete coverage.

As stated before, every URL of a RIA is actually a dynamic application, in which states are accessed through UI events. Consequently, eliminating duplicate content cannot rely solely on URL analysis, such as in (Bar-Yossef et al., 2009). Instead, a common method that black-box scanners use in order to decide whether two pages are similar is to analyze their DOM structure and properties, without taking the text content into account. This is a useful approach for security-oriented crawling, since the text content of a page is often irrelevant to the security assessment. As opposed to the content, the elements and logic that allow user interaction are more relevant for a security analysis. For example, consider a commercial web application with a catalog section containing thousands of pages. The catalog pages share the same template, yet every one of them is dedicated to a different product. While the variety of products may be of interest for search engines like Google or Yahoo, it is of no great importance to black-box security scanners, as all pages probably share the same set of security vulnerabilities. Therefore, an analysis of the DOM structures of the pages might suggest that they are near-duplicates of each other.

In the past decade, several methods have been proposed to determine similarity between DOMs (see section 2). Most of the methods include DOM normalization techniques, possibly followed by applying simple hash functions. While these methods were proven to be successful in scanning several RIAs, they are often too strict, leading to a state explosion problem and very long scans. Moreover, since hash functions used in this context so far have, to the best of our knowledge, no special properties, minor differences between two canonical DOM forms might result in two completely different hash values. Therefore, many complex applications still cannot be crawled in a reasonable time. Approaches involving distance measures, such as (Mesbah and Van Deursen, 2008), are not scalable and require computing distances between all pairs of DOMs.

In this paper we present a different approach to the state similarity problem for security-oriented crawling. Our approach is based on locality-sensitive hashing (LSH) (Indyk and Motwani, 1998), and on MinHash sketches in particular (Broder, 1997; Broder et al., 2000). A locality-sensitive hash function satisfies the property that the probability of a collision, or at least of close numerical hash values, is much higher for similar objects than for other objects. LSH schemes have already been used in the context of detecting duplicate textual content (see (Pomikálek, 2011) for a survey). Recently, MinHash sketches have also become an efficient solution for bioinformatic challenges, where large amounts of biological sequence data require efficient computational methods. As such, they can be used to detect inexact matches between genomic sequences (Berlin et al., 2015; Popic and Batzoglou, 2016). However, LSH techniques have not yet been used by black-box security scanners. The flexibility of MinHash sketches enables detecting duplicate states when two DOMs differ in a small number of DOM elements, regardless of their type. The LSH technique that we use makes the algorithm scalable for large RIAs as well. Combined together, these properties make exploring industrial RIAs feasible. This paper is an extended version of the one that appeared in the proceedings of ICISSP 2019 (Ben-Bassat and Rokah, 2019).

2 RELATED WORK

A common approach for efficient detection of duplicate states in AJAX applications is to apply simple hash functions to the DOM string representation. The authors of (Duda et al., 2009) and (Frey, 2007) compute hash values according to the structure and content of the state. However, although these methods can remove redundant copies of the same state, they are too strict for the purpose of detecting near-duplicate pages.

CRAWLJAX (Mesbah and Van Deursen, 2008) is a crawler for AJAX applications that decides whether two states are similar according to the Levenshtein distance (Levenshtein, 1966) between the DOM trees. Using this method to compute distances between all pairs of possible states, however, is infeasible in large RIAs. In a later work (Roest et al., 2010), the state equivalence mechanism of the algorithm is improved by first applying an Oracle Comparator Pipelining (OCP) before hash values are computed. Each comparator strips from the DOM string an irrelevant substring that might cause meaningless differences between two states, e.g., time stamps or advertisement banners. CRAWLJAX is therefore less strict in comparing application states, but the comparator pipeline requires manual configuration and adjustment for every scan. FEEDEX (Fard and Mesbah, 2013), which is built on top of CRAWLJAX, uses tree edit distance (Tai, 1979) to compare two DOM trees.

The notion of Jaccard similarity is used in jAk (Pellegrino et al., 2015). In this paper, the authors consider two pages as similar if they share the same normalized URL and their Jaccard index is above a certain threshold. A normalized URL is obtained after stripping the query values and sorting the query parameters lexicographically. The Jaccard similarity is computed between the sets of JavaScript events, links, and HTML forms that appear in the web pages.

Two techniques to improve any DOM-based state equivalence mechanism are presented in (Choudhary et al., 2012). The first aims to discover unnecessary dynamic content by loading and reloading a page. The second identifies session parameters and ignores differences in requests and responses due to them.

The DOM uniqueness tool described in (Ayoub et al., 2013) identifies pages with a similar DOM structure by detecting repeating patterns and reducing them to a canonical DOM representation, which is then hashed into a single numerical value. The user can configure the algorithm and determine which HTML tags are included in the canonical representation, and whether to include their text content. This method captures structural changes, such as additions or deletions of rows in a table. It is also not affected by element shuffling, since it involves sorting the elements in every DOM subtree. However, modifications that are not recognized as part of a structural pattern lead to a false separation between two near-duplicate states. The method in (Moosavi et al., 2014) further extends this algorithm by splitting a DOM tree into multiple subtrees, each corresponding to an independent application component, e.g., widgets. The DOM uniqueness algorithm is applied to every component independently, thus avoiding an explosion in the number of possible states when the same data is being displayed in different combinations.

The structure of a page is also the key for clustering similar pages in (Doupé et al., 2012), in which a model of the web application's state machine is built. The authors model a page using its links (anchors and forms), and store this information in a prefix tree. These trees are vectorized and then stored in another prefix tree, called the Abstract Page Tree (APT). Similar pages are found by analyzing subtrees of the APT.

A different approach for clustering application states into equivalence clusters appears in software tools that model and test RIAs using execution traces, such as RE-RIA (Amalfitano et al., 2008), CrawlRIA (Amalfitano et al., 2010b), and CreRIA (Amalfitano et al., 2010a). The clustering is done by evaluating several equivalence criteria, which depend on the DOM's set of elements, event listeners, and event handlers. Two DOMs are considered equivalent if one set contains the other as a subset. This method has a high memory consumption and computation time.

Research efforts have been made over the years to detect near-duplicate text, especially in the context of the Web. As the general problem of duplicate content detection is not the focus of this paper, we refer the readers to a detailed survey for more information on the subject (Pomikálek, 2011).

3 MINHASH SKETCHES

MinHash is an efficient technique to estimate the Jaccard similarity between two sets. Given two sets, A and B, the Jaccard similarity (Jaccard, 1901) of the sets, J(A,B), is a measure of how similar the sets are:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}    (1)

The Jaccard similarity value ranges between 0 (disjoint sets) and 1 (equal sets). However, direct computation of this ratio requires iterating over all the elements of A and B. MinHash (Broder, 1997) is an efficient method to estimate the Jaccard similarity of two sets without explicitly constructing their intersection and union. The efficiency of the method comes from reducing every set to a small number of fingerprints.

Let S be a set, and let h be a hash function whose domain includes the elements of S. We define h_min(S) as the element a of S that is mapped to the minimum value among all the elements of S:

h_{\min}(S) = \arg\min_{a \in S} h(a)    (2)

We consider again two sets of elements, A and B. One can easily verify that:

\Pr[h_{\min}(A) = h_{\min}(B)] = J(A, B)    (3)

Indeed, when h induces a uniformly random ordering of the elements with no ties, the minimum-valued element of A ∪ B is equally likely to be any of its elements, and h_min(A) = h_min(B) holds exactly when that element lies in A ∩ B.

Eq. 3 implies that if we define a random variable R_{A,B} as follows:

R_{A,B} = \begin{cases} 1 & \text{if } h_{\min}(A) = h_{\min}(B) \\ 0 & \text{otherwise} \end{cases}    (4)

then R_{A,B} is an indicator variable which satisfies E[R_{A,B}] = J(A,B). However, R_{A,B} is either 0 or 1, so by itself it is not useful as an estimator of the Jaccard similarity of A and B. As is done in many estimation techniques, it is possible to reduce the variance of R_{A,B} by using multiple hash functions and computing the fraction of the hash functions for which the minimum element is the same. In other words, assume now a set of ℓ hash functions, h_1, ..., h_ℓ. For each 1 ≤ i ≤ ℓ we define R^{(i)}_{A,B} as in Eq. 4, replacing h with h_i. The improved estimator for the Jaccard similarity of A and B, T(A,B), is now given by:

T(A, B) = \frac{\sum_{i=1}^{\ell} R^{(i)}_{A,B}}{\ell}    (5)

By a Chernoff bound, it can be shown that the expected error is O(1/\sqrt{\ell}).

The compressed representation ⟨h^1_min(S), ..., h^ℓ_min(S)⟩ of a set S is defined as its MinHash sketch. Since ℓ is significantly smaller than the size of S, the estimation of the Jaccard similarity is efficient in both time and memory.
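To make the estimator of Eq. 5 concrete, here is a minimal Python sketch, written under simplifying assumptions: the salted CRC32 hash family, the seeds, and the example sets are our own illustrative choices, not part of the paper's implementation. It computes MinHash sketches for two sets and estimates their Jaccard similarity by the fraction of matching minima.

    import random
    import zlib

    def make_hash_functions(ell, seed=0):
        # Sample ell hash functions; h_i(x) is the CRC32 of the element salted with a per-function value.
        rng = random.Random(seed)
        salts = [rng.getrandbits(32) for _ in range(ell)]
        return [lambda x, s=s: zlib.crc32(repr((s, x)).encode()) for s in salts]

    def minhash_sketch(elements, hash_functions):
        # The sketch stores, for every hash function, the minimum hash value over the set.
        return [min(h(x) for x in elements) for h in hash_functions]

    def estimate_jaccard(sketch_a, sketch_b):
        # Estimator T of Eq. 5: the fraction of hash functions whose minima coincide.
        matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
        return matches / len(sketch_a)

    hash_functions = make_hash_functions(200)
    A = {"elem%d" % i for i in range(100)}
    B = {"elem%d" % i for i in range(20, 120)}  # true Jaccard similarity: 80/120, about 0.67
    print(estimate_jaccard(minhash_sketch(A, hash_functions),
                           minhash_sketch(B, hash_functions)))

With 200 hash functions, the printed estimate typically lands close to 0.67, in line with the O(1/\sqrt{\ell}) error bound above.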

4 METHOD

In this section we present the complete algorithm for detecting duplicate content during a black-box security scan. The first step is transforming a web page, given as an HTTP response string, into a set of shingles, or k-mers. In the second step, the algorithm uses MinHash sketches in order to efficiently compute similarity between pages. However, the method described in section 3 still requires that we compute the similarity between all possible pairs of states. Instead, we use an efficient LSH approach that focuses only on pairs that are potentially similar. We outline the complete algorithm at the end of this section.

4.1 Set Representation of Web Pages

As stated in section 1, the text content of a web page is usually irrelevant for the state similarity problem in the context of security scans. Therefore, we rely on the DOM representation of the page. The algorithm extracts the DOM tree from the HTTP response and serializes it to a string. The relevant information for the state similarity task includes the DOM elements and their structure, the events and the event handlers that are associated with the elements, as well as some of their attributes. Yet, for simplicity reasons, in this paper we only consider the names of the elements and the structure of the DOM.

Since we are interested in using MinHash, there is a need to transform the string representation of the DOM into a set of elements. For this purpose, we use the notion of shingles, or k-mers, which are sequences of any k consecutive words. In this case, since the DOM elements are the building blocks of the DOM tree, we consider every element as a word. The algorithm can filter out some of the DOM elements if they are marked as irrelevant. Figure 1 illustrates the process of constructing the set representation of an HTTP response.

A key factor in the performance of the algorithm is the choice of the value of k. If k is too small, then many k-mers will appear in all web pages, and most of the pages will be similar to each other. Conversely, if the value of k becomes too high, the comparison of states is too strict. So, k should have a large enough value such that the probability of different states sharing the same k-mer is low.
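As an illustration of this shingling step, the following Python sketch builds the k-mer set of a page from the names of its DOM elements. It is a simplified stand-in for the paper's pre-processing: it parses raw HTML with Python's html.parser instead of AppScan's DOM extraction, and the ElementCollector class and the choice of ignored tags are our own assumptions.

    from html.parser import HTMLParser

    class ElementCollector(HTMLParser):
        # Records the element (tag) names of a page in document order, ignoring text content.
        def __init__(self, ignored=()):
            super().__init__()
            self.ignored = set(ignored)
            self.elements = []

        def handle_starttag(self, tag, attrs):
            if tag not in self.ignored:
                self.elements.append(tag)

    def shingle(html, k, ignored=("script", "style")):
        # Set representation of the page: every window of k consecutive element names.
        collector = ElementCollector(ignored)
        collector.feed(html)
        seq = collector.elements
        return {tuple(seq[i:i + k]) for i in range(max(0, len(seq) - k + 1))}

    # Two pages that differ only in their text content yield the same shingle set.
    page1 = "<html><body><div><a href='/x'>item one</a></div></body></html>"
    page2 = "<html><body><div><a href='/y'>another item</a></div></body></html>"
    print(shingle(page1, 3) == shingle(page2, 3))  # True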

4.2 Efficient LSH for MinHash Sketches

To accelerate the process of detecting duplicate content, we use a hash table indexing approach, which rearranges the sketches in a way that similar states are more likely to be stored closely in our data structure. Dissimilar states, on the other hand, are not stored together, or are rarely stored in proximity. Such an approach is also used in other domains, e.g., genome assembly (Berlin et al., 2015).

The algorithm constructs a data structure of ℓ hash tables, each corresponding to a different hash function. In the i-th hash table, we map every hash value v, which was computed by the i-th hash function, to the set of DOM IDs that correspond to DOMs hashed to this value. So, if we denote the set of all DOMs by 𝒫, and for every set representation P ∈ 𝒫 we denote its ID by ID(P), then the i-th table is a mapping of the following form:

v \mapsto \{\, ID(P) \mid P \in \mathcal{P} \wedge h^i_{\min}(P) = v \,\}    (6)

Using Eq. 3, we can see that the higher the Jaccard similarity of two DOMs is, the higher the probability of their IDs being in the same set (or bucket). Given two DOM representations, P and P′, we get by Eq. 5 that an estimator for the Jaccard similarity J(P, P′) can be computed according to the number of hash tables for which ID(P) and ID(P′) are in the same bucket. This follows directly from the fact that for every 1 ≤ i ≤ ℓ, if the i-th hash table contains both ID(P) and ID(P′) in the same bucket, then it holds that h^i_min(P) = h^i_min(P′). So, we can derive an estimator, T*, for this particular LSH scheme, which is based on Eq. 5.

Figure 1: Constructing a set representation of an HTTP response. (a) HTTP response snippet. The elements of the DOM tree are marked in red. (b) Serialization of the corresponding DOM tree to a string. (c) The set of all 5-mers of consecutive DOM elements. This is the set representation of the response for k = 5.

Let us denote the i-th hash table by g_i, 1 ≤ i ≤ ℓ, and let us mark the bucket which is mapped to hash value v in the i-th table by g_i(v). We then have:

T^*(P, P') = \frac{\bigl|\{\, i \mid \exists v.\ \{ID(P), ID(P')\} \subseteq g_i(v) \,\}\bigr|}{\ell}    (7)

Following Eq. 7, there is no need to compare every DOM to all other DOMs. For every pair of DOM representations which share no common bucket, the estimated value of their Jaccard similarity is 0.
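A minimal Python sketch of this data structure follows; the class name LshIndex and its interface are our own, not the paper's. Each of the ℓ tables maps a minimum hash value to the set of DOM IDs stored under it, and the similarity of a new sketch to every previously indexed DOM is estimated by counting shared buckets, as in Eq. 7.

    from collections import defaultdict

    class LshIndex:
        # One bucket table g_i per hash function; g_i maps h^i_min(P) -> set of DOM IDs.
        def __init__(self, ell):
            self.ell = ell
            self.tables = [defaultdict(set) for _ in range(ell)]

        def add(self, doc_id, sketch):
            # sketch[i] holds h^i_min(P) for the DOM identified by doc_id.
            for table, v in zip(self.tables, sketch):
                table[v].add(doc_id)

        def similarities(self, sketch):
            # Estimator T* of Eq. 7: for every stored DOM, the fraction of tables in which it
            # shares a bucket with the query sketch. DOMs sharing no bucket are simply absent,
            # which corresponds to an estimated similarity of 0.
            counts = defaultdict(int)
            for table, v in zip(self.tables, sketch):
                for doc_id in table.get(v, ()):
                    counts[doc_id] += 1
            return {doc_id: c / self.ell for doc_id, c in counts.items()}

Only DOMs that collide with the query sketch in at least one table are ever touched, which is what makes the scheme scale to large numbers of states.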

4.3 The Complete Algorithm

We now combine the shingling process and the LSH technique, and describe the complete algorithm for detecting duplicate content during a black-box security scan. For simplicity, we assume a given set of application states, S. With this assumption in mind, we can omit the parts that parse web pages and extract links and JavaScript actions. A pseudo-code of the algorithm is given in Algorithm 1.

In an initialization phase (lines 1-3), the code generates ℓ hash functions out of a family H of hash functions. For each hash function, h_i, a corresponding hash table, g_i, is allocated in an array of hash tables. The rest of the code (lines 4-20) describes the crawling process and the creation of an index of non-duplicate pages.

For each application state, s, the algorithm creates its DOM set representation, P, using the method Shingle(s, k). The method extracts the DOM tree of the web page of state s, filters out all text and irrelevant elements, and converts the list of DOM elements into a set of overlapping k-consecutive elements. We omit the description of this method from the pseudo-code.

Lines 6-13 analyze the DOM set representation, P. A MinHash sketch of the set of shingles is computed by evaluating all hash functions. While doing so, we maintain a count of how many times every DOM ID shares the same bucket with the currently analyzed DOM, P. This is done using a mapping, f, which is implemented as a hash table as well, with constant-time insertion and lookup (on average). The highest estimated Jaccard similarity score is then found (line 14). If this score is lower than the minimum threshold, τ, then application state s is considered to have new content for the purpose of the scan. In such a case, it is added to the index, and the data set of hash values is updated by adding the MinHash sketch of the new state. Otherwise, it is discarded as duplicate content.

The state equivalence mechanism described here is not an equivalence relation, since it is not transitive. This fact implies that the order in which we analyze the states can influence the number of unique states found. For example, for every τ and every odd integer m, we can construct a series of application states s_1, ..., s_m such that J(s_i, s_{i+1}) ≥ τ for every 1 ≤ i ≤ m−1, but J(s_i, s_j) < τ for every 1 ≤ i, j ≤ m such that |i − j| ≥ 2. Algorithm 1 outputs either ⌊m/2⌋ or ⌈m/2⌉ unique states, depending on the scan order. Although theoretically possible, we argue that web applications rarely exhibit such a problematic scenario.


Algorithm 1: Removing duplicate web application states.

Input:
  S: set of web application states
  H: family of hash functions
  k: shingle size
  ℓ: MinHash sketch size
  τ: Jaccard similarity threshold
Output: set of states with no near-duplicates
 1: ⟨h_1, ..., h_ℓ⟩ ← sample ℓ functions from H
 2: [g_1, ..., g_ℓ] ← array of ℓ hash tables
 3: index ← ∅
 4: for s ∈ S do
 5:   P ← Shingle(s, k)
 6:   f ← mapping of type N → N+
 7:   for i in range 1, ..., ℓ do
 8:     v ← h^i_min(P)
 9:     for docId in g_i(v) do
10:       if docId ∈ f then
11:         f(docId) ← f(docId) + 1
12:       else
13:         f(docId) ← 1
14:   score ← max_{j ∈ f} f(j)
15:   if score < τℓ then
16:     for i in range 1, ..., ℓ do
17:       v ← h^i_min(P)
18:       if v ∉ g_i then g_i(v) ← ∅
19:       g_i(v) ← g_i(v) ∪ {ID(s)}
20:     index ← index ∪ {s}
21: return index
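Putting the pieces together, the following Python sketch is a direct, self-contained transcription of Algorithm 1. It is illustrative only: the salted CRC32 hash family, the function name remove_duplicate_states, and the assumption that each state is already given as a sequence of DOM element names are ours, not the paper's.

    import random
    import zlib
    from collections import defaultdict

    def remove_duplicate_states(states, k=12, ell=200, tau=0.85, seed=0):
        # states: mapping from a state ID to its sequence of DOM element names
        # (the output of the pre-processing of section 4.1); every sequence is
        # assumed to contain at least k elements.
        rng = random.Random(seed)
        salts = [rng.getrandbits(32) for _ in range(ell)]        # lines 1-2: hash functions
        tables = [defaultdict(set) for _ in range(ell)]          # and the tables g_1..g_ell
        index = []                                               # line 3

        def h(salt, shingle):
            return zlib.crc32(repr((salt, shingle)).encode())

        for state_id, elements in states.items():                # line 4
            shingles = {tuple(elements[i:i + k])                 # line 5: Shingle(s, k)
                        for i in range(len(elements) - k + 1)}
            sketch = [min(h(s, sh) for sh in shingles) for s in salts]
            counts = defaultdict(int)                            # line 6: the mapping f
            for table, v in zip(tables, sketch):                 # lines 7-13: count shared buckets
                for doc_id in table[v]:
                    counts[doc_id] += 1
            score = max(counts.values(), default=0)              # line 14
            if score < tau * ell:                                # line 15: new content
                for table, v in zip(tables, sketch):             # lines 16-19: store the sketch
                    table[v].add(state_id)
                index.append(state_id)                           # line 20
        return index                                             # line 21

With the parameter values used later in the paper (k = 12, ℓ = 200, τ = 0.85), a state is discarded as a near-duplicate only if some previously kept state shares at least 170 of its 200 bucket entries.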

4.4 MinHash Generalization Properties

It is clear that the MinHash algorithm generalizes naïve methods that directly apply hash functions to the entire string representation of the DOM, such as (Duda et al., 2009) and (Frey, 2007). By generalization we mean that for any given pair of states, s_1 and s_2, and any given hash function h, if h(s_1) = h(s_2), then applying our LSH scheme also yields an equality between the states. Therefore, any pair of pages that are considered the same state by a naïve hashing algorithm will also be treated as such by the method we propose.

More interesting is the fact that our algorithm is also a generalization of a more complex method. The algorithm in (Ayoub et al., 2013) identifies repeating patterns that should be ignored when detecting duplicate content. Denote by d_1 and d_2 two DOM strings that differ in the number of times that a repeating pattern occurs in them. More precisely, let d_1 = ARRB and d_2 = ARRR..RRRB be two DOM strings, where A, B, and R are substrings of DOM elements. Let us assume that the method in (Ayoub et al., 2013) identifies the repeating patterns and obtains the same canonical form for both d_1 and d_2, which is ARB. This way, d_1 and d_2 are identified as duplicates by (Ayoub et al., 2013). It is then easy to see that if k ≤ |R| + 1, where the length of a DOM substring is defined as the number of DOM elements in it, then the MinHash approach will also mark these two DOM strings as duplicates: for such k, every k-mer of d_2 fits, at the same phase, inside at most two consecutive copies of R, possibly extended by a suffix of A or a prefix of B, and therefore already occurs in d_1; the converse inclusion is immediate. Hence there is no k-mer that is included in d_2 and not in d_1, or vice versa. This argument does not hold for the case of d_1 = ARB. However, our approach will also mark d_1 and d_2 in this case as near-duplicate content with high probability if k is relatively small compared to the lengths of A, B, and R.
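A quick way to check this property is to compare the k-mer sets of the two DOM strings directly. The sketch below uses toy element names of our own choosing: it builds d_1 = ARRB and d_2 with six repetitions of R, and reports, for a few values of k, whether the shingle sets are identical and what their exact Jaccard similarity is.

    def kmers(seq, k):
        # All windows of k consecutive elements of the sequence.
        return {tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)}

    A = ["a1", "a2", "a3"]
    R = ["r1", "r2", "r3"]
    B = ["b1", "b2", "b3"]
    d1 = A + R * 2 + B   # ARRB
    d2 = A + R * 6 + B   # ARRRRRRB

    for k in (3, 4, 6):
        s1, s2 = kmers(d1, k), kmers(d2, k)
        jaccard = len(s1 & s2) / len(s1 | s2)
        print(k, s1 == s2, round(jaccard, 2))

For k up to |R| + 1 = 4 the two shingle sets coincide, so the estimated similarity is 1; for larger k the sets are no longer identical, but they still overlap heavily, so d_1 and d_2 remain likely to be merged.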

5 PERFORMANCE EVALUATION

In this section we present the evaluation process of the LSH-based approach for detecting similar states. We report the results of this method when applied to real-world applications, and compare it to four other state equivalence heuristics.

5.1 Experimental Setup

We implemented a prototype of the LSH mechanism for MinHash sketches as part of the IBM® Security AppScan® tool (IBM Security AppScan, 2016). AppScan uses a different notion of crawling than most other scanners. It is not a request-based crawler, but rather an action-based one. The crawler utilizes a browser to perform actions, instead of manipulating HTTP requests and directly executing code via a JavaScript engine. While processing a new page, all possible actions, e.g., clicking on a link, submitting a form, or inputting text, are extracted and added to a queue of unexplored actions. Every action is executed after replaying the sequence of actions that resulted in the state from which it originated. The crawling strategy is a mixture of BFS and DFS. As a result, the crawler can go deeper into the application, while still avoiding a focus on only one part of it.

The MinHash algorithm can be combined with any crawling mechanism. The algorithm marks redundant pages, and their content is ignored. Offline detection of duplicate content is not feasible for modern, complex applications with an enormous number of pages, since one cannot first construct the entire set of possible pages. In fact, this set can be infinite, as content might be dynamically generated.

We performed security scans on seven real-world web applications and compared the efficiency of our method with other DOM state equivalence algorithms. The first three applications are simple RIAs, which can be manually analyzed in order to obtain a true set of all possible application states. We were given access to two IBM-internal applications for managing security scans on the cloud: Ops Lead and User Site. As a third simple RIA we chose a public web-based file manager called elFinder (elFinder, 2016). We chose applications that are used in practice, without any limitations on the web technologies they are implemented with.

In order to assess the performance of a state equivalence algorithm, it must also be tested on complex applications with a significant number of near-duplicate pages. Otherwise, inefficient mechanisms to detect similar states might be considered successful in crawling web applications. Therefore, we also conducted scans on four complex online applications: Facebook, the famous social networking website (Facebook, 2004); Fandango, a movie ticketing application (Fandango, 2000); GitHub, a leading development platform for version control and collaboration (GitHub, 2008); and Netflix, a known service for watching TV episodes and movies (Netflix, 1997), in which we crawled only the media subdomain. These applications were chosen due to their high complexity and extensive usage of modern web technologies, which pose a great challenge to web application security scanners. In addition, they contain a considerable amount of near-duplicate content, which is not always the case when it comes to offline versions of real-world web applications. We analyzed the results of scanning the four complex applications without comparing them to a manually-constructed list of application states.

Since a full process of crawling a complex RIA can take several hours or even days, a time limit was set for every scan. Such a limit also makes it possible to test how fast the crawler can find new content, which is an important aspect of a black-box security scanner.

We report the number of non-redundant states found in the scans, along with their durations. However, the number of discovered states does not necessarily reflect the quality of the scan: a state equivalence algorithm might consider two different states as one (false merge), or treat two views of the same state as two different states (false split). Furthermore, the scan may not even reach all possible application states. In order to give a measure of how many false splits occur during a scan, we compute the scan efficiency. The efficiency of a scan is defined as the fraction of the scan results that contains new content. In other words, this is the ratio between the real number of unique states found and the number of unique states reported during the scan. A too-strict state equivalence relation implies more duplicate content and a lower scan efficiency. The coverage of the scan is defined as the ratio between the number of truly unique states found and the real number of unique states that exist in the application. If the relation is too lax, then too many false merges occur, leading to a lower scan coverage. Inefficient scans can have low coverage as well, if the scan time is consumed almost entirely in exploring duplicate content. As the scan coverage computation requires knowing the entire set of application states, we computed it only for the three simple applications. The scan efficiency is reported for all scanned applications.

This paper suggests a new approach to the state similarity problem. Hence we chose evaluation criteria that are directly related to how well the scanner explored the tested applications. We do not assess the quality of the exploration through indirect metrics such as the number of security issues found during the scans. In addition, during the test phase AppScan sends thousands of custom test requests that were created during the explore stage. Such a volume of requests could overload the application server or even affect its state, and we clearly could not do that to applications like Facebook or GitHub. For these reasons we do not report how many security vulnerabilities were detected in each scan. Another metric that is not applicable for online applications is code coverage, i.e., the number of lines of code executed by the scanner. This metric cannot be computed, as we only have limited access to the code of these complex online applications. However, it is clear that scans that are more efficient and have a higher coverage rate can detect more security vulnerabilities.

Our proposed method, MinHash, was compared to four other approaches: two hash-based algorithms, and two additional security scanners that apply non-hash-based techniques to detect similar states. The Simple Hash strategy hashes every page according to the list of DOM elements it contains (order preserved). The second hash-based method, the DOM Uniqueness technique (Ayoub et al., 2013), identifies similar states by reducing repeating patterns in the DOM structure and then applying a simple hash function on the reduced DOMs. As a third approach we used jAk (Pellegrino et al., 2015), which addresses the state similarity problem by computing the Jaccard similarity between pages with the same normalized URL. Another non-hash-based approach was evaluated by using CRAWLJAX (Mesbah and Van Deursen, 2008). The crawler component of the latter tool uses the Levenshtein edit distance to compute similarity between pages. We used the default configuration of these scanners. Section 2 provides more details on the approaches we compared our tool with.

We scanned the tested applications with each of the five DOM state equivalence strategies in turn.


Table 1: Results of security scans on three simple real-world web applications using five different strategies for DOM state equivalence: jAk (JK), CRAWLJAX (CJ), Simple Hash (SH), DOM Uniqueness (DU), and MinHash (MH). The first row shows the real number of unique states in the application, whereas the second row contains the number of non-redundant states reported in each scan. The next two rows provide the coverage and efficiency rates of the scans. In the last row, a runtime of thirty minutes corresponds to reaching the time limit.

                  elFinder                   User Site                  Ops Lead
                  JK   CJ   SH   DU   MH     JK   CJ   SH   DU   MH     JK   CJ   SH   DU   MH
App states        7                          18                         27
Scan states       7    12   38   34   7      18   20   31   21   18     31   33   42   41   25
Coverage (%)      100  100  100  100  100    100  100  100  100  94     100  100  93   89   89
Efficiency (%)    100  58   18   21   100    100  90   58   86   94     87   82   60   59   96
Time (m)          8    7    30   30   10     2    2    4    3    2      9    10   30   23   12

All scans were limited to a running time of 30 minutes. In the MinHash implementation we set the value of k to 12, the number of hash functions, ℓ, to 200, and the minimum similarity threshold, τ, to 0.85. The values of k and τ were determined by performing a grid search and choosing the values which gave optimal results on two of the tested applications. The number of hash functions, ℓ, was computed using a Chernoff bound, so that the expected error in estimating the Jaccard similarity of two pages is at most 7% (consistent with the O(1/\sqrt{\ell}) bound of section 3, since 1/\sqrt{200} ≈ 0.07).

5.2 Results for Simple RIAs

Table 1 lists the results of the experimental scans we conducted on three simple RIAs. The results show that using MinHash as the DOM state equivalence scheme yields fast and efficient security scans, with a negligible impact on the coverage, if any. Scanning simple applications with jAk also produces high-quality results. The rest of the approaches were less efficient, overestimating the number of unique states. Some of the scans involving the other hash-based techniques were even terminated due to the scan time limit. However, since the tested applications were very limited in their size, these approaches usually had high coverage rates.

5.2.1 MinHash Results

The MinHash scans were very fast and produced concise application models. They quickly revealed the structure of the applications, leading to efficient scans also in terms of memory requirements. The proposed algorithm managed to overcome multiple types of changes between pages that belong to the same state. This way, pages with additional charts, banners, or any other random DOM element were usually considered as duplicates (see Figure 2). Similar pages that differ in the order of their components were also correctly identified as duplicate content, since the LSH scheme is almost indifferent to a component reordering. This property holds since a component reordering introduces new k-mers only when new combinations of consecutive components are created. As the number of reordered components is usually significantly smaller than their size, the probability of a change in the MinHash sketch is low. In addition, multiple views of the same state that differ in a repeating pattern were also detected. This is in correspondence with section 4.4, which shows that our method is a generalization of the DOM Uniqueness approach.

Figure 2: Successful merge of two views with the same functionality using the MinHash algorithm. (a) Upper part: a list view of a directory. (b) Lower part: the same view with a textual popup.

There were, however, cases where different states were considered the same, and this led to some content being missed. The missing states were very similar to states that had already been reported in the model. See section 6 for a discussion on this matter.

5.2.2 Results of Other Hash-Based Approaches

The scans produced by the other hash-based methods had a high coverage rate. However, they were sometimes very long, and the number of reported states was very high (up to a factor of five compared to the correct number of states).


Table 2: Results of security scans on four complex real-world web applications using five different strategies for DOM state equivalence. The first column for every application contains the number of non-redundant states reported in each scan. The second column provides the efficiency rate (%). Scans that did not reach the time limit but instead crashed due to memory or other problems are marked with an asterisk.

               Facebook            Fandango            GitHub              Netflix
               States  Efficiency  States  Efficiency  States  Efficiency  States  Efficiency
jAk            1*      100         54      37          26      34          47      32
CRAWLJAX       68      38          32      50          184     30          40      40
Simple Hash    694*    2           493     4           306     22          206     7
DOM Uniq.      692*    3           468     5           266     30          133     17
MinHash        200     83          34      82          108     81          27      78

Some were even terminated due to the time limit. The stringency of the equivalence relations led to inefficient results and to unclear application models. Such models also result in a longer test phase, as the black-box scanner sends hundreds or even thousands of redundant custom test requests.

As expected, the naïve approach of simple hashing was very inefficient. The DOM Uniqueness method had better results in two out of the three tested applications. However, the relevant scans were still long and inefficient (as low as a 21% efficiency rate). Although the algorithm in (Ayoub et al., 2013) detects similar pages by reducing repeating patterns in the DOM, it cannot detect similar pages that differ in other parts of the DOM. Consider, for example, two similar pages where only one of them contains a textual popup. Such a pair of pages was mistakenly deemed as two different states (see Figure 2). A similar case occurred when the repeating pattern was not exact (e.g., a table in which every row had a different DOM structure).

5.2.3 Results of Non Hash-Based Tools

The scans obtained using jAk and CRAWLJAX were also very fast and achieved perfect coverage. However, they were not always as efficient as the MinHash-based scans. The relatively small number of pages and states in the tested applications enabled short running times, although these non-hash-based methods perform a high number of comparisons between states. Of the two methods, jAk was more efficient than CRAWLJAX.

5.3 Results for Complex RIAs

The results shown in Table 2 emphasize that an accurate and efficient state equivalence algorithm is crucial for a successful scan of complex applications. The MinHash algorithm enabled scanning these applications, while the naïve approach and the DOM uniqueness algorithm completely failed to do so in some cases. CRAWLJAX and jAk had reasonable results; however, their efficiency rates were not high.

5.3.1 MinHash Results

The MinHash-based scans efficiently reported dozens of application states, with approximately 80% of the states being truly unique. In addition, the number of false merges was very low. The algorithm successfully identified different pages that belong to the same application state. For example, pages that describe different past activities of Facebook users were considered as duplicate content. This is, indeed, correct, as these pages contain the same set of possible user interactions and the same set of possible security vulnerabilities. A typical case in which the MinHash algorithm outperforms other methods can also be found in the projects view of GitHub, as can be seen in Figure 3. A GitHub user has a list of starred (tracked) projects. Two GitHub users may have completely different lists, but these two views are equivalent, security-wise. Figure 4 depicts another example of a successful merge of states.

At the same time, the algorithm detected state changes even when most of the content had not been changed. For instance, a state change occurs when the Fandango application is provided with information on the user's location. Prior to this, the user is asked about her location via a form. When this form is filled and sent, the application state is changed, and geotargeted information is presented instead. This may affect the security vulnerabilities in a page; hence the need to consider these two states as different.

A successful split can also be found in the results of the Facebook scan. The privacy settings state is altered if an additional menu is open, and this change was detected by our method (see Figure 5). Such a case frequently occurs in complex online applications, where there are several actions that are common to many views, e.g., opening menus or enabling a chat sidebar. Therefore, the number of possible states can be high. However, almost every combination differs from the others, so they are usually considered different states. We discuss a method to safely reduce the number of reported states in section 6.

Figure 3: Successful merge of two different states using the MinHash algorithm. Both sides of the figure show lists of starred (tracked) GitHub repositories. (a) The left list contains information on the repository programming language for the first and last items only. (b) The right list does not contain information on the number of repository copies for the third item. Although the items of the lists are not identical in their structure, the MinHash algorithm can detect that both lists belong to the same application state. On the other hand, the DOM Uniqueness algorithm fails to merge these two states, since the lists do not have the same constant repeating pattern. The DOM structure of every item in every list might be different from the structures of its neighboring items, and thus the items cannot be reduced to a single repeating pattern.

Figure 4: Successful merge of two different states using MinHash sketches. (a) Upper part: a movie overview page. (b) Lower part: an overview page of a different movie.

There were also cases where the MinHash algorithm incorrectly split or merged states. One such case was during the Facebook scan, when analyzing two states that are accessed through the security settings page. One state allows choosing trusted friends, and the other enables a password change. Despite their similarity, there are different actions that can be done in each of these states. Hence, they should be counted as two different states, and were mistakenly merged into one. In the opposite direction, some states were split during the scan into multiple groups because of insignificant changes between different responses. Section 6 suggests several explanations for these incorrect results.

5.3.2 Results of Other Hash-Based Approaches

As was pointed out in (Benjamin et al., 2010), when the equivalence relation is too stringent, the scan might suffer from a state explosion scenario, which requires a significant amount of time and memory. This observation accurately describes the scans involving the other two hashing methods, which often resulted in a state explosion scenario. Even when they included most of the states reported by the MinHash scans, they also contained hundreds of redundant states. This is due to the fact that the crawler exhaustively enumerated variations of a very limited number of states. The Facebook and Fandango scans reported hundreds of states, with more than 95% of the states being duplicate content. Moreover, since the Facebook application is very complex and diverse, the state explosion scenario caused these scans to terminate due to memory problems.

It is interesting to analyze the performance of the naïve approach and the DOM Uniqueness algorithm when crawling the list of starred (tracked) projects of GitHub users. While it is clear that the naïve approach failed to merge two views with different numbers of starred projects, it is surprising to see that the DOM uniqueness algorithm also did not completely succeed in doing so. The reason is that the items of the list do not necessarily have the same DOM structure. Therefore, there is not always a common repeating pattern between different lists, and the lists were not always reduced to the same DOM canonical form.

5.3.3 Results of Non Hash-Based Tools

The efficiency of the non-hash-based scanners dropped significantly compared to their performance on simple RIAs. This is due to the fact that complex online applications pose challenges that are still not well-addressed by these crawlers. CRAWLJAX requires manual configuration of tailored comparators to perform better on certain applications. Moreover, CRAWLJAX is less scalable due to its all-pairs comparison approach.

The efficiency rates of the jAk scans were not high, as the assumption that similar states must share the same normalized URL is not always true. In the Fandango application, for example, pages providing movie overviews contain the movie name as a path parameter, and thus are not identified as similar. The same problem also occurred when jAk scanned Netflix. Multiple pages of this website offer the same content and the same set of possible user interactions, although they are written in different languages.

The crawling components of jAk and CRAWLJAX could not log in to the Facebook application properly, so a fair comparison is not possible in that case.

6 CONCLUSIONS

In this paper we present a locality-sensitive hashing scheme for detecting duplicate content in the domain of web application security scanning. The method is based on MinHash sketches that are stored in a way that allows efficient and robust duplicate content detection.

The method is theoretically proven to be less stringent than existing DOM state equivalence strategies, and can therefore outperform them. This was also empirically verified in a series of experimental scans, in which other methods, whether hash-based or not, either completely failed to scan real-world applications, or constructed large models that did not capture their structure. In contrast, the MinHash scheme enabled successful and efficient scans. Being able to better detect similar content prevents the MinHash algorithm from exploring the same application state over and over. This way, the scans constantly revealed new content, instead of exploring more views of the same state. The MinHash scans got beyond the first layers of the application, and did not consume all the scan time on the first few states encountered.

Figure 5: Successful split of two nearly similar states using the MinHash algorithm. (a) Upper part: the account settings view of a Facebook user. (b) Lower part: the same view with an additional menu open, adding new functionality to the previous state.

Reducing the scan time and the complexity of the constructed application model does not always come without a price. The cost of using an LSH approach might be an increase in the number of different states being merged into one, and this could lead to an incomplete coverage of the application. However, the risk of a stricter equivalence strategy is spending both time and memory resources on duplicate content, and thus achieving poor coverage.

This risk can be mitigated by better optimizing parameter values. The value of k, the shingle length, can be optimized for the scanned application. One can take into account factors such as the average length of a DOM state, or the variance in the number of different elements per page, when setting the value of k for a given scan. Tuning the similarity threshold per application may decrease the number of errors as well. Of course, the probabilistic nature of the method also accounts for some of the incorrect results, as the Jaccard similarity estimations are not always accurate. Increasing the number of hash functions can reduce this inaccuracy.

The LSH scheme can be applied to any web application. However, there are applications in which its impact is more substantial. Black-box security scanners are likely to face difficulties when scanning complex applications. Such applications often rely heavily on technologies such as AJAX, offer dozens of possible user interactions per page, and contain a great amount of near-duplicate content. The MinHash algorithm can dramatically improve the quality of these scans, as was the case with Facebook or Fandango. For applications with less variance in the structure of pages that belong to the same application state, such as Netflix, the DOM uniqueness state equivalence mechanism can still perform reasonably. But even in these cases, the MinHash algorithm reaches the same coverage more efficiently.

A locality-sensitive hash function helps to better detect similarity between states. However, if a state is actually a container of a number of components, each having its own sub-state, then the number of states in the application can grow exponentially. It seems reasonable that applying our algorithm to every component separately would give better results. In this context, component-based crawling (Moosavi et al., 2014) could be a solution for decomposing a state into several building blocks, each of which would be analyzed using the MinHash algorithm.

The tested implementation of the MinHash scheme is very basic. There is a need to take more information into consideration when constructing the DOM string representation of a response. For example, one can mark some of the elements as more relevant for security-oriented scans, or use the values of the attributes as well. Theoretically, two pages can have a very similar DOM and differ only in the event handlers that are associated with their elements. Although this does not usually occur in complex applications, there were some rare cases where different states were mistakenly merged into one state for that reason. Another potential improvement is to detect which parts of the DOM response are more relevant than others. By doing so we can mark changes in elements that are part of a navigation bar as less significant than those occurring in a more central part of the application.

We believe that incorporating these optimizations in the MinHash scheme would make this approach even more robust and accurate. Combined with more sophisticated crawling methods such as component-based crawling, security scans can be further improved and provide useful information for their users.

FUNDING

This study was supported by fellowships from the Edmond J. Safra Center for Bioinformatics at Tel-Aviv University, the Blavatnik Research Fund, the Blavatnik Interdisciplinary Cyber Research Center at Tel-Aviv University, and IBM.

ACKNOWLEDGEMENTS

We wish to thank Benny Chor and Benny Pinkas for their helpful suggestions and contribution. We are grateful to Lior Margalit for his technical support in performing the security scans. The work of Ilan Ben-Bassat is part of Ph.D. thesis research conducted at Tel-Aviv University.

REFERENCES

Amalfitano, D., Fasolino, A. R., and Tramontana, P. (2008). Reverse engineering finite state machines from rich internet applications. In 2008 15th Working Conference on Reverse Engineering, pages 69–73. IEEE.

Amalfitano, D., Fasolino, A. R., and Tramontana, P. (2010a). An iterative approach for the reverse engineering of rich internet application user interfaces. In Internet and Web Applications and Services (ICIW), 2010 Fifth International Conference on, pages 401–410. IEEE.

Amalfitano, D., Fasolino, A. R., and Tramontana, P. (2010b). Rich internet application testing using execution trace data. In Software Testing, Verification, and Validation Workshops (ICSTW), 2010 Third International Conference on, pages 274–283. IEEE.

Ayoub, K., Aly, H., and Walsh, J. (2013). Document object model (DOM) based page uniqueness detection. US Patent 8,489,605.

Bar-Yossef, Z., Keidar, I., and Schonfeld, U. (2009). Do not crawl in the DUST: different URLs with similar text. ACM Transactions on the Web (TWEB), 3(1):3.

Bau, J., Bursztein, E., Gupta, D., and Mitchell, J. (2010). State of the art: Automated black-box web application vulnerability testing. In 2010 IEEE Symposium on Security and Privacy, pages 332–345. IEEE.

Ben-Bassat, I. and Rokah, E. (2019). Locality-sensitive hashing for efficient web application security testing. In 5th International Conference on Information Systems Security and Privacy - Volume 1: ICISSP, ISBN 978-989-758-359-9, pages 193–204. SCITEPRESS.

Benjamin, K., Bochmann, G. v., Jourdan, G.-V., and Onut, I.-V. (2010). Some modeling challenges when testing rich internet applications for security. In Software Testing, Verification, and Validation Workshops (ICSTW), 2010 Third International Conference on, pages 403–409. IEEE.

Berlin, K., Koren, S., Chin, C.-S., Drake, J. P., Landolin, J. M., and Phillippy, A. M. (2015). Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nature Biotechnology, 33(6):623–630.

Broder, A. Z. (1997). On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings, pages 21–29. IEEE.

Broder, A. Z., Charikar, M., Frieze, A. M., and Mitzenmacher, M. (2000). Min-wise independent permutations. Journal of Computer and System Sciences, 60(3):630–659.

Choudhary, S., Dincturk, M. E., Bochmann, G. V., Jourdan, G.-V., Onut, I. V., and Ionescu, P. (2012). Solving some modeling challenges when testing rich internet applications for security. In 2012 IEEE Fifth International Conference on Software Testing, Verification and Validation, pages 850–857. IEEE.

Doupé, A., Cavedon, L., Kruegel, C., and Vigna, G. (2012). Enemy of the state: A state-aware black-box web vulnerability scanner. In USENIX Security Symposium, volume 14.

Doupé, A., Cova, M., and Vigna, G. (2010). Why Johnny can't pentest: An analysis of black-box web vulnerability scanners. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pages 111–131. Springer.

Duda, C., Frey, G., Kossmann, D., Matter, R., and Zhou, C. (2009). Ajax crawl: Making Ajax applications searchable. In 2009 IEEE 25th International Conference on Data Engineering, pages 78–89. IEEE.

elFinder (2016). elFinder. http://elfinder.org/. [Online].

Facebook (2004). Facebook. https://www.facebook.com/. [Online].

Fandango (2000). Fandango. https://www.fandango.com/. [Online].

Fard, A. M. and Mesbah, A. (2013). Feedback-directed exploration of web applications to derive test models. In ISSRE, volume 13, pages 278–287.

Frey, G. (2007). Indexing Ajax web applications. PhD thesis, Citeseer.

Garrett, J. J. et al. (2005). Ajax: A new approach to web applications.

GitHub (2008). GitHub. https://github.com/. [Online].

Gordeychik, S., Grossman, J., Khera, M., Lantinga, M., Wysopal, C., Eng, C., Shah, S., Lee, L., Murray, C., and Evteev, D. (2010). Web application security statistics. The Web Application Security Consortium.

IBM Security AppScan (2016). IBM Security AppScan. https://www.ibm.com/security/application-security/appscan. [Online].

Indyk, P. and Motwani, R. (1998). Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604–613. ACM.

Jaccard, P. (1901). Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Rouge.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, page 707.

Mesbah, A. and Van Deursen, A. (2008). Exposing the hidden-web induced by Ajax. Technical report, Delft University of Technology, Software Engineering Research Group.

Moosavi, A., Hooshmand, S., Baghbanzadeh, S., Jourdan, G.-V., Bochmann, G. V., and Onut, I. V. (2014). Indexing rich internet applications using components-based crawling. In International Conference on Web Engineering, pages 200–217. Springer.

Netflix (1997). Netflix. https://media.netflix.com/. [Online].

Nicol, G., Wood, L., Champion, M., and Byrne, S. (2001). Document object model (DOM) level 3 core specification. W3C Working Draft, 13:1–146.

Pellegrino, G., Tschurtz, C., Bodden, E., and Rossow, C. (2015). jAk: Using dynamic analysis to crawl and test modern web applications. In International Workshop on Recent Advances in Intrusion Detection, pages 295–316. Springer.

Pomikálek, J. (2011). Removing boilerplate and duplicate content from web corpora. PhD thesis, Masarykova univerzita, Fakulta informatiky.

Popic, V. and Batzoglou, S. (2016). Privacy-preserving read mapping using locality sensitive hashing and secure kmer voting. bioRxiv, page 046920.

Roest, D., Mesbah, A., and Van Deursen, A. (2010). Regression testing Ajax applications: Coping with dynamism. In 2010 Third International Conference on Software Testing, Verification and Validation, pages 127–136. IEEE.

Tai, K.-C. (1979). The tree-to-tree correction problem. Journal of the ACM (JACM), 26(3):422–433.