modeling coherency in generated emails by …helped us prepare, process and evaluate our training...

Modeling Coherency in Generated Emails by Leveraging Deep NeuralLearners

Avisha Das and Rakesh M. Verma

University of HoustonDepartment of Computer Science

Houston, TX 77204, USA

[email protected], [email protected]

Abstract. Advanced machine learning and naturallanguage techniques enable attackers to launch sophis-ticated and targeted social engineering based attacks.To counter the active attacker issue, researchers havesince resorted to proactive methods of detection. Emailmasquerading using targeted emails to fool the victimis an advanced attack method. However automatictext generation requires controlling the context andcoherency of the generated content, which has beenidentified as an increasingly difficult problem. Themethod used leverages a hierarchical deep neural modelwhich uses a learned representation of the sentencesin the input document to generate structured writtenemails. We demonstrate the generation of short andtargeted text messages using the deep model. Theglobal coherency of the synthesized text is evaluatedusing a qualitative study as well as multiple quantitativemeasures.

Keywords. Bidirectional LSTM, Doc2Vec embeddings,Proactive defense, natural language generation (NLG).

1 Introduction

Adversarial learning is a major threat to thefield of computer security research. Withthe advancement in technology, the growingdependency on the Internet has exposed usersto serious cyber threats like phishing andpharming. Despite considerable research tocounter such threats, staggering numbers ofindividuals and organizations fall prey to targetedsocial engineering attacks incurring huge financiallosses.

Although attackers change their strategies,previous research [4] has shown that electronicmails (emails) are a popular attack vector.Emails can be embedded with a variety ofmalign elements [11] like poisoned URLs tomalicious websites, malware attachments as wellas executables, documents, image files, etc.Anti-Phishing Working Group (APWG) reportsover 270,5001 unique phishing email campaignsreceived in the 3rd quarter of 2018, rising from atotal of around 233,6002 unique reports identifiedin the 4th quarter of 2017. Phishing reportsalso reveal the consistent rise in phishing attackstargeted towards financial institutions like paymentprocessing firms and the banking sector. Thestatistics demonstrate how the threat is worseningas attackers continue to devise more sophisticated(and maybe more effective) ways of scammingvictims.

Innovative and unseen attack vectors can trickpre-trained classification techniques [37], thusplacing the victim at risk. In email masqueradingattacks, attackers after compromising the emailaccount of an individual can carefully construct afraudulent email then sent to the contacts knownto the compromised individual. This has seriousimplications, because the attacker has gaineduninterrupted access to the inbox, outbox andother private details of the compromised person.Thus by exercising caution, the attacker canemulate the content and context of the emails

1http://docs.apwg.org/reports/apwg trends report q3 2018.pdf2http://docs.apwg.org/reports/apwg trends report q4 2017.pdf

arX

iv:2

007.

0740

3v1

[cs

.CL

] 1

4 Ju

l 202

0

written by the individual and can communicate withhis contacts as a legitimate entity, successfullyevading detection and causing harm to the victim.

However, construction of the perfect deceptiveemail requires fine-tuning and manual supervision.While a fake mail constructed manually by anattacker can guarantee a higher chance ofsuccess, the process is both time and laborintensive. In contrast, an automated text generatorcan be trained to synthesize targeted emails muchfaster and in bulk, thereby increasing the oddsof a successful attack. However, the bottleneckin this case, lies in whether the system cangenerate high quality text, free from common flagslike misspellings, incorrect and abusive language,over-usage of action verbs, etc., which can bepicked up by a classifier easily. Thus, proactiveresearch in this area of deception based attacksusing email masquerading techniques requiresfurther sophisticated experimentation.

Advances in the field of natural language pro-cessing have introduced newer and sophisticatedalgorithms which enable a machine to learn andgenerate high-quality textual content on a givencontext. Grammar based tools like the DadaEngine [2], N-gram language models [7] as well asdeep neural learners [44] have been used to studyand replicate natural language based attacks. Theaim is to facilitate proactive research by predictingnewer attacks and reinforce against such unseenyet impending threats.

At the hands of an attacker, language generationtechniques can become dangerous tools fordeception. With access to proper training data,deep learning neural networks are capable ofgenerating textual content. This property hasbeen leveraged by researchers for generatingtweets [36] and poetry [15], [43], etc. While limited,proactive research has been pursued by usingdeep learners for generation of fake reviews [44],grammar based techniques [2] as well as simplisticdeep networks [9] have been leveraged for emailgeneration. Thus, we can assume that it is not longbefore phishers and even spammers resort to suchtechniques to generate newer kinds of maliciousattack vectors.

Following a proactive mode of study, we identifythe underlying implications of how an automated

machine learning technique, here, deep learnerscan be leveraged to synthesize email bodies forthe purpose of email masquerading attacks. Alongwith demonstrating the systems’ performanceusing qualitative and quantitative methods, westudy the effectiveness and practicality of suchsystems by comparing a hierarchical deep networkwith a baseline word prediction model. Our keycontributions are as follows:

— A collection of the most common signals thatset apart a malicious email from its legitimatecounterpart (Section 2.1) as mentioned inprevious literature [40], [11], [14] on analyzingphishing emails. We focus on cues thatare more prevalent in email bodies for ourevaluation. This is necessary to observe thequality of the system generated emails.

— Leveraging deep neural networks for generat-ing targeted email bodies (Section 2.3). Whilegeneration of coherent emails is challeng-ing [9], we use a hierarchical network thatconsists of two stages - an architecture whichuses a word prediction model to generateprobable candidate sentences which are thenpassed onto a sentence selection model,based on distributed vector representationsof the email content, to select the bestpossible set of sentences. Such a two-stagedarchitecture should be suitable for generatinglonger content while maintaining coherency.

— Comparing the performance of the hierarchicalsystem with a baseline word prediction basedmodel by using multiple quantitative andqualitative metrics (Section 4). Additionally,we analyze the effectiveness of our systemby measuring the syntactical correctness,coherency, fluency and legitimacy of the fakeemails by conducting a human evaluation.

2 Background

This section presents a list of cues usuallyobserved in spoofing emails that demarcate suchattack vectors from their legitimate counterparts.Highlighting and studying such common signalshelped us prepare, process and evaluate our

training data as well as the generated emails.The goal of the method is defined after studyingthe common features in malicious emails, followedby a detailed description and demonstration ofthe baseline word prediction and the hierarchicalsentence selection models used for the generationtask.

2.1 Textual features in spoofing emails

Use of textual features, like presence of commonaction words, organization names, poisoned linksto malicious webpages of financial institutions,grammatical errors, etc., is common in phishingemail detection methods [40], [41], [39]. Moreover,researchers have widely studied spam, phishing,spear phishing emails to identify common signsthat appear across malicious emails [11], [14].However, since such signals are certain signs ofmalign intent an attacker would consciously avoidincorporating these words in a targeted email.Assuming this in mind, the generator should alsolearn to identify and eliminate overuse of suchwords.

Therefore, we curate a list of textualcues frequently used in spoofing emailsafter careful review of phishing emailliterature [11], [40], [41], [39], [14], [6]. Thelist of these textual cues along with examples havebeen provided in Table 1. Researchers prefer totrain their proposed detection methods on publiclyavailable data. Since phishing emails are fairlyrare, we base our evaluation on the largest publiclyavailable dataset of malicious emails: NazarioPhishing Corpus.3 The Base-64 encoded HTMLcontent in the emails are filtered out and finally3,392 fairly clean emails with textual content (>10words) are used to extract the enlisted spoofingcues.

While machine learning systems can detectcommon cues, these detectors largely dependon historical data. To keep up with advancedreinforcement techniques, a phisher also resortsto employing sophisticated techniques for makingtheir attacks more targeted to increase rate ofsuccess. Thus, for social engineering basedattacks like email masquerading, spear phishing,

3 https://monkey.org/˜jose/phishing/

Table 1. Common Spoofing Cues in Phishing Emailbodies

Feature Types Examples

Organization Names

(a) Financial like eBay, PayPal, Bank of America,Western Union(b) Government like Internal Revenue services,United Parcel Service(c) Software like Dell, Microsoft, Apple

Action Verbs andUrgency Adverbs

(a) Action verbs like click, follow, visit, go, update,apply, submit, confirm, cancel, dispute, enroll,login, answer, reply(b) Adverbs implying urgency like today, instantly,straightaway, straight, directly, once,urgently, desperately, immediately, soon, shortly,presently, before, ahead, front

Persuasion Principles

(a) Authority like an email from a bank asking thevictim to update password of his online account(b) Social proof denoted by Emails fromthe IT department of the target’s institution(c) Distraction using emails where a targetis tempted to click a link in order toreceive a prize(d) Reciprocation appealing the victim to respondlike resetting a password or paying a billby clicking a link to a fraudulent website

Misspelled Words Typographical errors like Paypl, Bnk Amrica, etc.

Presence of Links URLs to malicious websiteslike https://www.maybank2u.com.my, etc.

Other languages Non-English words like Aviso Importante de BBVA,societe, Transaktionen

or targeted phishing, an attacker may chooseto avoid such easily identifiable red flags whilegenerating fake emails. Therefore, in our proactivestudy, we also refrain from overuse of suchspoofing cues in the synthesized emails.

2.2 Task Description

Email masquerading is steadily growing into aserious cybersecurity threat and has startedgaining much attention from security researchers.This paper aims at providing a proactive paradigmto this issue. Some of the key questions are:Given a large dataset of manually written emails,can an automated system learn to emulate thewriting style of a human? Is the generated emailcontent coherent and syntactic? Can an individualdifferentiate between a system generated emailand a manually written one?

The hierarchical model is compared with abaseline word generation architecture. Further-more, the system performance is studied usingmultiple quantitative metrics along with a qualitativeevaluation conducted through a human survey.

2.3 Architecture for Text Generation

Textual content can be considered as a sequenceof words and characters placed together to conveymeaningful information. In the realm of textgeneration, deep neural architectures have seenunprecedented success in emulating one’s writingwhen trained on huge amounts of written textualcontent [44], [15], [38].

Recurrent Neural Networks (RNNs) are ca-pable of retaining information learned from textsequences – helpful for learning representationsof word sequences in the input text. Thetrained language model can subsequently gener-ate samples similar in form and context to theinput data. We leverage this ability of RNNsfor our proactive protection scheme - generationof targeted emails suitable for spear-phishing ormasquerading attacks. Moreover, Long ShortTerm Memory (LSTM) Networks, an improvedversion of RNN, have proved to be better athandling dependencies in longer sequences oftext [27], [44].

The architectures use words as units forgeneration [43], [21] by leveraging LSTMs asbuilding blocks for learning the language model.We use a hierarchical model that merges sentenceselection with iterative text generation to generatethe best set of human readable samples. Wecompare the architecture, sampling and generationphases for these two architectures followed by anevaluation of their performance.

2.3.1 Training a Word Prediction Model.

Our first model is a straightforward word-basedlanguage model built using RNNs [43]. Inour implementation, we use Bidirectional LSTMs(Bi-LSTMs) [19] as the network to build themodel. Figure 1 shows the overall model forword prediction. We describe the training andgeneration phases for this baseline generationarchitecture.

Training Phase. We use Bidirectional LSTMs(Bi-LSTMs) as our network for text generationmodel. The input to our text generation system areone-hot vector sequences of words (w0, w1, w2, ...,wN−1) where N is the sequence length. At everytime step t (starting with 0), we feed a word (wt)

into the hidden layer which predicts the next wordwt+1. Through error optimization, the model learnsto capture the best representation of coherent wordsequences that constitute the textual content to begenerated - in this case, the body of an email.

Generation Phase and TemperatureRegulation. During the generation phase,we feed a sequence (N ) of seed words(Wseed0

,Wseed1,Wseed2

, ...,WseedN ) into thetrained Bi-LSTM model, used to start off theword generation system. When the model gets aseed word (Wseed0

) as input, it outputs the nextword (W1) by selecting the one most likely tooccur after Wseed0

depending on the conditionalprobability distribution, P (W1|Wseed0). When thisaforementioned step is extended for an inputsequence of N seeds, the model (Figure 1) cangenerate a text body of N + 1 words, the N + 1th

word being the output.The final layer of the model, which calculates the

above conditional probability, is a softmax normal-ization which is used for computing the distributionfor the next word followed by subsequent sampling.We use temperature(τ) as the hyper-parameterfor selecting our word samples - regulating theparameter τ in Equation 1 encourages or controlsthe diversity of the generated text. The noveltyor eccentricity of the generative model can beevaluated by varying the temperature parameterbetween 0 < Temp. ≤ 1.0. While, lower values ofτ generate relatively deterministic samples, highervalues can make the process more stochastic.

Equation 1 shows the probability distributionbuilt by the model for the sequences of wordsalong with the incorporation of temperature controlparameter(τ ) P (Wt+1|Wt′≤t) = softmax(Wt),

P (softmax(W jt )) =

eWjtτ∑n

j=1 eWjtτ

(1)

2.3.2 Hierarchical Sentence Prediction Model.

Controlling global coherency and structure inautomated text generation is a non-trivial task.Using a straightforward word prediction modelmakes it increasingly difficult to control the qualityas well as coherency of the generated text. We use

LSTM LSTM LSTM

LSTM LSTM LSTM

Bidirectional LSTM

<Word 1> <Word 2> <Word N>... ...One-hot Encoded Sequence of Seed Words

LSTM

LSTM

<Word 3>

<Word 2> <EOS>

SoftmaxActivation

Layer

Predicted Sequence of Words

<Word 3> <Word 4>

Fig. 1. Word generation model with Bi-LSTMs

LSTM LSTM LSTM LSTM

LSTM LSTM LSTM LSTM

Bidirectional LSTM

Doc2Vec Embeddings Vectors

<Sentence 1> <Sentence 1> <Sentence N>... ...

Sequence of Input Sentences

Fully Connected Dense Layer

Vector of the nextpredicted sentence

Fig. 2. Sentence Selection model using Doc2Vec with Bi-LSTMs

an hierarchical model [10] consisting of two stages- generation of sentence candidates4 followed bysentence selection model. For the sentenceselection model, we use Doc2Vec [26] embeddingsto learn representations of the email bodies, whereeach email is treated as a document. Given a setof sentences as starting point, we describe how thetrained Doc2Vec model is used to generate a newsentence.

Sentence Selection using Doc2Vec. Use ofembeddings has been regarded as a favorable wayto represent context in a piece of text, either in theform of word phrases, sentences or paragraphs.Doc2Vec [26] can effectively learn the numericrepresentation of a paragraph or even a document,irrespective of its length. In this model, we useDoc2Vec to learn better representations of thesentences in a piece of text - for example, the bodyof an email. Figure 2 shows the stages of thesentence selection model. The model starts withtraining a Doc2Vec model on the entire document.The goal is to predict the next sentence (here,its vector representation) given a sequence ofsentences as starting point. As shown in Figure 2,using the trained Doc2Vec model, we vectorize aset of seed sentences and then feed the vectors tothe Bi-LSTM layer in order to learn the model forsentence prediction.

4sequences of words highly likely to appear together in asentence in email body

Generation of Sentence Candidates. Thefirst stage deals with the generation of a set ofcandidate sentences and/or phrases. The inputto the whole model is a set of seed sentencesequences. The first level of the network isa trained word prediction model which takes asinput the last N words from the given seedsentences. The model architecture is sameas the one described in Section 2.3.1. Thissequence of N words is used to generate a setof candidate sentences which are transformedinto their Doc2Vec vector representations forcomparison and selection.

Generation of newer sentences. The seedsentences are fed into the trained Bi-LSTM-basedDoc2Vec model for sentence selection. Thisconstitutes the second stage of the hierarchicalarchitecture and selects the best sentence vectordepending on the input sequences. The generatedvector is then compared with the selectedcandidate sentences - output of first level usinga cosine similarity function. The most similarcandidate is then produced as output of thehierarchical generation model.

Generation phase and Temperature Regula-tion. The parameters used and the purposefor temperature regulation during the generationphase using the hierarchical model is similar tothe generation phase explained in Section 2.3.1.The selected best sentence from the generated

candidate sentences or phrases are merged withthe sequence of input seeds and fed into the modelto generate newer sentences.

Therefore, the complete hierarchical or multi-stage architecture has been shown in Figure 3.The sequence of sentences, called seed sen-tences (Sent1, Sent2, ..., SentN ), are first chosenas input to be given to two pre-trained models- one, a word-based language model and theother a model trained on Doc2Vec embeddingsof sentences in documents. The word-basedlanguage model uses a sequence of the last Nwords from the given sentences as input. Itthen outputs the most probable word to appearin the N + 1th position based on the trainedmodel. The model repeats this step in a feedbacksetup to generate sequences of sentences.These generated sentence sequences are calledcandidate sentences. We refer to these generatedcandidate sentences as Sentcand1, Sentcand2, ...,SentcandX .5

The seed sentences (Sent1, Sent2, ..., SentN )are converted to their Doc2Vec embedding vectors– Sentd2v1, Sentd2v2, ..., Sentd2vN – using thetrained Doc2Vec vector transformation model.These sentence vector representations are thenfed into the BiLSTM-based sentence selectionmodel which had been trained on sentence embed-dings using Doc2Vec representations (Figure 2).Given a sequence of sentence embeddings, thismodel selects the most suitable sentence vector tofollow the given sequence.

In the final step, we compare each generatedcandidate sentence to the selected sentencevector. To explain this step, we select the candidatesentence Sentcand2 and convert it into its Doc2Vecrepresentation (Sentcand−d2v2). The selectedsentence vector from the Doc2Vec sentenceselection model given the N seed sentencesis Sentd2vN+1. We use the Cosine similaritymetric to calculate the similarity between the twovectors Sentcand−d2v2 and Sentd2vN+1. This stepis repeated for all the candidate sentences andthe sentence candidates with the highest similarityvalues are then chosen as the output of the model.

5X refers to the number of candidates to be generated.

3 Data Collection and Setup

A large amount of high quality data is requiredfor text generation. This becomes crucial whenautomating the composition skill and pattern ofan individual. Here, we describe the source andcollection steps of the data used in the email bodygeneration task.

3.1 Data Collection

To produce the best sample to emulate atargeted attack, the trained generation modelmust synthesize an email body similar to anemail composed by the individual whose writingstyle it is trying to emulate. Thus, for trainingour model we make use of the largest publiclyavailable source of legitimate emails, the EnronCorpus [12]. Table 2 summarizes the statisticsabout our dataset. We evaluate the averagenumber of sentences, average vocabulary as wellas average number of words in a typical email body.The step-wise data collection process along withthe data preparation and pre-processing steps aregiven below.

Building the dataset of legitimate emails.We use the largest publicly available datasetof benign emails - Enron Corpus.6 Since weaim at synthesizing the writing style of humans,it is necessary to make the process of emailmasquerading more effective. For building ourtraining and evaluation dataset, we make useof ‘legitimate’ emails which we collect based onthe following assumptions. We assume that theemails that an individual, within Enron, receivesare legitimate i.e., they have been identified asbenign by the mail server. Also, the emails thatthe individual sends out are also benign. TheEnron corpus consists of a large number of emailsranging from spam, deleted advertisements, etc.,which is useless for training purposes. Hence,for our purpose, we use emails from two types ofemail directories - the Inbox (Received folder) andOutbox (Sent folder) of the individuals within Enroncorpus.

6https://www.cs.cmu.edu/~enron/enron_mail_

20150507.tar.gz

https://www.cs.cmu.edu/~enron/enron_mail_20150507.tar.gz


LSTM LSTM LSTM

LSTM LSTM LSTM

Bidirectional LSTM

<Sentence 1> <Sentence 2> <Sentence N>... ...

Sequence of Input Sentences

<Word 1> <Word 2> <Word N>... ...

Sequence of Seed Words

Trained Bi-LSTM Word Prediction Model

Doc2Vec VectorTransformation

.

.

.

Generated CandidateSentences

<Candidate 1>

<Candidate 2> . .

<Candidate N>

LSTM LSTM LSTM

LSTM LSTM LSTM

Bidirectional LSTM

Trained Bi-LSTM Doc2Vec Sentence Selection Model

CandidateSentenceVectors

Selected NextBest Sentence

Vector

Doc

2Vec

Vec

tor

Tran

sfor

mat

ion

Similarity Calculation

Most Similar Candidate Sentence

Fig. 3. Word prediction with sentence selection using Bidirectional LSTMs

The collection process involved the followingsteps: (a) We use the collection of 517,401emails from the publicly available Enron corpus;(b) We choose the emails that are composed ofa minimum of 10 words and contain a minimum ofone sentence; (c) We also calculate the averagestatistics of the dataset to determine the genericlength, number of words, vocabulary size of theemails we plan to synthesize.

Table 2. Legitimate Data Statistics

Attributes ValuesDataset Size 517,401Total Number of Words 167,692,695Avg. Number of Words 324Total Vocabulary 2,028,671Avg. Vocabulary 143Avg. Sentence Length 25Total Number of Sentences 6,590,091Avg. Number of Sentences 13

3.2 Assumptions and Data Preprocessing

We assume a typical scenario where the attackerhas compromised and gained access to the mailaccount, all email communication, personal details,email contact list, etc.

Our primary focus is on the automatedgeneration of textual content of the body ofan email. Hence, we do not account for thegeneration or evaluation of header information inan email. Moreover, we also do not consider thegeneration of email attachments, links or emailaddresses that maybe present in the body. Thegoal is to make the synthesized email contentmore generic to allow inclusion of personalizedinformation like named entities, location names,etc., depending on the communication betweenthe victim and the compromised individual. Also,from the architectural viewpoint, including namedentities, personalized details, etc., is unnecessaryduring training phase and will eventually blow upthe vocabulary of our word-based architecture,therefore confusing it. To avoid the inclusion ofunnecessary word tokens and to generalize thecontext, we replace the named entities in the email

bodies with the ent tags - this includes replacingperson names as well as locations (if applicable).We use the Entity Recognizer Module implementedin SpaCy for Python 3.6 for entity tagging andreplacement.

In most cases, the body of the Emails containelements which are of little or no value to ourtraining and evaluation process. We apply thefollowing pre-processing steps to clean our datawhile making sure not to delete any importantinformation:

— We remove all trailing spaces, newlinecharacters, etc., from the text body

— Replacement of named entities (person,location, etc.) with ent tag

— We replace Email addresses in the body withemailID tag

— Replacement of URL links with the link tag

— Adding the Start of Text (<SOT>) and Endof Text (<EOT>) tags to the email bodies torespectively mark the beginning and end of thecontent.

— Sanitization of non-ASCII characters from thetext

— Removal of HTML text fragments and brokenlinks from text body

— We also lowercase our textual content andremove special characters like #, $, %, , etc.

3.3 System Setup and Methodology

The system was developed in Python 3.6 usingKeras (Version 2.2.4) and TensorFlow (Version1.11.0). In our experimental evaluation, the LSTMnetwork consists of 128 hidden units, since weare using Bi-LSTMs, the total number of states is256. We consider an unrolling size of 15, i.e., thenetwork looks back up to a sequence of 15 wordsto predict the next probable word. Among the otherhyperparameters, we consider a batch size of 50with a learning rate 10−2. Each architecture wastrained for a total of 50 epochs for building thelanguage model.

The above set of hyperparameters has beenchosen based on our empirical evaluation asexplained in this section. We considered ahyperparameter optimization scheme where welook at the GridSearch technique provided by theScikit-Learn library. We tune the networks for batchsizes 35, 50, 75 and epochs of 10, 20 and 50. Wealso experiment with LSTM hidden units of 128 and256. We choose these values since the trainingtime is much more reasonable. We manuallyselected the unrolling size length by experimentingwith values between 10 to 25 in increments of5 units. Similarly, the learning rate (10−2) wasselected after running the algorithm with learningrates of 10−3, 10−2 and 10−1.

All our experiments were conducted on a serverwith 4 Tesla M10 GPUs using CUDA (Version9.1.85) with a 3.20GHz Xeon CPU E5-2667and 512 GB of memory. We use Adam [25]algorithm for the gradient optimization in theword-based as well as the Doc2Vec languagemodeling architectures. For the calculation of theloss functions, we use categorical crossentropyfor the word-based model and a regression basedloss function, Log-Cosh for learning the model forDoc2Vec embeddings.

For the word prediction model, the starting seedof 15 words were chosen randomly from a setof starting sentences in the email bodies. Forgenerating text using the hierarchical architecture,the system was given a set of candidate sentencescomprising of the first sentence of 10 randomlyselected legitimate emails from the Enron corpus.

4 System Evaluation and Analysis

This section describes the experimental setup forthe training and generation phases of the baselineand deep generation model. The evaluation setupfor analyzing system performance based on thequantitative and qualitative nature of the generatedcontent as well as their practical implications havebeen discussed in detail.

4.1 Evaluation Setup

For our evaluation setup, we use a total dataset of517,140 emails. We set apart 5% of the datasetfor validation during training and the rest was usedfor training the model. The separate training andvalidation subsets are crucial for determining theperformance of our prediction model on unrelateddata. We train the deep generative models eachfor 50 epochs with the validation of the modelperformance after every epoch, finally selectingand saving the model that achieves the bestperformance on validation data.

4.2 Evaluation Metrics

We evaluate the performance of the generativemodel along three dimensions: (a) a quantitativeevaluation using a word-gram based measure ofoverlap and uniqueness between the synthesizedand the targeted email bodies as well as modelperplexity; (b) a simple qualitative evaluationwhere we inspect the samples generated by thegenerative word-based and hierarchical modelsby selecting some samples; and finally, (c) ahuman evaluation which consists of a survey with6 participants providing feedback on the overallsyntax, coherency, and fluency of the synthesizedemail bodies as compared to their legitimatecounterparts.

4.2.1 Quantitative Evaluation

Measure of model perplexity is an establishedmethod to determine the ability of a lan-guage model. We compare the perplexityof the word-based model (Generatorword) withthe deep hierarchical sentence selection model(Generatorsentence). A lower perplexity ensures amore stable predictive ability. Model perplexity isdefined as:

PPL = 2NLLT (2)

Here, NLL refers to negative log likelihood of themodel and T is the length of the training sequencefor which the model perplexity is measured. A moredetailed definition for a word-based generationmodel:

PPL = 2−1T

∑T−1t=0 log(P (Wt+1|W0...Wt)) (3)

During our second phase of quantitative evalua-tion, we use a novel metric to observe the se-mantic coherence across the generated samples.Coherency in an email should account for theadequacy in information between the sentences oreven words in the email body [27]. Taking intoaccount the mutual information between a wordand its predecessors in the sentence to calculatesemantic coherence, we can define the followingmeasure:

Coherence =1

T

∑t

(log pfwd(wt|wt−1)+

log pbwd(wt−1|wt))

(4)

The mutual information between two words actsas a measure of how likely it is for the words toappear together in the context, thus a higher valueof mutual information means better coherence [20].We take into account both the forward as well asthe backward probabilities of occurrence of thebigrams (wt,wt−1). To control the influence of thelength of the generated text, the values are scaledby the total length (i.e., number of words) of thegenerated sentence/sequence.

Measuring perplexity can provide a false senseof language model’s performance - while a lowvalue denotes the model’s ability to replicate theinput text; it may emulate the characteristics ofthe input text too much. To measure the overlapbetween the generated and targeted email text,we consider the ratio of common word-N-gramsbetween the ground-truth (or legitimate) email andits ‘fake’ counterpart. For evaluation purposes, weselect N = 3, i.e., trigram overlap.

Table 3. Quantitative evaluation using generated emails

Model Perplexity Coherence TrigramOverlap (%)

Generatorword 8.97 2.8 43.7Generatorsentence 4.59 4.9 66.8

The results of the evaluation have beenpresented using a dataset of 100 generatedemails. The coherence measure has beencalculated based on the language model built usingthe Brown Corpus7 available with NLTK package

7https://www.nltk.org/book/ch02.html

in Python. We use a list of top 1000 mostcommon trigrams in the legitimate emails datasetfor the calculation of our trigram overlap. Weobserve an increased measure of coherence aswell as occurence of common trigrams in theemails generated by the sentence based models.

4.2.2 Qualitative Evaluation

The most simplistic form of qualitative evaluation isobserving the nature of randomly selected samplesfrom the generated text. We have defined commonmalicious cues in Section 2.1. Drawing motivationfrom Table 1, we inspect the generated emailcontent for signs of obvious malign intent as wellas other prominent features which can be used toevaluate our system performance.

Here, we include two samples each, generatedusing the word-based model and the hierarchicalmodel at different temperatures for the purposeof our analysis. One noticeable feature of theemails generated by the systems is the number ofsentences in the samples. While the more genuinelooking samples selected from the outputs of theGeneratorsentence model tend to be longer i.e.,consists of more than one sentence, the samplesfrom the baseline model (Generatorword) arecomparably more terse. Also, longer sequencesfrom the deep generation model generated at ahigher τ tend to include more ent tags at randomplaces within the text.

(A) Samples generated by Generatorword:

A1. Sampling at τ = 0.9: ent Couldyou give me a moment for why I wouldappreciate everyone’s support . ent Bestwishes, entA2. Sampling at τ = 0.5: Hi , I hope youare having a good ent interview. ent

(B) Samples generated usingGeneratorsentence:

B1. Sampling at τ = 0.5: Yes . We had entthem to set up a date . ent We are now inLondon on Monday for Christmas.B2. Sampling at τ = 1.0: Hi ? ent Whatdo you have the steps or makes it a minute.ent P.S. You can ent email ent with changesto a reports that ent has delivered several tothe members with the office

Human Evaluation. While automated quali-tative and quantitative evaluation methods seembelievable, these measures just paint half thepicture. One of the main contributions of proactiveresearch in cybersecurity involve strengtheninghumans/individuals using the web and internet.Since humans are considered the weakest link insecurity research, we evaluate the efficacy of thesefake emails by conducting a human evaluationstudy.

We provide our 6 participants with a survey of 24email bodies. For each email body, the participantsare required to rate the quality of the email contentbased on three attributes - syntactical correctness,coherency and fluency. The scores for each ofthese attributes are based on a Likert scale ∈[1,5]. We also ask each participant to identifywhether the email body is legitimate or not -which requires a yes/no response. Of the 24email bodies, 12 emails are genuine or manuallywritten and the rest are fake or system generatedemails. Of the 12 system generated emails, weconsider 7 emails which are generated by oursentence selection model (Generatedsentence) and5 emails generated by our baseline word predictionmodel (Generatedword). We choose a balancedratio to discern the effectiveness of our humanparticipants in detecting fake emails from theirgenuine counterparts.

Table 4 demonstrates the results of the humanevaluation setup. We report the scores forsyntax, coherency and fluency, averaged acrossall participants, on the combined set of systemgenerated emails - Generatedall. We also reportthe same on the subset of emails generated byeach of the baseline and deep generation models -Generatorword and Generatedsentence respectively,to compare the difference in system performance.We also report the scores on the combined

set of system generated emails - Generatedall.We further compare the statistical significancebetween the results on the manually written emails(Truth in Table 4) with Generatedall - syntax: 4.01(p− value = 0.1377), coherency: 3.31 (p− value <10−5) and fluency: 3.19 (p − value < 10−5). Weobserve that while the difference in syntax scoresbetween the legitimate and generated email bodiesare not statistically significant,8 the generatedcontent is still behind in terms of coherency andfluency. However, in contrast, we observe that ourhuman participants tend to have a low detectionrate (≈57%) when it comes to generated emails,as observed in Table 4, with the detection ratedropping to approximately 38% in case of emailsgenerated by the sentence selection model. Wealso include the mean and standard deviation inthe number of words (W , SDW ) and sentences (S,SDS) in the generated and the manually writtenemails used in our survey.

5 Related Work

The need to counter the active attacker issue hasgiven rise to proactive research methods. Whilethere exists classical techniques for phishing emaildetection [41], [6], [3], state-of-the-art maliciousemail detectors fail to detect a sophisticated ortargeted attack. Riding on the wave of proactiveresearch, many researchers have delved deeperinto the realm of attack generation for differentattack vectors but none as serious as emails.While the use of fully automated methods for textgeneration has been considered, there have notbeen much investigation into automatic modelingof coherent emails which can be used for targetedattacks.

5.1 Natural Language Generation

Use of deep neural networks have enabledbuilding fully (or partially, with feature engineering,)automated models for natural language generation.From the perspective of written text, a substantiallytrained deep network is capable of emulating thewriting style of an individual. This property has

8p − value calculated using the Unpaired Two SamplesWilcoxon test or Mann Whitney test

been leveraged in natural language research bymaking deep learners write a wide variety of textShakespearean Sonnets [43], poetry [15], [45],answer generation [28].

While the use of grammar [2], templates [7], [8]and statistical language based models (e.g.N-grams [17]) are popular, Recurrent NeuralNetworks (RNNs) have been shown to be amore suitable choice owing to their ability tolearn dependencies across the textual context [18].Long-Short Term Memory (LSTM) networks aremore suitable for longer text sequences. However,while a fully automated system seems lucrative -controlling the coherency, topic and structure of thegenerated text can be quite challenging.

Using RNNs for sequence-to-sequence learningis a popular practice for text generation as shownin [13], [46], [31], [23]. Since, simple encoder-decoder architectures fail to model importantor meaningful words and phrases, [24], [13]use attention based encoder-decoder models forpreserving coherence and context in the generatedtext. Other generation techniques include deeplearning with Markov Models [42], variationalauto-encoders [32], generative adversarial net-works [30]. While the aforementioned researchworks experiment with the architectures, suchstrategies are prone to generating incoherentcontent as the length of the generated textincreases. The architecture used in this paperattempts at fixing this issue at a more global level -by implementing coherence in each sentence andthen selecting the best sentence to be included inthe generated text.

5.2 Attack Generation

Growth in the field of proactive research hasbecome more prominent in an effort to counteractive attackers. Phishing is a largely unsolvedcyber threat, worsened more by the proliferationin spear-phishing attacks. The ability of theperpetrators to deceive an individual by behavingas a legitimate entity can be automated forwidespread social engineering attacks as studiedin [29] and [22]. Researchers in [2], [9], [16],look at ‘weaponinizing’ advanced machine learningtechniques to launch sophisticated yet automated

Table 4. Human evaluation results on the automated and true emails. Scores for Syntax (Syn), Coherency (Coh) andFluency (Flu) vary between [1, 5]. Detection Rate has been reported as a percentage

Email Content Scores DetectionRate W SDW S SDSSyn Coh Flu

Generatedword 3.63 2.8 2.6 83.33 12 2.88 1.2 0.45Generatedsentence 4.29 3.67 3.62 38.09 13 6.21 2 0.81Generatedall 4.01 3.31 3.19 56.94 12 5 2 0.78Truth 4.42 4.56 4.47 75 24 14.8 3 0.94

targeted attacks. While [2] uses a grammar-basedapproach for synthetic email generation; Das et.al. [9] uses a more automated deep neural networkfor email generation, which suffers from incongruityin the generated context as shown by theirevaluation. Other studies in an adversarial setting,which leverage natural language techniques, havebeen pursued in spreading malicious Twittermessages [33], generating malicious URLs [1],generation of fake reviews [44] as well as textmessages [35]. Automated means of synthesizingsophisticated attack vectors reduces the manuallabor and provide phishers an opportunity to launchtargeted attacks on a much larger scale andmagnitude. This in turn increases the chances ofsucceeding in an attack.

5.3 Emails as Attack Vectors

Emails are the most common and preferredmethod for social engineering attacks. [11]describes the modus operandi and the structure ofa common phishing email. Researchers have alsodelved deeper into the attributes and underlyingpsychological features that cause phishing andsocial engineering attacks to be successful in [14],[34], [16]. Techniques for automatic generation ofsynthetic emails have been discussed in [2], [7], butintroducing attributes of deception into legitimateemails is a non-trivial task [9], [5], [29].

6 Discussion and Conclusions

We revisit two major error trends observed inthe evaluation of our word and character basedgeneration models. First, repetitions of tags andwords in the generated text body. A sample

sentence generated by the word-based languagemodel - “The corres ent ent ent Also ent , entI we can operating a gift to ensure, are thatextent will is a links are not ent” - demonstratessuch behavior. While, we hypothesized such abehavior at larger text lengths, the brittleness ina model which uses characters or words as unitsfor text generation can be observed for shorter textsequence generation as well. We believe, that thenature of the input on which the system is beingmodeled and the temperature (τ ) parameter usedfor sample generation play an important role in thisbehavior of the predictive model.

While the RNN model generated text with ‘some’malicious intent in them - the examples shownabove are just a few steps from being coherentand congruous. We designed an RNN based textgeneration system for generating targeted attackemails which is a challenging task in itself and anovel approach to the best of our knowledge. Theexamples generated however suffer from randomstrings and grammatical errors. We identify afew areas of improvement for the deep generationsystem - reduction of repetitive content as well asinclusion of more legitimate and phishing examplesfor analysis and model training. We would also liketo experiment with addition of topics and tags like‘bank account’, ‘paypal’, ‘password renewal’, etc.which may help generate more specific emails. Itwould be interesting to see how a generative RNNhandles topic based email generation problem.

Acknowledgements

This research was supported in part by NSF grantsCNS 1319212, DGE 1433817, DUE 1241772, andDUE 1356705. The study is also based upon work

supported in part by the U. S. Army ResearchLaboratory and the U. S. Army Research Officeunder contract/grant number W911NF-16-1-0422.

References

1. Bahnsen, A. C., Torroledo, I., Camacho, D.,& Villegas, S. (2018). DeepPhish: Simulatingmalicious AI. 2018 APWG Symposium on ElectronicCrime Research (eCrime), pp. 1–8.

2. Baki, S., Verma, R., Mukherjee, A., & Gnawali,O. (2017). Scaling and effectiveness of emailmasquerade attacks: Exploiting natural languagegeneration. Proceedings of the 2017 ACM onAsia Conference on Computer and CommunicationsSecurity, ACM, pp. 469–482.

3. Basnet, R. B., Mukkamala, S., & Sung, A. H.(2008). Detection of phishing attacks: A machinelearning approach. Soft Computing Applications inIndustry, Vol. 226, pp. 373–383.

4. Caputo, D. D., Pfleeger, S. L., Freeman, J. D.,& Johnson, M. E. (2014). Going spear phishing:Exploring embedded training and awareness. IEEESecurity & Privacy, Vol. 12, No. 1, pp. 28–38.

5. Carvalho, V. R. (2008). Modeling intention in email.Springer.

6. Chandrasekaran, M., Narayanan, K., & Upad-hyaya, S. (2006). Phishing email detection basedon structural properties. NYS Cyber SecurityConference, volume 3, pp. .

7. Chen, Y.-N. & Rudnicky, A. I. (2014). Two-stagestochastic email synthesizer. INLG, pp. 99–102.

8. Chen, Y.-N. & Rudnicky, A. I. (2014). Two-stagestochastic natural language generation for emailsynthesis by modeling sender style and topicstructure. INLG, pp. 152–156.

9. Das, A. & Verma, R. (2018). Automated emailgeneration for targeted attacks using naturallanguage. Proceedings of the Eleventh InternationalConference on Language Resources and Evalua-tion (LREC 2018), pp. .

10. David Campion (2018). Text Generationusing Bidirectional LSTM and Doc2Vecmodels. https://medium.com/@david.campion/

text-generation-using-bidirectional-lstm\

-and-doc2vec-models-1-3-8979eb65cb3a.Online.

11. Drake, C. E., Oliver, J. J., & Koontz, E. J. (2004).Anatomy of a phishing email. CEAS, pp. .

12. Enron Corpus (2015). Enron Email Dataset.https://www.cs.cmu.edu/~enron/enron_mail_

20150507.tar.gz. Online; accessed 13 February2018.

13. Feng, X., Liu, M., Liu, J., Qin, B., Sun, Y., &Liu, T. (2018). Topic-to-essay generation with neuralnetworks. IJCAI, pp. 4078–4084.

14. Ferreira, A. & Chilro, R. (2017). What tophish in a subject? International Conference onFinancial Cryptography and Data Security, Springer,pp. 597–609.

15. Ghazvininejad, M., Shi, X., Choi, Y., & Knight,K. (2016). Generating topical poetry. Proceedingsof the 2016 Conference on Empirical Methods inNatural Language Processing, pp. 1183–1191.

16. Giaretta, A. & Dragoni, N. (2017). Commu-nity targeted spam: A middle ground betweengeneral spam and spear phishing. arXiv preprintarXiv:1708.07342.

17. Goodfellow, I., Bengio, Y., Courville, A., &Bengio, Y. (2016). Deep learning, volume 1. MITpress Cambridge.

18. Graves, A. (2013). Generating sequenceswith recurrent neural networks. arXiv preprintarXiv:1308.0850.

19. Graves, A. & Schmidhuber, J. (2005). Framewisephoneme classification with bidirectional lstm andother neural network architectures. Neural networks: the official journal of the International NeuralNetwork Society, Vol. 18 5-6, pp. 602–10.

20. Guo, Z. L. (2003). Using mutual informationto identify new features for text documents ofvarious domains. Proceedings of the 17th PacificAsia Conference on Language, Information andComputation, pp. 372–379.

21. Henderson, M., Thomson, B., & Young, S. (2014).Word-based dialog state tracking with recurrentneural networks. Proceedings of the 15th AnnualMeeting of the Special Interest Group on Discourseand Dialogue (SIGDIAL), pp. 292–299.

22. Huber, M., Kowalski, S., Nohlberg, M., & Tjoa,S. (2009). Towards automating social engineeringusing social networking sites. Computational Sci-ence and Engineering, 2009. CSE’09. InternationalConference on, volume 3, IEEE, pp. 117–124.

https://medium.com/@david.campion/text-generation-using-bidirectional-lstm\-and-doc2vec-models-1-3-8979eb65cb3a





23. Jagfeld, G., Jenne, S., & Vu, N. T. (2018).Sequence-to-sequence models for data-to-text nat-ural language generation: Word-vs. character-based processing and output diversity. arXiv preprintarXiv:1810.04864.

24. Kiddon, C., Zettlemoyer, L., & Choi, Y. (2016).Globally coherent text generation with neural check-list models. Proceedings of the 2016 Conference onEmpirical Methods in Natural Language Processing,pp. 329–339.

25. Kingma, D. P. & Ba, J. (2014). Adam: Amethod for stochastic optimization. arXiv preprintarXiv:1412.6980.

26. Le, Q. & Mikolov, T. (2014). Distributed represen-tations of sentences and documents. Internationalconference on machine learning, pp. 1188–1196.

27. Li, J., Monroe, W., Ritter, A., Galley, M., Gao,J., & Jurafsky, D. (2016). Deep reinforcementlearning for dialogue generation. arXiv preprintarXiv:1606.01541.

28. Liu, C., He, S., Liu, K., & Zhao, J. (2018).Curriculum learning for natural answer generation.IJCAI, pp. 4223–4229.

29. Nohlberg, M. & Kowalski, S. (2008). The cycleof deception: a model of social engineeringattacks, defenses and victims. Second InternationalSymposium on Human Aspects of InformationSecurity and Assurance (HAISA 2008), Plymouth,UK, 8-9 July 2008, University of Plymouth, pp. 1–11.

30. Press, O., Bar, A., Bogin, B., Berant, J.,& Wolf, L. (2017). Language generation withrecurrent generative adversarial networks withoutpre-training. arXiv preprint arXiv:1706.01399.

31. Qian, Q., Huang, M., Zhao, H., Xu, J., & Zhu,X. (2018). Assigning personality/profile to a chattingmachine for coherent conversation generation.IJCAI, pp. 4279–4285.

32. Roberts, A., Engel, J., Raffel, C., Hawthorne,C., & Eck, D. (2018). A hierarchical latent vectormodel for learning long-term structure in music.arXiv preprint arXiv:1803.05428.

33. Seymour, J. & Tully, P. (2016). Weaponizing datascience for social engineering: Automated e2espear phishing on twitter. Black Hat USA, pp. 37.

34. Sheng, S., Holbrook, M., Kumaraguru, P., Cranor,L. F., & Downs, J. (2010). Who falls for phish?:a demographic analysis of phishing susceptibilityand effectiveness of interventions. Proceedings ofthe SIGCHI Conference on Human Factors inComputing Systems, ACM, pp. 373–382.

35. Shropshire, J. (2018). Natural language processingas a weapon. Proceedings of the 13th Pre-ICISWorkshop on Information Security and Privacy,volume 1, pp. .

36. Sidhaye, P. & Cheung, J. C. K. (2015). IndicativeTweet generation: An extractive summarizationproblem? Proceedings of the 2015 Conference onEmpirical Methods in Natural Language Processing,pp. 138–147.

37. Sommer, R. & Paxson, V. (2010). Outside theclosed world: On using machine learning fornetwork intrusion detection. 2010 IEEE Symposiumon Security and Privacy (SP), IEEE, pp. 305–316.

38. Sutskever, I., Martens, J., & Hinton, G. E. (2011).Generating text with recurrent neural networks.Proceedings of the 28th International Conference onMachine Learning (ICML-11), pp. 1017–1024.

39. Verma, R. & Hossain, N. (2013). Semantic featureselection for text with application to phishing emaildetection. International Conference on InformationSecurity and Cryptology, Springer, pp. 455–468.

40. Verma, R., Shashidhar, N., & Hossain, N. (2012).Detecting phishing emails the natural language way.European Symposium on Research in ComputerSecurity, Springer, pp. 824–841.

41. Verma, R., Shashidhar, N., & Hossain, N. (2012).Two-pronged phish snagging. Availability, Reliabilityand Security (ARES), 2012 Seventh InternationalConference on, IEEE, pp. 174–179.

42. Wiseman, S., Shieber, S. M., & Rush, A. M.(2018). Learning neural templates for text genera-tion. arXiv preprint arXiv:1808.10122.

43. Xie, S., Rastogi, R., & Chang, M. (2017). Deeppoetry: Word-level and character-level languagemodels for shakespearean sonnet generation.Technical report, Stanford University.

44. Yao, Y., Viswanath, B., Cryan, J., Zheng, H.,& Zhao, B. Y. (2017). Automated crowdturfingattacks and defenses in online review systems. arXivpreprint arXiv:1708.08151.

45. Yi, X., Sun, M., Li, R., & Yang, Z. (2018). Chinesepoetry generation with a working memory model.arXiv preprint arXiv:1809.04306.

46. Zhang, H., Lan, Y., Guo, J., Xu, J., & Cheng,X. (2018). Reinforcing coherence for sequence tosequence model in dialogue generation. IJCAI,

pp. 4567–4573.

modeling coherency in generated emails by …helped us prepare, process and evaluate our training...

Documents