htmlphish: enabling phishing web page detection by

9
HTMLPhish: Enabling Phishing Web Page Detection by Applying Deep Learning Techniques on HTML Analysis Chidimma Opara Teesside University Middlesbrough TS1 3BX, UK [email protected] Bo Wei Northumbria University Newcastle upon Tyne, UK [email protected] Yingke Chen Teesside University Middlesbrough, UK [email protected] ABSTRACT Recently, the development and implementation of phishing aacks require lile technical skills and costs. is uprising has led to an ever-growing number of phishing aacks on the World Wide Web. Consequently, proactive techniques to ght phishing aacks have become extremely necessary. In this paper, we propose HTMLPhish, a deep learning based data-driven end-to-end automatic phishing web page clas- sication approach. Specically, HTMLPhish receives the content of the HTML document of a web page and employs Convolutional Neural Networks (CNNs) to learn the seman- tic dependencies in the textual contents of the HTML. e CNNs learn appropriate feature representations from the HTML document embeddings without extensive manual fea- ture engineering. Furthermore, our proposed approach of the concatenation of the word and character embeddings allows our model to manage new features and ensure easy extrapolation to test data. We conduct comprehensive exper- iments on a dataset of more than 50,000 HTML documents that provides a distribution of phishing to benign web pages obtainable in the real-world that yields over 93% Accuracy and True Positive Rate. Also, HTMLPhish is a completely language-independent and client-side strategy which can, therefore, conduct web page phishing detection regardless of the textual language. KEYWORDS Phishing detection, Web pages, Classication model, Convo- lutional Neural Networks, HTML 1 INTRODUCTION e infamous phishing aack is a social engineering tech- nique that manipulates internet users into revealing private information that may be exploited for fraudulent purposes [1]. is form of cybercrime has recently become common because it is carried out with lile technical ability and sig- nicant cost [2]. e proliferation of phishing aacks is evident in the 46% increase in the number of phishing web- sites identied between October 2018 and March 2019 by the Anti-Phishing Working Group (APWG) [3]. Most phishing aacks are started by an unsuspecting Internet user merely clicking on a link in a phishing email message that leads to a bogus website. e impact of phishing aacks on individuals such as identity the, psychological, and nancial costs can be devastating. Problem Definition Recent research in phishing detection approaches has re- sulted in the rise of multiple technical methods such as aug- menting password logins [4], and multi-factor authentication [5]. However, these techniques are usually server-side sys- tems that require the Internet user to correspond with a remote service, which adds further delay in the communi- cation channel. Another popular phishing detection system that relies on a centralised architecture is the phishing black- list and whitelist methods [6]. A URL visited by an internet user will be compared with the URL in these lists in real- time. Although the list based methods tend to keep the false positive rate low, however, a signicant shortcoming is that the lists are not exhaustive, and they fail to detect zero-day phishing aacks. To mitigate these limitations, researchers have developed several anti-phishing techniques using ma- chine learning models as they are mostly client-side based and can generalise their predictions on unseen data. Machine learning-based anti-phishing techniques typi- cally follow specic approaches: (1) e required represen- tation of features is rstly extracted, then (2) a phishing detection machine learning model is trained using the fea- ture vectors. To extract the feature representation from the lexical and static components of a web page, the machine learning models rely on the assumption that the infrastruc- ture of phishing pages are dierent from legitimate pages. For example, in [7], phishing web pages are automatically detected based on handcraed features extracted from the URL, HTML content, network, and JavaScript of a web page. Furthermore, natural language processing techniques are currently used to extract specic features such as the num- ber of common phishing words, type of ngram, etc. from the components of a web page [8–10]. arXiv:1909.01135v3 [cs.CR] 15 May 2020

Upload: others

Post on 14-Nov-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

HTMLPhish: Enabling Phishing Web Page Detectionby Applying Deep Learning Techniques on HTML

AnalysisChidimma OparaTeesside University

Middlesbrough TS1 3BX, [email protected]

Bo WeiNorthumbria University

Newcastle upon Tyne, [email protected]

Yingke ChenTeesside UniversityMiddlesbrough, [email protected]

ABSTRACTRecently, the development and implementation of phishinga�acks require li�le technical skills and costs. �is uprisinghas led to an ever-growing number of phishing a�acks onthe World Wide Web. Consequently, proactive techniques to�ght phishing a�acks have become extremely necessary. Inthis paper, we propose HTMLPhish, a deep learning baseddata-driven end-to-end automatic phishing web page clas-si�cation approach. Speci�cally, HTMLPhish receives thecontent of the HTML document of a web page and employsConvolutional Neural Networks (CNNs) to learn the seman-tic dependencies in the textual contents of the HTML. �eCNNs learn appropriate feature representations from theHTML document embeddings without extensive manual fea-ture engineering. Furthermore, our proposed approach ofthe concatenation of the word and character embeddingsallows our model to manage new features and ensure easyextrapolation to test data. We conduct comprehensive exper-iments on a dataset of more than 50,000 HTML documentsthat provides a distribution of phishing to benign web pagesobtainable in the real-world that yields over 93% Accuracyand True Positive Rate. Also, HTMLPhish is a completelylanguage-independent and client-side strategy which can,therefore, conduct web page phishing detection regardlessof the textual language.

KEYWORDSPhishing detection, Web pages, Classi�cation model, Convo-lutional Neural Networks, HTML

1 INTRODUCTION�e infamous phishing a�ack is a social engineering tech-nique that manipulates internet users into revealing privateinformation that may be exploited for fraudulent purposes[1]. �is form of cybercrime has recently become commonbecause it is carried out with li�le technical ability and sig-ni�cant cost [2]. �e proliferation of phishing a�acks isevident in the 46% increase in the number of phishing web-sites identi�ed between October 2018 and March 2019 by the

Anti-Phishing Working Group (APWG) [3]. Most phishinga�acks are started by an unsuspecting Internet user merelyclicking on a link in a phishing email message that leads to abogus website. �e impact of phishing a�acks on individualssuch as identity the�, psychological, and �nancial costs canbe devastating.

Problem DefinitionRecent research in phishing detection approaches has re-sulted in the rise of multiple technical methods such as aug-menting password logins [4], and multi-factor authentication[5]. However, these techniques are usually server-side sys-tems that require the Internet user to correspond with aremote service, which adds further delay in the communi-cation channel. Another popular phishing detection systemthat relies on a centralised architecture is the phishing black-list and whitelist methods [6]. A URL visited by an internetuser will be compared with the URL in these lists in real-time. Although the list based methods tend to keep the falsepositive rate low, however, a signi�cant shortcoming is thatthe lists are not exhaustive, and they fail to detect zero-dayphishing a�acks. To mitigate these limitations, researchershave developed several anti-phishing techniques using ma-chine learning models as they are mostly client-side basedand can generalise their predictions on unseen data.

Machine learning-based anti-phishing techniques typi-cally follow speci�c approaches: (1) �e required represen-tation of features is �rstly extracted, then (2) a phishingdetection machine learning model is trained using the fea-ture vectors. To extract the feature representation from thelexical and static components of a web page, the machinelearning models rely on the assumption that the infrastruc-ture of phishing pages are di�erent from legitimate pages.For example, in [7], phishing web pages are automaticallydetected based on handcra�ed features extracted from theURL, HTML content, network, and JavaScript of a web page.Furthermore, natural language processing techniques arecurrently used to extract speci�c features such as the num-ber of common phishing words, type of ngram, etc. from thecomponents of a web page [8–10].

arX

iv:1

909.

0113

5v3

[cs

.CR

] 1

5 M

ay 2

020

While the above approaches have proven successful, theynevertheless are prone to several limitations, particularly inthe context of HTML analysis: i. inability to accommodateunseen features: As the accuracy of existing models dependson how comprehensive the feature set is and how imper-vious the feature set remains to future a�acks, they willbe unable to correctly detect new phishing web pages withevolved content and structure without a regular update ofthe feature set. ii. �ey require substantial manual featureengineering: Existing phishing detection machine learningmodels require specialised domain knowledge in order to as-certain the needed features suitable to each task (e.g., numberof white spaces in the HTML content, number of redirects,and iframes, etc.). �is is a tedious process, and these hand-cra�ed features are o�en targeted and bypassed in futurea�acks. It is also challenging to know the best features forone particular application.

To address the above issues, we propose HTMLPhish, adeep learning based data-driven end-to-end automatic phish-ing web page classi�cation approach. Speci�cally, HTML-Phish uses both the character and word embedding tech-niques to represent the features of each HTML document.�en Convolutional Neural Networks (CNNs) are employedto model the semantic dependencies.

�e following characteristics highlight the relevance ofHTMLPhish to web page phishing detection:

(1) HTMLPhish analyses HTML directly to help reserveuseful information. It also removes the arduous task requiredfor the manual feature engineering process.

(2) HTMLPhish takes into consideration all the elements ofan HTML document, such as text, hyperlinks, images, tables,and lists, when training the deep neural network model.

We experimentally demonstrate the signi�cance of char-acter and word embedding features of HTML contents indetecting phishing web pages. We then propose a state-of-the-art HTML phishing detection model, in which thecharacter and word embedding matrices are concatenatedbefore employing convolutions on the represented features.Our proposed approach ensures an adequate embedding ofnew feature vectors that enables straightforward extrapo-lation of the trained model to test data. Subsequently, weconduct extensive evaluations on a dataset of over 50,000HTML documents collected over two months. �is ensuresour evaluation se�ings reproduces real-world situations inwhich models are applied to data generated up to the presentpoint and applied to new data.

We summarise the main contributions of this paper asfollows:

• Di�erent from existing methods, our proposed model,HTMLPhish, to the best of our knowledge, is the �rstto use only the raw content of the HTML document

of a web page to train a deep neural network modelfor phishing detection. Manual feature engineer-ing is reduced as HTMLPhish learns the represen-tation in the features of the HTML document, andwe do not depend on any other complicated or spe-cialist features for the task. Our proposed approachtakes advantage of the word and character embed-ding matrix to present a phishing detection modelthat automatically accommodates new features andis therefore easily applied to test data.• We conduct extensive evaluations on a dataset of

more than 50, 000 HTML documents collected intwo months. �e distribution of the instances inour dataset is similar to the ratio of phishing andlegitimate web pages found in the real-world. �isensures that our evaluation metrics and results arerelevant to existing systems.• Furthermore, we carried out a longitudinal study

on the e�ciency HTMLPhish to infer the maximumretraining period, for which the accuracy of the sys-tem does not reduce. Our result only recorded aminimal 4% decrease in accuracy on the test data.�is con�rms that HTMLPhish remains reliable andtemporally robust over a long period.

We organised the remainder of the paper as follows: thenext section provides an overview of related works on pro-posed techniques of detecting phishing on web pages. Sec-tion 3 gives the prior knowledge on Convolutional NeuralNetworks, and Section 4 provides an in-depth description ofour proposed model. Section 5 elaborates on the dataset col-lection, while the detailed results on the evaluations of ourproposed model are found in Section 6. Finally, we concludeour paper in Section 7.

2 RELATEDWORKSIn this section, we address two most closely related topicsto our work: the phishing web page detection using featureengineering and the Deep Learning method (especially forNLP).

Feature Engineering for Phishing Web PageDetection�ese techniques extract speci�c features from a web pagesuch as JavaScript, HTML web page, URL, and network fea-tures. �ese are fed into machine learning algorithms tobuild a classi�cation model. �ese machine learning tech-niques di�er in the type of heuristics and number of featuresets used and the optimisation algorithm applied to the ma-chine learning algorithm. �ese techniques are based on thefact that both the phishing and benign web pages have a

2

di�erent content distribution of extracted features. �e ac-curacy of heuristics and machine learning-based techniquescritically depends on the type of features extracted, and themachine learning algorithm applied. Many phishing detec-tion techniques have been built on di�erent proposed featuresets.

Varshney et al [11] proposed LPD, a client-side based webpage phishing detection mechanism. �e strings from theURL and page title from a speci�ed web page is extractedand searched on the Google search engine. If there is amatch between the domain names of the top T search resultsand the domain name of the speci�ed URL, the web page isconsidered to be legitimate. �e result from their evaluationsgave a true positive rate of 99.5%.

Smadi et al. [12] proposed a neural network model thatcan adapt to the dynamic nature of phishing emails usingreinforcement learning. �e proposed model can handlezero-day phishing a�acks and also mitigate the problem ofa limited dataset using an updated o�ine database. �eirexperiment yielded a high accuracy of 98.63% on ��y featuresextracted from a dataset of 12,266 emails.

�e selection of features from various web page elementscan be an expensive process from security risk and techno-logical workload angle. For example, it can be prolongedand somewhat problematic to extract speci�c feature sets.Besides, it needs specialist domain expertise to de�ne whichfeatures are essential.

Deep LearningDue to its performance in many applications, Deep Learn-ing has a�racted increased interest in recent years [13–15].�e core concept is to learn the feature representation fromunprocessed data instantaneously without any manual fea-ture engineering. Under this premise, we want to use DeepLearning to detect phishing HTML content by directly learn-ing how features from the raw HTML string is representedinstead of using specialist features that are manually engi-neered.

As we want to train our Deep Learning networks usingtextual features, it is, therefore, essential to discuss NLPas it relates to Deep Learning. Deep learning techniqueshave been successful in a lot of NLP tasks, for example, indocument classi�cation [16], machine translation [17], etc.Recurrent neural networks (e.g., LSTM [18]) have been ex-tensively applied due to their ability to exhibit temporalbehaviour and capture sequential data. However, CNN hasbecome brilliant substitutes for LSTMs, especially showingexcellent performance in text classi�cation and sentimentanalysis as CNN learns to recognize pa�erns across space[19].

Very few a�empts have been made to use Deep Learningto detect phishing web pages using web page components.Bahnsen et al. [20] proposed a phishing classifying schemethat used features of the URLs of a web page as input andimplemented the model on an LSTM network. �e resultsyielded gave an accuracy of 98.7% accuracy on a corpus of 2million phishing and legitimate URLs. �e authors of [21]proposed a CNN based model which combines the outputsof two Convolutional layers to detect malicious URLs.

However, our review did not �nd any existing approachthat detects malicious phishing web pages using only HTMLdocuments on Deep Learning. HTMLPhish learns the se-mantic information present only in the character and wordsin an HTML document to determine the maliciousness ofthe web page. Our thorough analysis shows that phishingweb pages can be detected using only their HTML documentcontent.

3 PRELIMINARIESWe de�ne the problem of detecting phishing web pages usingtheir HTML content as a binary classi�cation task for predic-tion of two classes: legitimate or phishing. Given a datasetwith T HTML documents {(html1,y1), ..., (htmlT,yT)}, wherehtmlt for t = 1, . . . , T represents an HTML document ,while yt ∈ {0, 1} is its label. yt = 1 corresponds to a phish-ing HTML document while yt = 0 is a legitimate HTMLdocument.

Deep Neural Network for Phishing HTML DocumentDetection�e deep neural network that underlies HTMLPhish is a Con-volutional Neural Network (CNN). To detail a basic CNN forHTML document classi�cation, an HTML document is com-prised of a string of characters or words. Our goal is to obtainan embedding matrix html →s ϵRmaxlen×d , in a way that sis made up of sets of adjoining inputs si ∈ (1, 2, ...,maxlen)in a string, in which the input can be individual characters orwords from the HTML document. Each input is subsequentlytransformed in an embedding siϵR

d is the ith column of Sand the d-dimension is the vector size which is automati-cally initialized and learnt together with the remainder ofthe model.

In this paper, the embedding matrix was automaticallyinitialised, and for parallelisation, all sequences were paddedto the same length maxlen.

�e CNN performs a convolution operation ⊗ over sϵRmaxlen×d

using:ci = f (M ⊗ si :i+n−1 + bi )

followed by a non-linear activation where bi is the bias,M is the convolving �lter and n is the kernel size of theconvolution operation. A�er the convolution, a pooling step

3

is applied (which in our model is the Max Pooling) in orderto decrease the feature dimension and determine the mostimportant features.

�e CNN is capable of exploiting the temporal relationof n kernel size in its input using the �lter M to convolveon each segment of n kernel size. A CNN model typicallycontains several sets of �lters with di�erent kernel sizes (n).�ose are the model hyperparameters that are set by theuser. In this deep neural network, the convolution layer isusually followed by a Pooling layer. �e features from thePooling layer are then passed to dense layers to perform therequired classi�cation. �e entire network is then trained byusing backpropagation.

Note: In order to di�erentiate our state-of-the-art modelfrom the baseline models, for the rest of this paper, wewill use the term HTMLPhish-Full to indicate HTMLPhishtrained with the proposed model unless otherwise stated,while HTMLPhish-Character and HTMLPhish-Word repre-sent the deep neural network model using only the characterand word embedding respectively.

4 THE PROPOSED MODELIn this section, we elaborate on the architecture of our pro-posed deep neural network model HTMLPhish-Full. �enetwork architecture seen in Figure 3 shows HTMLPhish-Full has two input layers. �e �rst input layer processes theraw HTML document into an embedding matrix made upof character-level feature representations, while the secondinput layer does the same with words. �ese two branchesare concatenated in a dense layer called the Concatenationlayer. �erefore, the embedding matrix in this model is thesum of the character-level embedding matrix and the wordembedding matrix Cem +Wem where Cem →c ϵRmaxlen1×d ,andWem →w ϵRmaxlen2×d . �e features in the Concatena-tion layer allows the preservation of the original informationin the HTML content. In the concatenation layer, the contentof both embedding layers are put alongside each other toyield a 3 dimensional layer [Cem +Wem →(None, 180, 100)+ (None, 2000, 100) = (None, 2180, 100)].

To generate the character-level embedding matrix Cem ,the model learns an embedding, which takes the characteris-tics of the characters in an HTML document. To do so, allthe distinct characters, including punctuation marks in thecorpus, are listed. We obtained 167 unique characters. We setthe length of the sequencesmaxlen1 = 180 characters. EveryHTML document with strings greater than 180 charactersis cut from the 180th character, and any HTML documentwith characters smaller than 180 characters would be paddedup to 180 with zeroes. Before each character in our workis embedded into a d-dimensional vector, we conduct a to-kenization on the characters in the HTML document and

segment the characters into tokens as shown in Figure 1. Anindex is associated with each token before being applied toa d-dimensional character embedding vector where d is setat 100, which is automatically initialised and learnt togetherwith the remainder of the model. To facilitate its implemen-tation, each HTML document html is transformed into amatrix, html →c ϵRmaxlen1×d , where d = 100 and maxlen1= 180.

For the word embedding matrix Wem , �rstly, the rawHTML document is processed into word-level representa-tions by the word embedding layer. To achieve this, all thedi�erent words in the HTML document of the training cor-pus are listed using the following approach: An HTML docu-ment is split into individual words while treating all punctu-ation characters as separate tokens. For example, as shownin Figure 1, <!DOCTYPE html>, will be split into [′<′, ′!′,′DOCTYPE ′, ′html ′]. We surmise that punctuation marksprovide important information bene�ts for phishing HTMLdocument detection since punctuation marks are more preva-lent and useful in the context of HTML documents than or-dinary languages. HTML contains a sequence of markuptags that are used to frame the elements on a website. Tagscontain keywords and punctuation marks that de�ne theforma�ing and display of the content on the Web browser.�e listed unique words are used to create a dictionary whereevery word becomes a feature. We obtained about 321,009unique words in our dataset. We also padded the HTML doc-uments to make the lengths of the HTML documents uniformin terms of number of words (maxlen2 = 2000). Each uniqueword is then embedded into a d-dimensional vector, where dis set at 100, which is automatically initialised and learnedtogether with the remainder of the model. All the HTMLdocuments are converted to their respective matrix repre-sentation (maxlen2 ×d), on which the CNN is applied whered = 100 andmaxlen2 = 2000. Figure 1 shows an overview ofthe character and word embedding layer.

Word EmbeddingCharacter Embedding

< ! D O < !DOCTYPE

html

1 54 5 83 1 54 4 6

0.02 0.15 0.09 0.880.58 0.69 0.17 0.430.27 0.65 0.41 0.56

0.01 0.67 0.78 0.440.98 0.34 0.42 0.590.81 0.26 0.03 0.11

<!DOCTYPE html><htmlclass="no_js"id="facebook"

Tokens

Sequence of Integers

Embedding Matrix

Figure 1: Con�guration of the Embedding Layer

4

Table 1: HTML Documents Used in this Paper

Dataset D1 D2Date generated 11 - 18 Nov, 2018 10 -17 Jan, 2019Legitimate Web Pages 23,000 24,000Phishing Web pages 2,300 2,400Total 25,300 26,400

We can now introduce Convolutionary layers using theHTML document matrix (for all the HTML documents st∀t =1, ...,T ) as the corpus. We applied 32 Convolutionary �ltersMϵRd×n where n 8. �e Max-Pooling layer whose featuresare then passed to a 10 unit dense layer comes a�er theConvolutionary �lters. �e dense layer, which is regularisedby dropout, �nally connects to a Sigmoid layer. �en usingthe ADAM optimisation algorithm [22], we train the modelthrough backpropagation.

Baseline Models�e baseline models, HTMLPhish-Character and HTML-Phish-Word, whose architectures are detailed in Figure 3, areCNN models trained either on character-level embeddings orword-level embeddings, respectively. �e embedding matri-ces described above are applied to 32 Convolutionary �ltersMϵRd×n where n 8. �e next layer a�er the Convolution-ary �lters is the Max-Pooling layer, whose features are thenpassed to a 10 unit dense layer. �e Dense layer, which alsois regularised by dropout, �nally connects to a Sigmoid layer.Also, the models are trained through backpropagation usingthe ADAM optimisation algorithm.

5 DATASETData collection plays an essential role in phishing web pagedetection. In our approach, we collated HTML documentsusing a web crawler. We used the Beautiful Soup [23] libraryin Python to create a parser that dynamically extracted theHTML document from each �nal landing page. We chose touse Beautiful Soup for the following reasons:

(1) it has functional versatility and speed in parsing HTMLcontents, and

(2) Beautiful Soup does not correct errors when analysingthe HTML Document Object Model (DOM). �e HTML doc-uments in our corpus include all the contents of an HTMLdocument, such as text, hyperlinks, images, tables, lists, etc.Figure 2 shows an overview of the data collection stage.

Data CollectionSince phishing campaigns follow temporal trends in the com-position of web pages, the earliest data obtained should al-ways be used for training and the most recent data collectedfor testing [24]. Di�erent phishing pages created during

the same time may probably have the same infrastructure.�is could exaggerate an over-trained classi�cation model’spredictive output. To ensure our evaluation se�ings repro-duces real-world situations in which models are applied ondata generated up to the present point and applied on newweb pages, we collected a dataset of HTML documents fromphishing and legitimate web pages over 60 days.

Also, to ensure the deployability of our model to real-wordsystems, our data set is required to provide a distribution ofphishing to benign web pages obtainable on the Internet inthe real-world (≈ 10/100) [25, 26]. Given that when a bal-anced dataset (1/1), is used, the results can yield a baselineerror [27]. Consequently, our training dataset D1 consistingof HTML documents from 23,000 legitimate URLs and 2,300phishing URLs was collected between 11 November 2018 to18 November 2018. D1 dataset was used to train and vali-date the three di�erent variants of our model (HTMLPhish-Character, HTMLPhish-Word, and HTMLPhish-Full). From10 January 2019 to 17 January 2019, testing data set D2 con-sisting of HTML document from 24,000 legitimate URLs and2,400 phishing URLs were generated.

Note that D1 ∩ D2 = ∅. Also, our testing dataset D2,is slightly larger than our training dataset D1. �is is be-cause learning with fewer data, and having decent tests ona broader test data means that the detection technique isgeneralised. �is ensures that the features and model ofclassi�cation include speci�c features from legitimate andphishing web pages and that the approach can be appliedto the vast number of online Web pages. In total, our cor-pus was made up of 47,000 legitimate HTML documents and4,700 phishing HTML documents, as shown in Table 1.

�e legitimate URLs were drawn from Alexa.com’s top500,000 domains, while the phishing URLs were gatheredfrom continuously monitoring Phishtank.com. �e webpages in our dataset were wri�en in di�erent languages.�erefore, this does not limit our model to only detectingEnglish web pages. We manually sanitised our corpus toensure no replicas or web pages that are pointing to emptycontent. Alexa.com o�ers a top list of working websites thatinternet users frequently visit, so it is an excellent source tobe used for our aim.

6 EVALUATION OF HTMLPHISH VARIANTSExperimental SetupTable 2 details the selected parameters we found gave the bestperformance on our dataset bearing in mind the unavoidablehardware limitation for our proposed HTMLPhish variants:

a. HTMLPhish-Characterb. HTMLPhish-Wordc. HTMLPhish-Full

5

Data Collection

Tokenization

Length padding

Embedding

Convolutional Filters

Sigmoid Layer

Dense Layer

OutputLabel

User Web page<DOCTYPE html><html class="no-js" dir="ltr" lang="en" xmlns="http://www.w3.org/1999/xhtml">

Extract HTML

Preprocessing

Deep Neural Network

Figure 2: A Schematic Overview of the Stages Involved in Our Proposed Model

Input HTML Document

Character EmbeddingNo. Unique Charc X Maxlen1 X

100

Word EmbeddingNo. Unique Words X Maxlen2 X

100

32 Convolutional FiltersWith 8 Kernel Sizes

32 Convolutional FiltersWith 8 Kernel Sizes

Max Pooling Max Pooling

Dense Layer (10 Units)Activation = ReLU

Dense Layer (10 Units)Activation = ReLU

Input HTML Document

HTML Document

ClassificationSigmoid Layer

HTML Document

ClassificationSigmoid Layer

HTMLPhish-Character HTMLPhish-Word

Character EmbeddingNo. Unique Charc X Maxlen1

X 100

Input HTML Document

Cem Embedding

Concatenated Character and Word representations

Max Pooling

Dense Layer (10 Units)Activation = ReLU

HTML Document

ClassificationSigmoid Layer

Input HTML Document

Word EmbeddingNo. Unique Words X Maxlen2

X 100

Wem Embedding

32 Convolutional FiltersWith 8 Kernel Sizes

HTMLPhish-Full

Figure 3: �e Overall Architecture of HTMLPhish Variants

6

Table 2: HTMLPhish-Full Deep Neural Network

Layers Values ActivationEmbedding Dimension = 100 -Convolution Filter = 32, Filter Size =

8ReLU

Max Pooling Pool Size = 2 -Dense1 No. of Neurons = 10,

Dropout = 0.5ReLU

Dense2 No. of Neurons = 1 SigmoidTotal Number ofTrainable Parameters

412,388,597 -

�e three CNN models were implemented in Python 3.5on a Tensor�ow backend and a learning rate of 0.0015 in theAdam optimizer [22]. �e batch size for training and testingthe model were adjusted to 20.

All HTMLPhish and baseline experiments were conductedon an HP desktop with Intel(R) Core CPU, Nvidia �adroP600 GPU, and CUDA 9.0 toolkit installed.

Evaluation MetricsBecause of the severely imbalanced nature of our dataset,we evaluated the performance of our models in terms of theArea under the ROC Curve (AUC). We also used the receiveroperating characteristic (ROC) curve in our evaluation. �eROC curve is a probability curve, while the AUC depictshow much the model can distinguish between two classes,which for our model is - legitimate or phishing. �e higherthe AUC value, the be�er the performance of the model.�e ROC curve is plo�ed with the true positive rate (TPR)against the false positive rate (FPR) where TPR = (T P )

(T P+FN )and FPR = (F P )

(T N+F P ) . Where TP, FP, TN, and FN stand for thenumbers of True Positives, False Positives, True Negatives,and False Negatives, respectively.

Additionally, we employed the precision, True PositiveRate, and F-1 score metrics to evaluate the performance ofHTMLPhish and the baseline models. �e True Positive Ratecomputes the ratio of phishing HTML documents that aredetected by the models. In contrast, the precision metricscompute the ratio of detected phishing HTML documentsthat are actual phishes to the total number of detected phish-ing HTML documents.

Overall ResultTo record the performance of HTMLPhish-Full and the base-line models on the D1 dataset, we split the dataset into 80%for training, 10% for validation, and 10% for testing. Also, tak-ing cognizance of how our data is severely imbalanced, weensured we manually shu�ed the datasets before training.

�e ROC curves of HTMLPhish and its variants are shownin Figure 4. From the result detailed in Table 3, in gen-eral, HTMLPhish-Full signi�cantly outperforms the othertwo variants: HTMLPhish-Character, and HTMLPhish-Word.While HTMLPhish-Character and HTMLPhish-Word havesimilar performances, HTMLPhish-Full takes advantage ofthe strengths of both and produces more consistently be�erresults. Also, HTMLPhish-Full o�ered a signi�cant jumpin AUC over the other variants, while HTMLPhish-Wordperforms slightly worse amongst the three.

On the D1 dataset, HTMLPhish-Full provided a 98% accu-racy and 2% False Positive Rate. �e minimal False PositiveRates indicates the ratio of legitimate web pages, which areincorrectly identi�ed as a phish. �is is helpful when themodel will be deployed in real-world scenarios as users willnot be inappropriately blocked from accessing legitimateweb pages.

Considering the computational complexity of HTMLPhish-Full, it can be seen that on a dataset of over 25,000 HTMLdocuments, HTMLPhish-Full can be speedily trained within7 minutes. Once trained, HTMLPhish-Full can evaluate anHTML document in 1.4 seconds.

Comparison with State-Of-The-Art TechniquesWe compared HTMLPhish-Full with the methodology, speed,and performance of existing state-of-the-art models in [20]and [28]. [28] is a Deep Neural Network with multiple layersof CNNs that takes as input word tokens from a URL todetermine the maliciousness of the associated web page. Onthe other hand, [20] takes as input the character sequenceof a URL and models its sequential dependencies using Longshort-term memory (LSTM) neural networks to classify aURL as phishing or benign. We applied these techniques tothe HTML documents in the D1 dataset and also tested themon the D2 dataset.

From the result detailed in Table 3 and Table 4, HTMLPhish-Full provides be�er precision, recall and comparable accu-racy against the existing state-of-the-art models. �e perfor-mance of HTMLPhish-Word and [28] can be a�ributed to thefact that it is trained on a de�nite dictionary of words fromthe training data. �erefore it will be unable to obtain usefulembeddings for new words in the test data. HTMLPhish-Character and [20] perform be�er with respect to the AUCmetric because the individual character embedding CNN canlearn structural pa�erns in the HTML document and can alsoobtain feature representations for new words. �is makesit easy to be applied to the test data. In addition, due to thelimited number of characters, the scale of the CNN model us-ing the individual embedding character remains �xed whencompared to word-based model sizes. However, CNN modelsbuilt with individual character embeddings cannot exploit

7

Table 3: Result of HTMLPhish and Baseline Evaluations on the D1 dataset

Models Accuracy Precision True Positive Rates F-1 Score AUC Training timeHTMLPhish-Full 0.98 0.97 0.98 0.97 0.93 6.75 minsHTMLPhish-Word 0.94 0.93 0.94 0.93 0.88 10 minsHTMLPhish-Character 0.95 0.92 0.95 0.94 0.90 3.5 mins[28] 0.97 0.96 0.97 0.96 0.93 5.25 mins[20] 0.95 0.94 0.95 0.94 0.91 18 mins

Table 4: Result of HTMLPhish and Baseline Evaluations on the D2 dataset

Models Accuracy Precision True Positive Rates F-1 Score AUC Testing timeHTMLPhish-Full 0.93 0.92 0.93 0.91 0.88 9 secondsHTMLPhish-Word 0.90 0.87 0.91 0.88 0.73 107 secondsHTMLPhish-Character 0.91 0.89 0.91 0.89 0.77 7 seconds[28] 0.91 0.84 0.91 0.87 0.73 15 seconds[20] 0.90 0.90 0.92 0.90 0.78 112 seconds

structural information available in long sequences in theHTML document. It also disregards word borders and makesit challenging to di�erentiate special characters in the data.

Furthermore, CNN’s using only character level embeddingstruggles to di�erentiate information for scenarios wherephishing HTML documents try to imitate benign HTML doc-uments through small modi�cations to one or few words inthe HTML document[29]. �is is because the Convolutional�lters will likely yield similar output from a sequence of char-acters with a similar spelling. �erefore, CNNs using onlycharacter embeddings are not enough to obtain structuralinformation from the HTML document in detail. �at is thereason word embeddings must be taken into account. Con-sequently, HTMLPhish-Full takes advantage of both wordand character embedding matrices to accommodate unseenwords in the test data, and therefore yield a be�er result thanthe other variants and baseline models.

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

True

pos

itive

rate

ROC Curve

Word + Character Embedding AUC = 0.88Character Embedding AUC = 0.77Word Embedding AUC = 0.73

Figure 4: �e ROC Curve of HTMLPhish Variants

Temporal Resilience�e techniques for implementing a phishing web page iscontinuously evolving due to emerging technology applica-tions for designing phishing web pages. �e evaluation ofthe resilience of this evolution is paramount for a phishingweb page detection technique. In this paper, we applied thelongitudinal study [30] by evaluating the accuracy of theHTMLPhish-Full using freshly collected data. �is study en-abled us to infer a maximum retraining period, for which theaccuracy of the system does not reduce. For a security sup-plier deploying HTMLPhish-Full in the wild, the retrainingtime frame can provide an approximate cost of maintenance.

Using the evaluation metrics detailed above, we comparedthe accuracy of HTMLPhish variants and baseline models onthe training data D1 with its accuracy when applied to thetest data D2 without retraining the model. From the resultsin Table 4, HTMLPhish-Full provided a 98% accuracy on thetraining dataset while yielding a 93% accuracy on the testdataset. �e result of our longitudinal study demonstratesthe readiness of HTMLPhish-Full for real-world deployment.HTMLPhish-Full will remain temporally robust, and will notneed retraining within at least two months.

7 CONCLUSIONIn this paper, we proposed HTMLPhish, a deep learningbased data-driven end-to-end automatic phishing web pageclassi�cation approach. HTMLPhish receives the HTML con-tent of a web page as input and applies CNNs to learn thesemantic dependencies in both the characters and words inthe HTML document in a jointly optimized network. Fur-thermore, we applied convolutions on a concatenation ofthe matrix of character and word embeddings in order toensure the e�ective embedding of new words in the testHTML documents. Our approach can learn context features

8

from HTML documents without requiring extensive manualfeature engineering.

We evaluated our model using a comprehensive datasetof HTML contents presented in a real-world distribution.HTMLPhish provided a high precision rate, showing a tem-porally stable result even when it was trained two monthsbefore being applied to a test dataset.

�e future work is to compare our model to feature engineering-based models that extract features only from the HTML doc-ument. Also, we intend to implement our model as a browserextension. �is will enable HTMLPhish to recognise phish-ing websites in real-time.

ACKNOWLEDGMENT�e authors acknowledge the Petroleum Technology Devel-opment Fund (PTDF), Nigeria for the funding and supportprovided for this work.

REFERENCES[1] J. Lopez and J. E. Rubio, “Access control for cyber-physical systems

interconnected to the cloud,” Computer Networks, vol. 134, pp. 46–54,2018.

[2] O. K. Sahingoz, E. Buber, O. Demir, and B. Diri, “Machine learningbased phishing detection from urls,” Expert Systems with Applications,vol. 117, pp. 345–357, 2019.

[3] APWG, “Phishing activity trends report, 1st quarter 2018,” Tech. Rep.,2018.

[4] D. Cha�araj, M. Sarma, and A. K. Das, “A new two-server authentica-tion and key agreement protocol for accessing secure cloud services,”Computer Networks, vol. 131, pp. 144–164, 2018.

[5] T. Acar, M. Belenkiy, and A. Kupcu, “Single password authentication,”Computer Networks, vol. 57, no. 13, pp. 2597–2614, 2013.

[6] “Google safe browsing,” h�p://code.google.com/apis/safebrowsing/,accessed: 2019-09-30.

[7] C. Amrutkar, Y. S. Kim, and P. Traynor, “Detecting mobile maliciouswebpages in real time,” IEEE Transactions on Mobile Computing, no. 8,pp. 2184–2197, 2017.

[8] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521,no. 7553, p. 436, 2015.

[9] C. N. Gutierrez, T. Kim, R. Della Corte, J. Avery, D. Goldwasser,M. Cinque, and S. Bagchi, “Learning from the ones that got away:Detecting new forms of phishing a�acks,” IEEE Transactions on De-pendable and Secure Computing, vol. 15, no. 6, pp. 988–1001, 2018.

[10] E. Buber, B. Dırı, and O. K. Sahingoz, “Detecting phishing a�acks fromurl by using nlp techniques,” in International Conference on ComputerScience and Engineering (UBMK), 2017. IEEE, 2017, pp. 337–342.

[11] G. Varshney, M. Misra, and P. K. Atrey, “A phish detector using light-weight search features,” Computers & Security, vol. 62, pp. 213–228,2016.

[12] S. Smadi, N. Aslam, and L. Zhang, “Detection of online phishingemail using dynamic evolving neural network based on reinforcementlearning,” Decision Support Systems, vol. 107, pp. 88–102, 2018.

[13] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press,2016.

[14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for imagerecognition,” in Proceedings of the IEEE conference on computer visionand pa�ern recognition, 2016, pp. 770–778.

[15] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press,2016, h�p://www.deeplearningbook.org.

[16] Y. Kim, “Convolutional neural networks for sentence classi�cation,”arXiv preprint arXiv:1408.5882, 2014.

[17] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares,H. Schwenk, and Y. Bengio, “Learning phrase representations usingrnn encoder-decoder for statistical machine translation,” arXiv preprintarXiv:1406.1078, 2014.

[18] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neuralcomputation, vol. 9, no. 8, pp. 1735–1780, 1997.

[19] W. Yin, K. Kann, M. Yu, and H. Schutze, “Comparative study of cnn andrnn for natural language processing,” arXiv preprint arXiv:1702.01923,2017.

[20] A. C. Bahnsen, E. C. Bohorquez, S. Villegas, J. Vargas, and F. A.Gonzalez, “Classifying phishing urls using recurrent neural networks,”in Electronic Crime Research (eCrime), 2017 APWG Symposium on.IEEE, 2017, pp. 1–8.

[21] H. Le, Q. Pham, D. Sahoo, and S. C. Hoi, “Urlnet: Learning a urlrepresentation with deep learning for malicious url detection,” arXivpreprint arXiv:1802.03162, 2018.

[22] D. Kingma and J. Ba, “Adam: a method for stochastic optimization(2014),” arXiv preprint arXiv:1412.6980, vol. 15, 2015.

[23] L. Richardson, Beautiful Soup, 4th ed., 2017. [Online]. Available:h�ps://www.crummy.com/so�ware/BeautifulSoup.

[24] S. Marchal and N. Asokan, “On designing and evaluating phishingwebpage detection techniques for the real world,” in 11th {USENIX}Workshop on Cyber Security Experimentation and Test ({CSET} 18),2018.

[25] C. Whi�aker, B. Ryner, and M. Nazif, “Large-scale automatic classi�-cation of phishing pages.” in NDSS, vol. 10, 2010, p. 2010.

[26] Y. Zhang, J. I. Hong, and L. F. Cranor, “Cantina: a content-basedapproach to detecting phishing web sites,” in Proceedings of the 16thinternational conference on World Wide Web. ACM, 2007, pp. 639–648.

[27] E. Borgida and N. Brekke, “�e base rate fallacy in a�ribution andprediction,” New directions in a�ribution research, vol. 3, pp. 63–95,1981.

[28] B. Wei, R. A. Hamad, L. Yang, X. He, H. Wang, B. Gao, and W. L.Woo, “A deep-learning-driven light-weight phishing detection sensor,”Sensors, vol. 19, no. 19, p. 4258, 2019.

[29] W. Chu, B. B. Zhu, F. Xue, X. Guan, and Z. Cai, “Protect sensitive sitesfrom phishing a�acks using features extractable from inaccessiblephishing urls,” in 2013 IEEE International Conference on Communica-tions (ICC). IEEE, 2013, pp. 1990–1994.

[30] S. Marchal, G. Armano, T. Grondahl, K. Saari, N. Singh, and N. Asokan,“O�-the-hook: An e�cient and usable client-side phishing preventionapplication,” IEEE Transactions on Computers, vol. 66, no. 10, pp. 1717–1733, 2017.

9