
Unsupervised website visitor segmentation based on Convolutional Neural Networks and k-means

submitted in partial fulfillment for the degree of master of science

Dimitar Dimitrov
12239496

master information studies
data science

faculty of science
university of amsterdam

2019-07-05

              Internal Supervisor   External Supervisor   3rd Supervisor
Title, Name   Chang Li              Michael Metternich    Dr Maarten Marx
Affiliation   UvA, FNWI, IvI        Company Supervisor    UvA, FNWI, IvI
Email         [email protected]    [email protected]   [email protected]


Abstract

The digital era has made it possible for the end user to purchase almost everything on the Internet. This led to the growth of the e-commerce industry, which in its turn pushed companies to search for ways to attract more customers. One such approach is tailoring the content that is presented to users, known as Content Management. The next step is to optimize the content based on user segments. Currently, those segments are created manually. In this research, however, we aim at clustering users based on keywords from the URLs of the pages they have visited. The keywords were transformed via a Convolutional Neural Network (CNN) before the actual clustering. In doing so, we compared our results with a similar research [40], previously done within BloomReach. In addition, by evaluating our CNN approach on an open dataset, we were able to compare our results with the ones reported by Jiaming Xu et al. [42], whose work laid the foundation of this research.

Keywords

Web Mining, Clustering, Classification, Text Analysis, Convolutional Neural Networks

1 Introduction

Although the digital era has changed a lot of domains, in this paper we focus on e-commerce. E-commerce represents the act of buying or selling goods through the World Wide Web (WWW) [19]. As suggested by R. Cooley et al. [6], a lot of companies rely on the internet to conduct and expand their business. To support this statement, according to [36], in 2019 e-commerce is expected to be responsible for over three trillion dollars in sales. This leads us to the conclusion that shopping is no longer tied to the physical shop, but is rather moving to the web. Bearing all this in mind, we can understand how important it is to make the online experience of the user as pleasant as possible. One possible improvement is to allow consumers to find what they need before they even know it. Although far-fetched, this is not a new practice; an example is an article in Forbes from 2012 [12]. In this article the highlights are taken by the retail company Target, who were targeting their customers based on patterns generated from their data. However, this should be done with caution, as it might frighten and drive visitors away.

By living in a tech-advanced world we generate bits of data with each action we take online. Having a closer look at one of the marketing theories, the Consumer Decision Making Process [39], we can see that it consists of five different steps, where the actual purchase is only the fourth of them. The first three steps have to do with the realization of the individual that he or she needs a certain product, followed by the search for similar items and their evaluation. While performing those three steps online, the user is generating their so-called digital identity by leaving footprints, such as a search history or a visit to a certain category page. The digital identity can then be completed with the actual realization of a purchase.

By collecting the digital identity of their visitors, usually in the form of log files, companies can segment them into groups. The groups can be based on, for example, a certain product group they have searched for. With those segments the business can target visitors in a more personal way, for instance by improving and adapting the content they see online, known as content management, thus improving their digital experience.

This thesis focuses on the extraction of valuable information from the online journey of the visitors. The project takes place in the digital company BloomReach. The company can be labelled a technology provider, which aims at improving the experience of the end online user and their relationship with the business. To be more specific, the research is built around one of BloomReach's products, the Experience Manager. The product intends to give more power to the company by allowing it to analyze and optimize content based on the audience. The content optimization is done based on user segments, which are created manually by administrators. We aim at automating the segment creation/suggestion process. As mentioned earlier, one way to store the digital footprint of the user is log files, which are also the type of data used in this thesis. However, log files contain only the visited URLs, without any label or class specification - simply put, raw data, whether that stands only for URLs or some extra information. We could, however, extract insightful information through text analysis. This information can be used as input for different algorithms or Neural Networks; however, due to the lack of labels or class specifics, the approach must be unsupervised. An unsupervised algorithm stands for an algorithm capable of learning from a dataset without any labels and capable of finding patterns in it. [16]

1.1 Research Questions

In this work we investigate the application of Neural Networks, more specifically Convolutional Neural Networks (CNNs), for unsupervised clustering of textual data extracted from log files. The reason for focusing specifically on CNNs will be covered in the following sections.

However, our main motivation has to do with evaluating the work of Jiaming Xu et al. [42], which covers an interesting approach of training a CNN in an unsupervised manner. The output of the network is then used for clustering the input. Motivated by this, as well as an in-house research, which will be described in Related Work, the main research question of this paper is:

RQ1: To what extent can the use of CNN outperform K-modes clustering in unsupervised user segmentation, based on keywords extracted from the online journey?

As the topic can quickly expand and exceed the time limitation of this project, and to improve evaluation, the work was split in two parts, each of which has its own set of sub-questions.


Part 1: CNN

• RQ2: How is our implementation of the work of Jiaming Xu et al. performing compared to the results reported by them?

Part 2: Available Data in-house

• RQ3: How can we evaluate the performance and results with the data in-house?

• RQ4: Does it make sense to scrape the visited page and use keywords based on the scraped content?

• RQ5: To what extent is the use of scraped content affecting the cluster performance and structure?

We will cover related work in chapter 2, while chapter 3 will introduce the methodology. Finally, chapters 4 and 5 will focus on evaluating the approach and drawing conclusions. Appendix A provides extended evaluation.

2 Related Work

2.1 Web Mining

Part of the work of R. Cooley et al. lies at the foundation of this work. They introduced multiple approaches for structuring the raw data coming from the internet and motivated techniques for extracting value out of it. This subsection is mainly based on two of their papers:

• Data Preparation for Mining World Wide Web Browsing Patterns [7]

• Grouping Web Page References into Transactions for Mining World Wide Web Browsing Patterns [6]

In the first one, the authors describe and discuss different Data Mining techniques, amongst which are association rules and clustering analysis. The latter is the focus of this paper, as it emphasizes the benefits of grouping similar users together. Based on those clusters companies can either develop a marketing strategy or execute one by targeting customers both online and offline. R. Cooley et al. share their theory that a certain web page can have one of two purposes for the end user - it is either a navigation page or a content page. The main message relevant for this thesis, however, concerns user transactions and how to extract what is actually relevant. The authors discuss three different modules for identifying transactions: the Reference Length Module, the Maximal Forward Reference Module and the Time Window Module. For this thesis, we focus only on the Maximal Forward Reference, initially proposed in [25]. This module takes into account the visitor ID, the timestamp of the visit and the page. The transactions, based on this module, are not related to the time the visitor spends on a page, but rather to the order of visited pages. A new transaction starts with a so-called forward reference, a page not in the current transaction, and a transaction finishes when the visitor goes back to a page which is already in the transaction. A group of consecutive URL visits will finish with a content page, and the pages leading to the end are navigational. An example from [6]: the sequence A,B,C,D,C,B,E,F,E,G would be split into 3 buckets: A,B,C,D; A,B,E,F; A,B,E,G. The content pages are D, F and G - the last pages visited before going backwards.
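To make the transaction-splitting rule concrete, the following is a minimal Python sketch of the Maximal Forward Reference heuristic as described above; it is an illustration, not the implementation used in [6] or in this thesis.

```python
def maximal_forward_references(pages):
    """Split one visitor's ordered page sequence into transactions
    using the Maximal Forward Reference heuristic [25]."""
    transactions, stack, moving_forward = [], [], True
    for page in pages:
        if page in stack:
            # backward reference: close the transaction if we were still moving forward
            if moving_forward:
                transactions.append(list(stack))
            # backtrack to the revisited page
            stack = stack[:stack.index(page) + 1]
            moving_forward = False
        else:
            stack.append(page)
            moving_forward = True
    if moving_forward and stack:
        transactions.append(stack)
    return transactions

# Reproduces the example from [6]:
# maximal_forward_references(list("ABCDCBEFEG"))
# -> [['A', 'B', 'C', 'D'], ['A', 'B', 'E', 'F'], ['A', 'B', 'E', 'G']]
```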

2.2 Relevant work in-house

The current phrasing of the main research question has to do with a recent research done within the company [40]. As the research question (RQ1) suggests, we aim at comparing the performance of our approach using the same data.

In the previous approach, from now on referred to as the baseline, the aim was to segment website visitors based on contextual data retrieved from the visited URLs. In order to achieve this, they proposed the Automated Visitor Segmentation (AVS) pipeline consisting of seven steps, which included reading (1), filtering (2) and sorting the data (3), followed by identifying transactions (4) and extracting information from them (5). Finally, unnecessary data was removed (6) and the data was clustered (7).

The exact data used in their research, as well as in ours, will be further discussed in Section 3.1. However, as the available data consisted of raw log files, the main focus of their pipeline (steps 1 to 6) is on data pre-processing and cleaning. Interesting in the pipeline is step 4, which is based on a method initially proposed by Ming-Syan Chen et al. [25] - Maximal Forward Reference.

The output of the first six steps would be similar to the data presented in Figure 1. Regarding the last step of their pipeline, data clustering, they provide an extensive literature overview of possible directions and different algorithms capable of clustering data in an unsupervised manner, due to the lack of labels or class specifications. The final decision was made based both on the advantages and disadvantages of the algorithms and on the type of data they had. This being said, the baseline approach was built around K-Modes clustering, hence the exact formulation of our RQ1. K-Modes is a variation of K-Means and was chosen based on the following points:

(1) The main functionality of the algorithm is to group the most frequent similar items.

(2) The algorithm supports both numerical and categorical data. In the baseline they directly used the cleaned categorical data as input for the algorithm.

(3) As the algorithm is related to K-means, it allows flexibility in the number of clusters, which is used as a parameter and can be set by the user.

The final output of the baseline approach would be structured data, containing the cleaned output from the first six steps together with numeric cluster labels.
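For illustration, a minimal sketch of such a K-Modes clustering step follows, assuming the third-party kmodes package; the baseline thesis [40] does not state which implementation or parameter values were used, so the package, parameters and variable names here are assumptions.

```python
# Hypothetical sketch of the baseline's clustering step (step 7 of the AVS pipeline),
# assuming the third-party `kmodes` package; not the code used in [40].
from kmodes.kmodes import KModes

# categorical_features: one row per visitor, cleaned categorical columns from steps 1-6
km = KModes(n_clusters=28, init="Huang", n_init=5)
cluster_labels = km.fit_predict(categorical_features)  # one numeric cluster label per visitor
```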

We will be going back to the baseline model to discuss their final results in the Evaluation section.

2.3 Neural Networks

In Section 3.1 Data Description we will further discuss the structure of the provided data. Nevertheless, we should point out that the data at hand represents sequences of words, without any categories or classes attached to them, and arguably without any context. As already emphasized, such data points to the use of unsupervised learning.

Neural Networks consist of sets of algorithms and are inspired by the human brain and the neurons in it. In different sources, the neuron is regarded as either an information carrier or the working unit in the brain, which is also similar to its purpose in a Neural Network. Simply put, from a programming perspective the neuron is a unit which takes some input, performs a calculation on this input and sends out the result. A Neural Network is basically a set of neurons connected together. Figure 1 shows a simple Neural Network, created with [22].

Figure 1: Simple Neural Network Representation.

As shown in Figure 1, there are arrows leading from the input layer to the output layer. For example, assume we have an image as input. The network will then "evaluate" it pixel by pixel and the final output will be, for instance, a class label, such as dog. The "evaluation" is not a simple summation but rather a processing architecture involving different functions, set-up variations and mathematics; going deep into it is out of the scope of this research. The second main algorithm involved in a Neural Network is backpropagation. Simply put, it is how the network learns. Instead of going forward, the network goes backward and the weights of each neuron, the arrows in the figure, are adapted in order to minimize the error. The error itself is based on the result of the forward propagation and the actual ground truth, which, going back to our image example, is the actual label of the image. To summarize, the network learns which features are important per class/label. This also summarizes supervised learning, where a prediction is made based on examples.

From the basic explanation of Neural Networks we see that they are, in a way, a supervised approach. However, they can also be used in an unsupervised manner in order to provide a better representation of the input data. Considering this, and as we only have unlabeled data, we decided to investigate the possibility of generating such a representation of our data and clustering it.

Although this approach is not new, there are certain points which need to be taken into account. For example, Dundar et al. [8] suggest an approach which incorporates both k-means and Neural Networks; however, they are using images and they have labels. Besides being used for training, the unique count of the labels can be used as the number of clusters needed as a parameter for the algorithm. Another usage of the labels is to calculate the accuracy and quality of the generated clusters, by comparing the predicted ones against the ground truth.

Another important point to take into account is the choice of Neural Network, which is affected by multiple things such as the data at hand and the expected outcome. There are multiple sources and guidelines; for example, Angus [9] proposes criteria for choosing the best Neural Network and discusses this problem together with solutions.

Considering this, a workaround is needed for the lack of labels with regard to the training of the Neural Network. We also need to consider the type of data, which was mentioned at the start of this subsection. Starting from the latter, as discussed in [13], short texts are limiting in the sense that they lack the syntax and grammar used in proper text. Secondly, short texts lack the statistical information needed for proper use of statistical approaches like topic modelling and, as the authors state, such texts are ambiguous and thus hard to interpret. In our work, we are dealing with separate words extracted from URLs. Regarding the unsupervised training of Neural Networks, Jiaming Xu et al. [42] propose an approach worth investigating. In their paper [42], they suggest an architecture capable of learning the most important features without the use of any labels.

Motivated by their work, we decided to incorporate their approach for training a Neural Network and investigate whether or not we can firstly improve on their results and secondly make use of the approach for our own dataset.

3 Methodology

As already mentioned, this research is rather specific, as at its basis it is constructed around data from one of the main products of the sponsoring company - Experience Manager. Another point making our research specific is the fact that we focus only on the use of CNNs. Initially CNNs were recognized and known for their performance in image related tasks; however, as Wang et al. [41] suggest, they are capable of learning local features from words and phrases. The above statement also appears in the work of Yoon Kim [18], who compares the performance of different simple CNNs on seven tasks, two of which are sentiment analysis and question classification. The proposed method from Yoon Kim improved on 4 out of those 7 tasks. In addition, the approach used for training the Neural Network is based on the work of Jiaming Xu et al., and as we also aim at comparing our results with theirs, we used the same Neural Network type - CNN.

The rest of this section is split into sub-sections, each aiming at providing part of the whole process and further building on the content from above.

3.1 Data Description

The available in-house data was generated from one of the company's products, a Content Management System - Experience Manager. Whenever a client requests this product, the tool is deployed based on their requirements. One of the things stored by the system are log files of the interactions of the visitors with the website.

Initially two datasets were provided, one from the sponsoring company's website and one from a client website. Both datasets had the same features per user; however, the client dataset was multilingual, which differed from the initial idea to focus on content in English, similar to the previous research.

This being said, the preferred dataset was generated based on the usage of the company's website - bloomreach.com - and contained the following information:

(1) Unique identifier, which distinguishes site visits/sessions;


(2) Unique user identifier, which distinguishes users and is stored in a cookie;

(3) Location information, which consists of country, city, latitude and longitude;

(4) Information about the day of the week the visit was made, as well as a separate timestamp per activity per user. This allows us to follow the path described in section 2.1 Web Mining;

(5) Browser used;

(6) Referrer page - the page which got the user to the 'current' page. An example of a referrer page is google.com.

According to the earliest timestamp, the first record is from the 29th of August 2016 and the last record is from the 15th of April 2019. In total the log file contains 12 610 512 rows. Based on the visitor identifier there are 6 442 700 unique entries.

A sample, without the visitor identifier, can be seen in Table 1.

3.1.1 Data pre-processing As Karl Groves [10] explains, log files are not initially created for usability analysis, which points to the need for data cleaning. This is also seen in the pipeline architecture of the baseline method, where 6 out of 7 steps are data pre-processing related. In this research we aim at comparing our results with, and preferably extending, the baseline [40]. From this perspective, we evaluated the pre-processing steps that are part of the baseline, and although we used the same order of actions, such as reading, sorting etc., we tried to improve on them. The improvement, in our opinion, lies in the way the keywords are extracted from the URL, per user. Figure 2 shows an example of a URL.

Figure 2: Sample URL Example.

The domain related part is ignored as it brings no value. The valuable information, however, is filtered via a set of regular expression rules before it is tokenized, stemmed and filtered for stop words. The example from Figure 2 gives us customer, police and national as keywords. This, combined with the correct grouping of visits, as explained in section 2.1, should be sufficient to allow us to see the interest of the visitor and from there build a valuable visitor feature.
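As an illustration of this step, the following is a minimal sketch of such a URL keyword extraction, assuming NLTK for stop words and stemming; the exact regular expressions and token filters used in our pipeline differ.

```python
import re
from urllib.parse import urlparse

from nltk.corpus import stopwords      # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english")) | {"http", "https", "www", "html"}
STEMMER = PorterStemmer()

def url_keywords(url):
    """Extract keywords from the path of a URL: drop the domain, split on
    non-letter characters, remove stop words and URL noise, then stem."""
    path = urlparse(url).path.lower()                 # ignore the domain part
    tokens = re.split(r"[^a-z]+", path)               # split on digits, slashes, dashes, dots
    tokens = [t for t in tokens if len(t) > 2 and t not in STOP_WORDS]
    return [STEMMER.stem(t) for t in tokens]
```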

The pre-processing will be revisited in the Evaluation, where we compare the results from the baseline pre-processing and ours.

3.1.2 Web scraping Although the initial idea, as explained until now, was to focus primarily on using only the URL, we decided to add an additional data source in order to compare performance. The summarized results can be seen in section 4 Evaluation and the extended evaluation can be seen in Appendix A. Nowadays the importance of the URL itself has somewhat decreased. For example, in a Google search the majority of people will look at the provided summary of the page, rather than at the specific link itself, the content after the domain.

Based on this, we can formulate the hypothesis that the actual content will be more helpful for segmenting users than the URL. To test the hypothesis, a framework was developed, capable of following a given URL and scraping the content. As Julia Kho [17] explains, web scraping is a technique to access and extract information from a given website.

Our framework consisted of a so-called spider, the actual scraper, a main repository and two separate repositories. The main repository was straightforward and was used to store the whole text content of the page, excluding any navigational bars, comments and other non-related information. Besides the content, the extraction date was also stored, in case the scraped page did not contain a date identifier. This was done for the sole purpose of having a way to refresh the repository. Depending on the website at hand, the refresh can be done either after a new system release or after a certain time period. In our case, the website is mainly related to documentation, meaning the content would only change around product updates or releases. The two separate repositories were stored in the form of dictionaries, as follows:

(1) {Scraped URL: keywords from that page}

(2) {Scraped URL: summary of that page}

The keywords and the summaries were extracted using a ranking algorithm applied to the available textual content [28]. Important to mention is that there was a separate pre-processing step for the scraped content. This was needed as the raw scraped text also contained Hyper Text Markup Language (HTML) tags. Besides this, some of the pages were documentation related, so we had code snippets mixed with text. The pre-processing here was based on a set of regular expressions and web-related programming libraries.
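A minimal sketch of this step follows, assuming BeautifulSoup for tag stripping and the TextRank-based keywords function from gensim 3.x, the library documented in [28]; the exact cleaning rules used in our framework differ.

```python
from bs4 import BeautifulSoup
from gensim.summarization import keywords  # available in gensim 3.x

def page_keywords(raw_html, n_words=10):
    """Strip HTML tags from a scraped page and extract ranked keywords."""
    text = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ")
    return keywords(text, words=n_words, split=True, lemmatize=True)
```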

The two repositories described above were merged with our user data based on the visited URLs, thus creating additional features for each user. However, in the process of working we realized that using the summary of a page would over-complicate things and does not make sense with the current research in mind; instead it is left for future work. The keywords, on the other hand, proved more useful, as is shown in section 4 Evaluation.

3.2 Methods

This section is split into three main parts. The first part covers the steps taken to prepare the data for the Neural Network, whereas the second describes the idea behind the model design. The third part explains the steps taken to expand our research: initially we were only interested in using the keywords extracted from the URLs, but we decided to investigate whether it would make more sense to use the actual content of the URL by scraping the page.

3.2.1 Network Input Following the pre-processing steps described in section 3.1.1 Data pre-processing, we produced a set of keywords per user. However, those keywords were still represented as text, and in order to be used by the Neural Network we transformed the text into numeric values using a deep learning library. The initial set of keywords was transformed into a sequence in which each word was replaced by an index value, based on a word index dictionary. For example, 'bloomreach' was replaced by 1, whereas 'apache' was replaced by 228. The dictionary was created according to the frequency of each word in the whole corpus: lower indexes point to words which are more common and appear more often in the given corpus. [5] Based on the new sequences the maximal length was taken, which was used to generate a padded sequence for each element in the corpus.

4

Page 6: Unsupervised website visitor segmentation based on

Information Studies - Data Science’19, July 2019, Amsterdam,The Netherlands

Table 1: Data sample from the in-house data.

timestamp | pageUrl | NewVisit | pageId
2016-08-29 18:46:05.809 | https://www.onehippo.org/library/administratio... | True | hst:pages/documentation
2016-08-29 18:46:03.111 | https://www.onehippo.com/en/digital-experience... | True | hst:pages/Digital-experience-platform
2016-08-29 18:46:09.518 | https://www.onehippo.org/ | True | hst:pages/home
2016-08-29 18:46:11.279 | https://www.onehippo.org/ | True | hst:pages/home
2016-08-29 18:46:14.663 | http://www.onehippo.com/connect/boston | True | hst:pages/boston

If a given element's length was lower than the maximum, 0's were added to pad its sequence.
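The thesis does not name the exact library calls used for this step; the following is a minimal sketch of how the same transformation can be expressed with Keras' text utilities, where variable names and the padding side are assumptions.

```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# user_keywords: one space-separated keyword string per visitor
tokenizer = Tokenizer()
tokenizer.fit_on_texts(user_keywords)              # word index ordered by corpus frequency
sequences = tokenizer.texts_to_sequences(user_keywords)

max_len = max(len(seq) for seq in sequences)       # maximal sequence length in the corpus
padded = pad_sequences(sequences, maxlen=max_len)  # shorter sequences are padded with 0's
```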

In parallel, an embedding matrix was initialized. This matrix is based on word embeddings, vector representations of words learned via a Neural Network. The word embeddings are publicly available through the work of Mikolov et al. [24]

The next step is to combine the sequence matrix, which was generated earlier, with a weighting factor in order to account for each feature (word) in our sequence. We ran tests using all four approaches (Binary, Count, Frequency and TF-IDF) as a weighting factor; however, we only report the best-performing results.

• Binary: Having the whole corpus, evaluates each entry (separate text/document) and returns 1 for each word from the corpus which is in the given entry. Respectively, 0 is returned if the word is not in the entry.

• Count: Following the same logic as the binary approach, it returns 0 if the word is missing; however, if the word appears it returns the number of occurrences. Important here is to account for stop words, such as 'the', 'a', etc.

• Frequency: 0 is returned if the given word is not in the processed text/document. If the word appears, the proportion of its occurrences against the total length of the text/document is returned.

The resulting matrix was then normalized in order to ensure that all values have a common scale. [39] Following the normalization, the normalized matrix was combined with the embedding matrix. The last step of the pre-processing is to follow the approach from the baseline method and binarize the product of the two matrices. As it is based on vector representations, in some cases a component has a negative value, which after the binarization is changed to 0, while all positive components get the value 1 instead.
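Put together, a minimal sketch of this target construction could look as follows; the normalization scheme (per-row L2) and the variable names are assumptions, since the thesis does not spell them out.

```python
import numpy as np

def binary_targets(weight_matrix, embedding_matrix):
    """Build binary training targets from a (n_docs x vocab) weighting matrix
    (e.g. the 'binary' mode above) and a (vocab x embed_dim) embedding matrix."""
    # normalize each document row to a common scale (assumed here: L2 norm)
    norms = np.linalg.norm(weight_matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    normalized = weight_matrix / norms
    # weighted embedding per document
    doc_embedding = normalized @ embedding_matrix      # shape: (n_docs, embed_dim)
    # binarize: positive components become 1, negative (and zero) components become 0
    return (doc_embedding > 0).astype("float32")
```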

The remainder of this subsection reflects on what was done in the approach of Jiaming Xu et al. In their work, they use a binarized (0 or 1) representation of the Average Embedding in order to train the Neural Network, Figure 3. The binary representation is used instead of labels. In their model they used TF-IDF as the weighting factor for the features in the Average Embedding. Term frequency - inverse document frequency (TF-IDF) is used to represent the importance of each word with respect to its occurrences in the text. [27] Rephrasing, TF assumes that if a word occurs a lot in a given text, then this word should be descriptive of the text. IDF, on the other hand, reasons that if a word appears a lot in the given text/document, as well as in others, then most probably this word is not unique for the text and brings no meaning. Stop words, for example, appear a lot in a text but bring arguably no context about the content. A high TF-IDF score means that the word is rare and specific for the document at hand.

Figure 3: The architecture proposed by Jiaming Xu et al. [42]

3.2.2 Network Design This section covers the design of our Neural Network, together with the reasoning for our decisions. Starting from the top - the final design, Figure 4 represents the structure of our CNN.

Figure 4: Our proposed architecture.

We incorporated the same approach used in the works of Jiaming Xu et al. and Yoon Kim and trained our CNN on top of pre-trained word vectors. The output from section 3.1.1 - simple keywords - is used as input for the steps described in 3.2.1 Network Input and then, instead of labels, for the training of the network.


Our neural network starts with an embedding layer, which is a vital part when dealing with text in Neural Networks. In our work we are not training our own embeddings; rather, we simply load the embedding matrix mentioned in section 3.1.1 as weights. [4] The explanation of the Embedding layer in the Keras documentation is rather vague [38] and simply states 'Turns positive integers (indexes) into dense vectors of fixed size'. Jason Brownlee [4] provides a good summary of what embeddings are and their purpose. In a nutshell, their purpose is to present words as dense vectors, based on the representation of each of the words in a continuous vector space. This approach is a better alternative to one-hot encoding, which represents each document as a vector with the size of the vocabulary, containing mostly 0's. One-hot encoding works on the principle of whether the word is in the document or not - if it is not, a 0 is applied.

Following different discussions and sources, such as the work of Nitish Srivastava et al. [33] on Dropout as a way to prevent Neural Networks from over-fitting, we directly apply a Dropout layer to the output of the Embedding layer. Again, going back to the Keras documentation, the purpose of Dropout is to randomly set a fraction of the input units to 0 during training. The setting, the dropout fraction, is a hyperparameter which can be tuned. [37] As presented in Figure 4, following the Dropout we have three convolutional layers on the same level.

There are a few things we considered in this part of our network design. As numerous sources explain, such as Nils Ackermann in [1] or Jason Brownlee [3], 2D CNNs have been used in image processing, where the incoming input is of two-dimensional format. 1D CNNs, however, have been used for other tasks, such as Natural Language Processing (NLP) and our case, where the input data is of a different format. Regarding the use of three layers on the same level, it is important to go back to the idea behind the convolutional layer, which is to simply apply a filter, or a set of filters, to an input. [2] To further build on the use of filters in NLP tasks, we relate to the work of Siwei Lai et al. [20]. In the paper, the authors argue that in earlier studies of CNNs in NLP, researchers would rely on filters with a fixed size; however, when using such fixed sizes, one is prone to either lose information, when the size is too small, or have, as Siwei Lai et al. point out, an enormous parameter space, when a larger size is used instead. With this in mind, and motivated by both the work of Yoon Kim [18] and Ye Zhang et al. [43], we made use of a set of three filters, each with a different size. Figure 5 is a shortened version of the work of Ye Zhang [43] and it shows the idea behind the use of different filter sizes. Essentially, each filter will capture a different set of features. As an example, we can look at the work of Yoon Kim [18], where he shows the formula for generating a feature. A feature c_i, coming from a given word window, is given by equation 1:

$c_i = f(w \cdot x_{i:i+h-1} + b)$   (1)

where f represents a non-linear activation function, and w stands for the filter, which is applied to the given window of words and then combined with the bias b. This is a single feature, a single application of the filter. Once the filter has been applied to all word windows, a feature map is created - equation 2.

$c = [c_1, c_2, \ldots, c_{n-h+1}]$   (2)

Figure 5: Shortened version based on the work of Ye Zhang. [43]

Following the features, we apply pooling. As Harsh Pokharna [26] explains, the idea of pooling is to reduce the spatial size of the representation and the number of features. The work of Alon Jacovi [15] provides an extensive overview for understanding how CNNs are used for text classification. Based on this work and other readings we settled on Global Max Pooling. The idea of max pooling is to retrieve the highest value from a feature map.

Following the concatenation of the pooled results, we finish with a fully connected layer, followed by dropout and a dense layer used for the final prediction.
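To make the design concrete, here is a minimal Keras sketch of a network with this shape (frozen pre-trained embedding, dropout, three parallel 1D convolutions with different filter sizes, global max pooling, concatenation, dense layers). The filter sizes, filter counts, dense width, activations and loss shown here are assumptions, since the tuned hyperparameter values are not reported.

```python
from keras.layers import (Input, Embedding, Dropout, Conv1D,
                          GlobalMaxPooling1D, Concatenate, Dense)
from keras.models import Model

def build_cnn(max_len, vocab_size, embed_dim, embedding_matrix, target_dim,
              filter_sizes=(3, 4, 5), num_filters=128, dropout_rate=0.5):
    inputs = Input(shape=(max_len,))
    # embedding layer initialised with the pre-trained word vectors and kept frozen
    x = Embedding(vocab_size, embed_dim, weights=[embedding_matrix],
                  input_length=max_len, trainable=False)(inputs)
    x = Dropout(dropout_rate)(x)
    # three 1D convolutions on the same level, each with a different window size
    pooled = []
    for size in filter_sizes:
        conv = Conv1D(num_filters, size, activation="relu")(x)
        pooled.append(GlobalMaxPooling1D()(conv))
    x = Concatenate()(pooled)
    x = Dense(256, activation="relu")(x)
    x = Dropout(dropout_rate)(x)
    # the output is trained against the binarized average-embedding targets instead of labels
    outputs = Dense(target_dim, activation="sigmoid")(x)
    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```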

4 Evaluation

4.1 Experimental Environment

The evaluation was conducted on a local machine equipped with an Intel Core i7 (2.5 GHz) processor, 16GB RAM (1600MHz DDR3) and macOS Mojave (Version 10.14.5) as operating system. The development itself took place on the same machine, where all the hyperparameters for the Neural Network were tuned via GridSearch, according to the machine's specifications. The code was written in Python 3.7.3 (Anaconda Distribution). The Convolutional Neural Network was developed with Keras, as it enables fast prototyping and testing. [30]

4.2 Model Comparison

In this subsection we aim at comparing our work with the results presented in the work of Jiaming Xu et al. Following their evaluation, we compared the performance based on the following two metrics:

(1) Accuracy (ACC): The proportion of correct predictions made by the model. [35] In our case, we check the k-means labels against the ground truth.

(2) Normalized Mutual Information (NMI): A measure describing the relatedness of two variables. It measures how much one of the variables is able to describe the other one [11]. The end value is normalized between 0 (no mutual information) and 1 (perfect correlation). [32]

The metrics focus on evaluating the quality of the clusters. Please note that when using a clustering algorithm, e.g. k-means, each time you re-run the algorithm the labels and the assignments themselves can change; this is known as the assignment problem. To account for this, Jiaming Xu et al. used a combinatorial optimization algorithm - the Hungarian algorithm [5] [21].
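A minimal sketch of how both metrics can be computed follows, using SciPy's implementation of the Hungarian algorithm to find the best one-to-one mapping between cluster labels and classes; the variable names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one matching between cluster labels and ground-truth classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    contingency = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        contingency[p, t] += 1
    rows, cols = linear_sum_assignment(-contingency)   # Hungarian algorithm, maximizing matches
    return contingency[rows, cols].sum() / len(y_true)

# acc = clustering_accuracy(ground_truth, kmeans_labels)
# nmi = normalized_mutual_info_score(ground_truth, kmeans_labels)
```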

The comparison is based on the publicly available Stack Overflow dataset from Kaggle [34], which consists of question titles and the corresponding labels. Table 2 shows the comparison of the results reported by Jiaming Xu et al. against ours. Their approach is referred to as Self-Taught Convolutional Neural Networks for Short Text Clustering (STCC). The results of STCC are based on the average of 5 runs, with 100 clusters [42]. Accordingly, we did the same.

Table 2: Comparison of Model Performance based on Stack Overflow Dataset.

Method | ACC(%) | NMI(%)
STCC | 51.13 ± 2.80 | 49.03 ± 1.46
Our Approach | 52.83 ± 2.68 | 49.09 ± 2.32

An extended version of our results can be seen in Table 3. According to the results from Table 3, we can state with 95% confidence that our mean Accuracy falls within [52.30, 53.36] and our mean NMI within [48.64, 49.55].

As can be seen in Table 2 and Table 3, we were able to achieve an improvement in the final performance. In our opinion the improvement is due to two things, one of which is the network design. The other reason is the weighting factor used in the creation of the training labels. In their work, Jiaming Xu et al. relied on TF-IDF as a weighting factor, whereas we report the scores based on Binary, as it showed the best results.

4.3 Clustering comparison

In this subsection we compare our results with the results from the research previously done within the sponsoring company [40].

The comparison will be split into several parts. The first part will account for the visual and computational performance of the approaches used for extracting keywords from the URL. Following the keyword extraction, the cluster results will be discussed. The previous in-house research will be referred to as the baseline and the number of clusters will be denoted by 'k'.

4.3.1 Keywords extraction The first takeaway from the comparison is that our approach copes with the occurrences of numeric values via a set of regular expressions. An example URL: 'https://www.onehippo.org/7_8/library/architecture/hippo-cms-7-architecture.html'.

Another important point is that, compared to the baseline, where the domain variations are hard-coded, our approach automatically recognizes the domain and focuses only on the important part, as shown in Figure 2. This, however, comes at an increase in execution time, as shown in Table 4.

In addition, our approach is better at filtering the final set of keywords by not allowing symbols, such as ':', or URL related words, like 'http' or 'www'. A sample can be seen in Figure 6.

Figure 6: Comparison between keyword extraction approaches, with emphasis on numerical values, symbols and URL related words.

4.3.2 Cluster Evaluation As we do not have ground truth labels, for the evaluation of the clusters we used the Silhouette Coefficient, proposed by Rousseeuw [29]. This coefficient aims at providing a graphical aid to the interpretation and validation of cluster analysis. Table 5 gives an overview of the interpretation of the different scores. This was done similarly to [40].

In case the score is below 0, the data point does not belong to this cluster.
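For reference, a minimal sketch of computing these scores with scikit-learn follows; the feature matrix here is a random placeholder for our CNN output, and the cluster labels come from k-means.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# placeholder data: in our setup `features` is the CNN output, one vector per visitor
features = np.random.rand(1000, 64)
k = 28

kmeans = KMeans(n_clusters=k).fit(features)
average_score = silhouette_score(features, kmeans.labels_)   # the dotted average line in the plots
per_point = silhouette_samples(features, kmeans.labels_)     # values below 0 indicate misassigned points
```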

4.3.2.1 Baseline Approach Summary: Results and Reasoning The summary is based on the initial work, which can be found in [40]. In section 2.2 we provided an overview of the baseline approach; in this section we focus on their results.

Step 7 of the AVS pipeline segments the visitors into clusters, using the k-modes clustering described in section 2.2. The quality of the clusters was evaluated based on the Silhouette Coefficient, and for the actual calculation of the score they used a pre-computed distance matrix. This matrix captures how close the items/points are. However, due to the capacity of the function used to calculate the matrix, they used only 10 000 rows from the initial dataset.

In their work several values for 'k' were investigated - 28, 80, 112, 27 and 14 - together with their Silhouette scores. The interesting thing in their work is the fact that they had a huge amount of data in cluster 0, situated in the negative section, Figure 7.

Figure 7: Silhouette Score for k = 28.


Table 3: Comparison of Model Performance based on Stack Overflow Dataset.

Run | Min ACC(%) | Max ACC(%) | Mean ACC(%) | Mean NMI(%) | Std. Deviation (ACC) | Std. Deviation (NMI)
1 | 46.8 | 63 | 53.3 | 49.42 | 2.92 | 2.47
2 | 46.4 | 57.8 | 51.8 | 48.24 | 2.53 | 2.19
3 | 46 | 59.9 | 52.34 | 48.94 | 2.69 | 2.35
4 | 47.3 | 63.7 | 53.41 | 49.56 | 2.64 | 2.27
5 | 47.5 | 60 | 53.3 | 49.31 | 2.62 | 2.3
Average | 46.80 | 60.88 | 52.83 | 49.09 | 2.68 | 2.32

Table 4: Execution Time Comparison - Pre-processing.

Processed Data | Our Approach | Baseline Approach
10000 | 2.9 seconds | 1.6 seconds
100000 | 29.6 seconds | 19.3 seconds
1000000 | 4 minutes 54 seconds | 2 minutes 52 seconds

Table 5: Silhouette Coefficient Interpretation. [40]

Range | Interpretation
0.71 - 1.0 | Strong structure has been found
0.51 - 0.7 | Reasonable structure has been found
0.26 - 0.5 | Structure is weak, could be artificial
< 0.25 | No substantial structure has been found

After assuming that cluster 0 represents a 'noise bucket', which contains all the data unfit to be clustered, the cluster and the corresponding data were removed, with the assumption that the rest of the clusters would remain the same and the average score, the red dotted line, would increase significantly.

After removing the unfit data, they ran their clustering algorithm with k = 27, to account for the deleted cluster, and were able to achieve an increase up to 0.61 in their average Silhouette score, Figure 8. A few additional plots from the baseline approach can be seen in Appendix A.

Figure 8: Silhouette Score for k = 27, unfit data removed.

The main takeaways from their results are:

(1) Cluster 0 contains noisy data, which cannot be clustered and thus should be used with caution.

(2) Leaving the noise aside, the remaining data holds a good structure. 'Good' is based on the score of the clusters and the interpretation from Table 5, and describes the fact that the data points are more related to the other points in the same cluster than to data points from other clusters.

(3) In their work, before the noise removal, the average score (dotted line) went above 0.2 for 112 clusters, which shows that a high number of clusters is required in order to increase the average score. In general, the Silhouette Coefficient will increase with the number of clusters.

(4) With the removal of cluster 0, an increase in the average score is observed, based on the structure of the remaining clusters. From Figure 8, we can see that roughly half of the clusters are below 0.6, which according to Table 5 might be considered weak or artificial structure.

4.3.2.2 Our Results The scores from the baseline method were based on 10 000 rows, and in order to ensure the adequacy of the comparison we sampled the same amount of data, as we used the same dataset. Interesting to point out is that in our case we did not have a single cluster with noise; rather, every cluster was partially filled with noise. Considering this, we were not able to approach the problem as it was dealt with in the baseline, due to the fact that removing whole clusters would also remove good data. Instead, we decided to utilize Density-based spatial clustering of applications with noise (DBSCAN) for noise removal. We were able to find multiple sources, such as the works of Jiapeng Huang et al. [14] and Li Ma et al. [23], in which DBSCAN has proven to be useful for noise removal. As the idea was to generalize the solution, we did not tune the two parameters of the algorithm; rather, we used the default set-up suggested in the documentation of scikit-learn [31].
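A minimal sketch of this noise-removal step with scikit-learn's default DBSCAN parameters follows; the variable names are illustrative and the feature matrix is a placeholder.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# features: one vector per visitor, as used for the k-means clustering above
features = np.random.rand(1000, 64)          # placeholder data for the sketch

db = DBSCAN().fit(features)                  # default parameters: eps=0.5, min_samples=5 [31]
noise_mask = db.labels_ == -1                # DBSCAN labels noise points with -1
clean_features = features[~noise_mask]       # only non-noise points are re-clustered with k-means
```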

Due to the size of the plots, the rest of this sub-section contains a summary of our results, including the main takeaways. The included plot is a direct comparison to the plot containing the best result reported in [40], which was for k = 27. The extended version of the evaluation can be seen in Appendix A, where we include more tests and visualizations to defend our conclusions.

Figure 9 shows that for k = 28, with noise, our average Silhouette score is 0.296, which, although low according to Table 5, is still an improvement compared to the baseline approach.

In the baseline approach, Georgios Vletsas [40] manually removed the noise, which, as they describe, accounts for a large part of the dataset, yet they did not specify an exact percentage. In our case, the DBSCAN approach for noise removal similarly resulted in removing a large part of the data: roughly 40% for the keywords based on the URL and roughly 30% for the keywords based on scraped content. During a manual check we noticed that a lot of the large transactions were removed, as well as others which had mixed signals in them. An example of the results can be seen in Figure 10, which shows the Silhouette representation for k = 27, similar to the baseline approach.


Figure 9: Silhouette Score without removing noise on the keywords extracted from URLs, for k = 28.

Figure 10: Silhouette Score with removed noise on the keywords extracted from URLs.

An example of the final output can be seen in Figure 11, where the visitorId is mapped to the corresponding numeric cluster and a set of keywords representative of the cluster.

Figure 11: Filtered example of the final output of our approach with corresponding cluster labels.

As an addition, we evaluated our approach on keywords from the visited page, incorporating web scraping into our setup. However, as this was not done in the baseline method, we present the results in Appendix A and only summarize them here.

The main takeaways from our results are:

(1) In the baseline paper, they did not provide an overall plot showing the score per number of clusters, so we had to draw our conclusion from the provided plots, where the highest average score was for k = 112, a bit over 0.2. The average score is based on the structure of all clusters, and based on this our 'noisy' clustering is better than the reported one. Although we had a better average score, the lack of a single 'noise' bucket resulted in noise within the clusters. This resulted in poor quality (score below 0.6) for less than half of our clusters.

(2) We did not have a single 'noise' bucket; however, by using the default parameters of DBSCAN we were able to clean the overall noise and then outperform their reported score. They indeed do not specify that this is the best possible score, but they do not report a higher one.

(3) Using scraped data drastically increased the input shape for the Neural Network; however, it was worth it, as it showed better results.

(4) Taking it one step further, we mapped the visitorId to a cluster together with the corresponding keywords for this cluster.

5 Conclusion

The main conclusion of this thesis is that clustering the learned features from Convolutional Neural Networks outperforms the use of k-modes clustering, both with and without noise removal. We base this conclusion on the Silhouette score, according to which our clusters have a better quality. Besides this, we were able to extract cleaner keywords from the URL by creating a new pre-processing function. Although our clusters are better, domain knowledge is still needed in order to make better sense of the clusters, as also suggested in [40]. This being said, the cluster labels can still be used as suggestions for new or missing clusters for the system administrators.

An interesting point is that in the baseline approach they had a single cluster containing most of the noise, which allowed them to simply remove the noisy cluster. In our case, we had to incorporate another clustering algorithm, DBSCAN, as a noise removal step. Throughout our work we noticed that in some cases the transaction function suggested by Ming-Syan Chen et al. [25] creates large transactions, which cannot be clustered into a single cluster, as they have multiple representatives. With this in mind, it would be worth further investigating the grouping of URL visits. Another point worth investigating is the use of other Neural Networks in combination with either keywords from the URL or keywords from scraped content. For example, Recurrent Neural Networks can be used to build our own embeddings.

Last but not least, it might be useful to investigate another approach for noise removal or to fine-tune DBSCAN for the data at hand. Bearing in mind that the default settings of the clustering algorithm were able to improve our results drastically, we expect fine-tuning of its parameters to bring even more value.

As an addition, we conclude that using keywords from the scraped content of a visited page yields better results than simply clustering visitors based on keywords from only the URL. Therefore we can say that the scraped keywords are more descriptive. This is indeed interesting and would help in the user segmentation. Keep in mind, however, that we were not able to find an open dataset suitable for validating this.

6 Acknowledgments

I would like to express my gratitude to Michael Metternich for giving me the opportunity to be part of BloomReach and to work on this project, as well as for all the discussions and guidance. I would also like to especially thank Chang Li for being my supervisor, for always finding time for a short discussion and for his feedback throughout the project. I would also like to thank Dr. Maarten Marx for agreeing to be my second examiner. Last but not least, I would like to thank my family, close friends and colleagues for all the support and help they have given me.

References

[1] Nils Ackermann. 2018. Introduction to 1D Convolutional Neural Networks in Keras for Time Sequences. (Sep 2018). https://blog.goodaudience.com/introduction-to-1d-convolutional-neural-networks-in-keras-for-time-sequences-3a7ff801a2cf
[2] Jason Brownlee. 2019. A Gentle Introduction to Convolutional Layers for Deep Learning Neural Networks. (Apr 2019). https://machinelearningmastery.com/convolutional-layers-for-deep-learning-neural-networks/
[3] Jason Brownlee. 2019. How to Develop 1D Convolutional Neural Network Models for Human Activity Recognition. (Apr 2019). https://machinelearningmastery.com/cnn-models-for-human-activity-recognition-time-series-classification/
[4] Jason Brownlee. 2019. How to Use Word Embedding Layers for Deep Learning with Keras. (May 2019). https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
[5] Wikipedia Community. 2019. Hungarian algorithm. (May 2019). https://en.wikipedia.org/wiki/Hungarian_algorithm
[6] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. 1997. Grouping web page references into transactions for mining world wide web browsing patterns. In Proceedings 1997 IEEE Knowledge and Data Engineering Exchange Workshop. IEEE, 2–9.
[7] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. 1999. Data Preparation for Mining World Wide Web Browsing Patterns. Journal of Knowledge and Information Systems 1 (04 1999). https://doi.org/10.1007/BF03325089
[8] Aysegul Dundar, Jonghoon Jin, and Eugenio Culurciello. 2015. Convolutional Clustering for Unsupervised Learning. CoRR abs/1511.06241 (2015). arXiv:1511.06241 http://arxiv.org/abs/1511.06241
[9] J. E. Angus. 1991. Criteria for Choosing the Best Neural Network: Part 1. (07 1991), 28.
[10] Karl Groves. 2007. The Limitations of Server Log Files for Usability Analysis. (Oct 2007). http://boxesandarrows.com/the-limitations-of-server-log-files-for-usability-analysis/
[11] Fred Guth. 2019. Mutual information. (Jun 2019). https://en.wikipedia.org/wiki/Mutual_information
[12] Kashmir Hill. 2016. How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did. (Mar 2016). https://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/#97b645366686
[13] Wen Hua, Zhongyuan Wang, Haixun Wang, Kai Zheng, and Xiaofang Zhou. 2015. Short text understanding through lexical-semantic analysis. Proceedings - International Conference on Data Engineering 2015 (05 2015), 495–506. https://doi.org/10.1109/ICDE.2015.7113309
[14] Jiapeng Huang, Yanqiu Xing, Haotian You, Lei Qin, Jing Tian, and Jianming Ma. 2019. Particle Swarm Optimization-Based Noise Filtering Algorithm for Photon Cloud Data in Forest Area. Remote Sensing 11, 8 (Apr 2019), 980. https://doi.org/10.3390/rs11080980
[15] Alon Jacovi, Oren Sar Shalom, and Yoav Goldberg. 2018. Understanding Convolutional Neural Networks for Text Classification. CoRR abs/1809.08037 (2018). arXiv:1809.08037 http://arxiv.org/abs/1809.08037
[16] Jhabel. 2019. Unsupervised learning. (Jun 2019). https://en.wikipedia.org/wiki/Unsupervised_learning
[17] Julia Kho. 2018. How to Web Scrape with Python in 4 Minutes. (Sep 2018). https://towardsdatascience.com/how-to-web-scrape-with-python-in-4-minutes-bc49186a8460
[18] Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 1746–1751. https://doi.org/10.3115/v1/D14-1181
[19] Kku. 2019. E-commerce. (May 2019). https://en.wikipedia.org/wiki/E-commerce
[20] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent Convolutional Neural Networks for Text Classification. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI'15). AAAI Press, Austin, Texas, Article 2886636, 7 pages. http://dl.acm.org/citation.cfm?id=2886521.2886636
[21] Tilman Lange, Volker Roth, Mikio Braun, and Joachim Buhmann. 2004. Stability-Based Validation of Clustering Solutions. Neural Computation 16 (07 2004), 1299–1323. https://doi.org/10.1162/089976604773717621
[22] Alexander Lenail. 2019. NN-SVG. (2019). http://alexlenail.me/NN-SVG/index.html
[23] Li Ma, Lei Gu, Bo Li, Sou yi Qiao, and Jin Wang. 2014. G-DBSCAN: An Improved DBSCAN Clustering Method Based On Grid. In Conference Papers. 23–28. https://doi.org/10.14257/astl.2014.74.05
[24] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. Proceedings of Workshop at ICLR 2013 (01 2013).
[25] Ming-Syan Chen, Jong Soo Park, and P. S. Yu. 1996. Data mining for path traversal patterns in a web environment. In Proceedings of 16th International Conference on Distributed Computing Systems. IEEE, 385–392. https://doi.org/10.1109/ICDCS.1996.507986
[26] Harsh Pokharna. 2016. The best explanation of Convolutional Neural Networks on the Internet! (Jul 2016). https://medium.com/technologymadeeasy/the-best-explanation-of-convolutional-neural-networks-on-the-internet-fbb8b1ad5df8
[27] Nikhil Prakash. 2019. Tf–idf. (May 2019). https://en.wikipedia.org/wiki/Tf–idf
[28] Radim Rehurek. 2019. gensim: topic modelling for humans. (Apr 2019). https://radimrehurek.com/gensim/summarization/keywords.html
[29] Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20 (1987), 53–65. https://doi.org/10.1016/0377-0427(87)90125-7
[30] Sayantini. 2019. Keras vs TensorFlow vs PyTorch | Deep Learning Frameworks. (May 2019). https://www.edureka.co/blog/keras-vs-tensorflow-vs-pytorch/
[31] scikit-learn developers. 2019. sklearn.cluster.DBSCAN. (2019). https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
[32] scikit-learn developers. 2019. sklearn.metrics.normalized_mutual_info_score. (2019). https://scikit-learn.org/stable/modules/generated/sklearn.metrics.normalized_mutual_info_score.html
[33] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15 (06 2014), 1929–1958.
[34] StackOverflow. 2012. Predict Closed Questions on Stack Overflow. (2012). https://www.kaggle.com/c/predict-closed-questions-on-stack-overflow/
[35] Google Developers Team. 2019. Classification: Accuracy | Machine Learning Crash Course | Google Developers. (2019). https://developers.google.com/machine-learning/crash-course/classification/accuracy
[36] HostingFacts Team. 2019. Internet Statistics and Facts (Including Mobile) for 2019. (2019). https://hostingfacts.com/internet-facts-stats/
[37] Keras.io Team. 2019. Core Layers. (2019). https://keras.io/layers/core/
[38] Keras.io Team. 2019. Keras Documentation. (2019). https://keras.io/layers/embeddings/
[39] @urvashilluniya. 2019. Why Data Normalization is necessary for Machine Learning models. (Apr 2019). https://medium.com/@urvashilluniya/why-data-normalization-is-necessary-for-machine-learning-models-681b65a05029
[40] Georgios Vletsas. 2018. Automated Visitor Segmentation and Targeting. Master Thesis. University of Amsterdam, Master Software Engineering, Science Park 904, Amsterdam, the Netherlands, 43. http://scriptiesonline.uba.uva.nl/document/660630
[41] Jenq-Haur Wang, Ting-Wei Liu, Xiong Luo, and Long Wang. 2018. An LSTM Approach to Short Text Sentiment Classification with Word Embeddings. In Proceedings of the 30th Conference on Computational Linguistics and Speech Processing (ROCLING 2018). The Association for Computational Linguistics and Chinese Language Processing (ACLCLP), Hsinchu, Taiwan, 214–223. https://www.aclweb.org/anthology/O18-1021
[42] Jiaming Xu, Peng Wang, Guanhua Tian, Bo Xu, Jun Zhao, Fangyuan Wang, and Hongwei Hao. 2015. Short Text Clustering via Convolutional Neural Networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing. Association for Computational Linguistics, Denver, Colorado, 62–69. https://doi.org/10.3115/v1/W15-1509
[43] Ye Zhang and Byron C. Wallace. 2015. A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification. CoRR abs/1510.03820 (2015). arXiv:1510.03820 http://arxiv.org/abs/1510.03820


A Extended Evaluation

A.1 Baseline Approach Extended

The rest of the baseline evaluation was around investigating the outcomes when k is set to a value lower than 27 (k = 14, Figure 12) and a value higher than 27 (k = 40, Figure 13). The smaller choice of k led to the creation of new noise, as there were not enough clusters to separate the remaining data. While investigating the latter, it became obvious that above 27 clusters the clusters were starting to get split into smaller ones - sub-clusters.

Figure 12: Silhouette Score for k = 14, after removing unfit data.

Figure 13: Silhouette Score for k = 40, after removing unfit data.

A.2 Our Results with in-house data

In this subsection we present some additional results with regard to the keywords extracted from URLs, with and without noise. In addition, we also present our results based on the keywords extracted from the scraped content of the page visited by the user. The last set of figures presents the overall movement of the average Silhouette score. For the purpose of comparison with the baseline approach, we chose the same numbers of clusters as the ones presented in the baseline evaluation.

Figure 14 and Figure 15 show the cluster quality for the keywords from the URL, for k = 80 and k = 112, with noise.

Figure 14: Silhouette Score without removing noise on the keywords from URLs, k = 80.

Figure 15: Silhouette Score without removing noise on the keywords from URLs, k = 112.

Figure 16 and Figure 17 show the cluster quality for the keywords from the URL, for k = 14 and k = 40, with noise removed based on the DBSCAN default parameters.

Figure 16: Silhouette Score with removed noise on the keywords from URLs, k = 14.


Figure 17: Silhouette Score with removed noise on the keywords from URLs, k = 40.

Figure 18 and Figure 19 show the cluster quality for the keywords based on scraped content, without removing noise, for k = 80 and k = 112.

Figure 18: Silhouette Score without removing noise on the keywords based on scraped content, k = 80.

Figure 19: Silhouette Score without removing noise on the keywords based on scraped content, k = 112.

Figure 20 and Figure 21 show the cluster quality for the keywords based on scraped content, with noise removed based on the DBSCAN default parameters, for k = 14 and k = 40.

Figure 20: Silhouette Score with removed noise on the keywords based on scraped content, k = 14.

Figure 21: Silhouette Score with removed noise on the keywords based on scraped content, k = 40.

Figure 22: Silhouette average score movement with noise on URL keywords.

Figure 22 shows the overall movement of the average quality of the clusters based on keywords extracted from URLs, with noise. The figure shows that our highest score is 0.437 for k = 120, whereas in comparison, before the noise removal, the baseline performance was around 0.2. As the figure shows, the score keeps growing with the number of clusters. We tested this statement by running a loop over possible cluster numbers up to 400. The score for k = 400 was 0.578 and the graph showed continuous growth. Figure 23 shows the overall score with removed noise, where the highest score is 0.999 with 100 clusters. We tested with a maximum of 120 clusters.

Figure 23: Silhouette average score movement without noise on URL keywords.

Figure 24 shows the average quality of the clusters based on keywords extracted from the scraped content, with noise. The figure shows that our highest score is 0.623 for k = 118. Figure 25 shows the overall score with removed noise, where the highest score is 0.93 with k = 64. We tested with a maximum of 120 clusters.

Figure 24: Silhouette average score movement with noise on scraped keywords.

Figure 25: Silhouette average score movement without noise on scraped keywords.
