International Review on Computers and Software (I.RE.CO.S.), Vol. 7, n. 3

Copyright © 2012 Praise Worthy Prize S.r.l. - All rights reserved

Interpreting Web Usage Patterns Generated Using a Hybrid SOM-Based Clustering Technique

Ammar M. Huneiti

Abstract – The rapid and huge growth of the web has emphasized the need to monitor the

behavior of web users and to identify their interest, knowledge, preferences, goals, etc. This paper

introduces a methodology for classifying users and pages of an educational online hypermedia

using a hybrid clustering technique based on Self Organizing Map (SOM) neural networks. This

paper also introduces an analytical cluster validation and interpretation approach to verify and

explain the generated clusters of users and pages. The implemented cluster validation process

utilizes a silhouette-based quantitative measure, while a combined data visualization and statistical

cluster interpretation technique is proposed. Several experiments have been carried out using real

data collected through special lab sessions of real students navigating an online tutorial.

Experimental results indicated that the proposed methodology was able to prototype users and to

recognize the association between pages based on their usage. Moreover, the topic of interest and

the users interested in these topics were also identified.

Keywords: Web usage mining, SOM, Cluster validation and interpretation, User modelling,

Adaptive hypermedia

I. Introduction

Nowadays, we are witnessing a huge increase in the

resources and services hosted on the web. A similar

increase in web sites and web users and in their

diversity is also evident. In the last few years, the huge

reduction in the prices of Internet subscriptions,

online services, and computer hardware has

resulted in an explosive growth in the number of

Internet users. Moreover, web sites are becoming more

complex with regard to their structure, provided

services, and the large number of documents that they

exhibit including diverse content of different media

types. This rapid growth has caused many concerns

related to the publishing of online hypermedia such as

cognitive overload and hypermedia disorientation. In

addition, issues such as author-enforced structures and

one-size-fits-all material are becoming more of a

concern and need to be reviewed. These problematic

issues are caused by the traditional and static methods

of authoring online hypermedia, where users are not or

cannot be considered in advance.

Until recently, according to [1], the author was

committed to the form as well as to the content of the

work well in advance of the actual time at which it was

presented. There is a consensus among researchers that

powerful and advanced authoring tools and technologies

such as fast graphics, portable smart devices, cheap

internet, and hypermedia editing and authoring software

cannot improve the quality of the published

material unless similarly powerful and advanced

authoring methods are utilised [2,3]. Thibeau [4] states

that “unless the information meets reader needs, in the

way reader needs to see it, these tools will never reach

their potential”. Technological advances in the field of

information processing, storage, presentation, and

retrieval have always had a great influence on the way

that online hypermedia is being authored and published.

In addition, data mining techniques are considered a

major tool that can recognize hidden usage patterns and,

therefore, provide very useful feedback about pages

and/or users of online hypermedia. Extracting these

patterns will almost certainly contribute to the re-

authoring or re-structuring of the online hypermedia in a

user adaptive manner and will provide the author with a

very useful source of information about the real needs

of the end-users.

As a result, the need to monitor the behavior of users

and their interest is growing. Consequently, the

identification of the association relationship between

different pages on the web and even different web sites

based on their usage is also of great importance. The

above-mentioned issues were the real drivers behind the

research introduced in the field of web data mining. At

the present time, this field of research is becoming much

more mature and more specialized. Web usage mining

is a special type of web data mining which also includes

web content mining, web structure mining, and even

web opinion mining.

This paper introduces a methodology for classifying

users and pages of online hypermedia using a hybrid

clustering technique. At the core of this clustering

technique is a Self Organizing Map (SOM) neural

network. The paper also implements a cluster validation

and interpretation analysis in order to substantiate the

resulting clusters. Section 2 is a review of the literature

related to the application of web usage mining for

classifying online pages and users. Section 3 introduces

the SOM-based methodology for extracting patterns of

users and associated pages. Section 4 presents the

experimental results. Finally, section 5 presents future

work and concludes the paper.

II. Web Usage Mining for Classifying

Pages and Users

A common direction for most of the recent work

conducted on online hypermedia is to provide user-

centred and task-specific information to end users [5,6].

This direction was adopted by many state-of-the-art

hypermedia systems such as El-Tech [7], IPM [8],

MMA [9], and ADAPTS [10]. Users of online

hypermedia have different levels of knowledge,

expertise, and qualifications, and also different goals

and objectives. Many researchers such as those in [4,5]

argue that hypermedia authoring should be user-centric

rather than author-driven. As a result, data mining

techniques were utilized with many web-based

hypermedia systems to advise users and manage the way

they navigate through the online material [11,12].

Moreover, the information that is presented to users has

to vary in its focus, level of detail, and presentation

format. It has to be adapted to the information needs of

the users [13]. Therefore, the main objective of many

web-related research endeavors was to identify the

similarity or the strength of association between web

users and/or between pages in a quantitative manner.

Many mathematical models were utilized in order to

measure the similarity relationship between pages or

users.

Recognizing the similarity between users is part of an

ongoing research into user modeling which aims at

identifying the knowledge, interest, goal, or preferences

of users of a web site(s). User models are the

representation of the user’s state of mind [14]. Modeling

users can be used for personalizing and/or customizing

web sites to match the needs of individual users or a

group of users, respectively. In addition, user models

are an essential requirement for adaptive hypermedia

systems [15], which aim at delivering user-customized

information that is concise, specific, relevant, and easy

to understand. Adaptive hypermedia is defined in [16]

as “all hypermedia systems which reflect some features

of the user in the user model and apply this model to

adapt various visible aspects of the system to the user”.

Adaptive hypermedia attempts to solve several

problems associated with the static design of

hypermedia including author-enforced structures,

cognitive overload, disorientation in hyperspace, and

one-size-fits-all material [17]. As mentioned earlier,

many different features of the user can be used to adapt

the conveyed information including user knowledge,

goal, background, experience, preferences, etc. The first

two are the most commonly used features of the user in

adaptive hypermedia and they are, along with other

features, encapsulated in a user model. Moreover,

adaptive online hypermedia systems that improve their

organisation and presentation by learning from visitor

access patterns are reported in [18,19]. A common

characteristic of these systems is that they apply data

mining techniques to users’ access logs, which record

the behaviour of every user within the web site, in order

to fine-tune the web site and its information to the users’

needs.

In addition to user modelling, recognizing the

similarity between web pages is mainly an investigation

into the classification, clustering and organisation of

online pages [12, 20]. The classification and clustering

of online pages is normally associated with supervised

and unsupervised machine learning techniques,

respectively. Data mining techniques and in particular

clustering techniques are used to classify pages with

regard to their content, link structure, and usage [21,22].

Finding the content-based similarity between web pages

is very useful for categorisation of pages and topic

identification for rapid and accurate information

retrieval [23]. The content of pages can be text based

and any other type of media such as images, audio, and

video. In addition, similarity of pages can be assessed

with regard to their existing link structures which

include incoming and outgoing links. Pages that

reference the same set of pages can be considered

similar and on the other hand, pages that are referenced

by the same set of pages can also be considered similar

[24]. Moreover, usage-based similarity is concerned

with tracking the users’ interaction with the hypermedia

in order to identify the association between pages [12].

As a result, web data mining can be divided into

three main types: web usage mining, web

content mining, and web structure mining. Although it is

beyond the scope of this work, it is worth mentioning

for completeness that an emerging research direction

related to web data mining is web opinion mining,

which is mostly associated with users of social

networks. According to Romero & Ventura (2007), web

data mining can be categorised as (i) clustering,

classification, and outlier detection, (ii) association rule

and sequential pattern mining, and (iii) text mining.

Normally, the content, the presentation, and the link

structure of online pages are designed and generated by

the author of the hypermedia. In contrast, the usage of

pages is influenced by the users of the hypermedia

who generate navigational paths through the

hypermedia which reflect patterns that need to be

discovered. These navigational paths were described by

[23] as the “footprints” of the users stored in log files.

Extracting and interpreting these patterns or footprints

generated by users of online hypermedia systems is the

main objective of web usage mining [25]. Web usage

mining is defined by [26] as the discovery and analysis

of patterns in data collected from the users’ interactions

with the website. The navigation of every user accessing

any web site is registered at the web server log file(s).

These are huge time-stamped files that record all details

of every interaction for every user with the web site.

Extracting meaningful patterns from these files can

support the development of online recommender

systems, improve the performance of information

retrieval systems, enable the building of online

user-adaptive systems, guide the re-structuring of web

sites in a user-centred manner, identify customers’

interests, habits, and preferences for online e-commerce

systems, optimise network performance and the

configuration of web/proxy servers, and serve many other

applications. The most popular application for web

usage mining is in educational hypermedia

[11,13,23,27] where logs are gathered for students

interacting with online educational material such as

tutorials, documentation, lessons, etc.

Kohonen’s Self Organizing Map (SOM) [28] is a

competitive artificial neural network (ANN) that is

classified as an unsupervised machine learning

technique. Many web data mining approaches have used

SOM as the main clustering technique [29], and in

particular for web usage mining [30,31]. As far as this

research is concerned, SOM was chosen as the primary

clustering technique because it is an unsupervised

learning technique which suits the nature of the

clustered data. It also enables the visualization of the

clustered data, normally, as a 2-D grid of location

sensitive clusters. SOM can deal with high-dimensional

data and map the results into a low-dimensional space

such as a user friendly grid topology that preserves the

spatial autocorrelation between clusters. It can also

handle a large number of clusters, reaching as

many as hundreds of generated clusters. SOM is very

suitable for generating recommendations to users

because it implements a neighborhood-based

organization of clusters where data vectors can belong

to a certain cluster and, although not as strong, still have

an associative relationship with other vectors in

neighboring clusters. To the best of our knowledge, the

LOGSOM system [31] is the work most related to our

research because it proposes a web usage classification

method that utilizes k-means clustering combined with

SOM. In contrast, our work utilises SOM to cluster

users while LOGSOM uses SOM to cluster pages.

As mentioned in [32], the existing clustering

techniques do not provide an indication of the quality of

their outcome. The most important steps in any

clustering technique are the validation and interpretation

of the resulting clusters. Validation and interpretation of

clusters are concerned with generating quality valid

clusters and explaining the outcome of these clusters,

respectively. The validation of clusters resolves certain

issues related to the clustering performance such as

identifying the best number of clusters to generate,

detecting the fitness of the resulting clustering scheme

to the data set, recognizing the suitability of the

partitioning for the data set, etc [32]. The silhouette

measure is used by many researchers to overcome the

clusters validation issue [33,34]. It provides a robust and

quantitative measure of how well each data point fits

within its assigned cluster and, hence, it can validate the

overall clustering procedure. On the other hand,

clusters’ interpretation is concerned with analyzing the

resulting clusters in order to identify useful patterns and

formulate trends. Most researchers deploy statistical

methods in order to interpret the clustering results [24].

In addition, the visualization of the resulting clusters,

where applicable, is a very useful aid in detecting

hidden patterns within the data [35]. As far as our work

is concerned, a combination of statistical and data

visualization techniques is used in order to interpret the

resulting clusters.

Fig. 1. Methodology for extracting usage patterns using SOM

[Figure not reproduced. Labels in the figure: Log File; user/page binary matrix construction; user/page matrix; k-means clustering of pages; clusters validation; clusters of pages; data normalization; normalized matrix; SOM clustering of users; clusters validation; clusters of users; Interpretation of SOM clusters; users prototypes; lists of associated pages.]

III. SOM-Based Methodology for

Extracting Usage Patterns

The methodology used to extract patterns of

users and pages from web usage data is depicted in

Fig. 1. The methodology consists of three

main phases: (i) pre-SOM preparation,

(ii) SOM clustering of users, and (iii) post-SOM

analysis. The pre-SOM preparation phase

comprises all tasks included within the dashed

polygon in Fig. 1, which aim at preparing the log

file data for efficient SOM clustering. This

includes constructing a user/page navigation

matrix, reducing the dimension of this matrix

using k-means clustering, and data normalization.

The post-SOM analysis aims at interpreting the

generated SOM clusters in order to identify

prototypes of users and to extract lists of relevant

and associated pages based on the usage data

extracted from the log file.

The methodology also highlights the need for

validating the results of every clustering step

before proceeding to the next phase. The clusters

validation precedes every clustering step in order

to identify the best clustering parameters to adopt.

III.1 Data pre-processing for applying SOM

Online usage data come in huge textual log file(s)

which require extensive offline pre-processing

before applying any classification technique. The

pre-processing phase includes data cleansing,

reduction, transformation, normalization, and

modeling. In addition, individual users, pages and

users’ sessions must be identified.

The data used in this research is concerned with

students’ usage of an online hypermedia-based

tutorial of the Java© programming language

courses taught at the University of Jordan

including Object Oriented Programming I and II in

the spring semester of 2009/2010. The students

undertaking these courses were brought to special

Lab sessions and were requested to navigate

through the Java tutorial. These sessions were

approximately one hour each, and all the students’

interactions were saved in centralised log files.

These files consist of students’ past usage records,

which register attributes such as the usage time,

the IP address, the requested URL, the request

method, the transport protocol, etc. As far as this

work is concerned, only the first three attributes were

used: usage time, IP address, and

requested URL. As a result of the controlled Lab

sessions, individual users were easily identified by

their unique IP addresses and users’ sessions were

also identified by the date and time of every Lab

session.

III.1.1 Building the User/Page Navigation Matrix

The centralised log files were gathered and

processed in order to construct the user/page

navigation matrix. The collected data consist of

438 different users accessing 195 distinct pages.

These users generated a total of 8441 transactions.

Table I illustrates the constructed user/page matrix

which is of dimension 438x195, where Ui refers to

user i and Pi refers to page i.

TABLE I
USER/PAGE MATRIX

        U2    U3    U4    U5    …    U438
P1      1     0     1     1     …    1
P2      1     0     0     1     …    0
P3      0     0     1     1     …    1
…       …     …     …     …     …    …
P195    0     0     1     1     …    0

The constructed user/page matrix is a binary

matrix where 1 indicates a “visit” and 0 indicates a

“no visit”. For instance, the matrix depicted in

Table I shows that U1 has visited P1 and did not

visit P2, P3 and so forth. It is difficult to apply the

SOM technique using this user/page matrix as an

input, because of its high dimensionality. It would

require the number of input neurons to

equal the number of pages, i.e. 195, which

is a very high number of SOM input neurons.

This will distort the SOM and will yield results

that are difficult to interpret. Moreover, the

required processing time will be unacceptable

especially if the number of pages is more than 195

which is the case with many existing web sites.

Therefore, an initial classification phase is needed

in order to group similar pages together and enable

us to deal with groups of similar pages rather than

individual pages. This will significantly reduce the

dimensionality of the user/page matrix and

provide a more suitable data structure for SOM

clustering.
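The construction of the binary user/page matrix described in this subsection can be sketched in Python as follows. This is only an illustration: the log records, IP addresses, and URLs are made-up values, and the helper name `build_user_page_matrix` is hypothetical. Rows are users and columns are pages, following the stated 438x195 dimension, whereas Table I displays pages as rows.

```python
import numpy as np

# Hypothetical log records: (usage time, IP address, requested URL).
# All values are invented for illustration only.
records = [
    ("2010-03-01 10:00:05", "10.0.0.1", "/java/intro.html"),
    ("2010-03-01 10:00:09", "10.0.0.2", "/java/loops.html"),
    ("2010-03-01 10:01:12", "10.0.0.1", "/java/loops.html"),
    ("2010-03-01 10:02:40", "10.0.0.1", "/java/intro.html"),  # repeat visit
    ("2010-03-01 10:03:02", "10.0.0.3", "/java/classes.html"),
]


def build_user_page_matrix(records):
    """Return (users, pages, M) where M[u, p] is 1 if user u requested page p."""
    users = sorted({ip for _, ip, _ in records})      # one user per IP address
    pages = sorted({url for _, _, url in records})
    user_index = {ip: i for i, ip in enumerate(users)}
    page_index = {url: j for j, url in enumerate(pages)}
    matrix = np.zeros((len(users), len(pages)), dtype=np.uint8)
    for _, ip, url in records:
        matrix[user_index[ip], page_index[url]] = 1   # repeat visits stay 1
    return users, pages, matrix


users, pages, matrix = build_user_page_matrix(records)
print(pages)
print(matrix)
```

Identifying users by IP address mirrors the controlled lab setting described above; in an uncontrolled setting a different user identification step would be needed.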

III.1.2 K-Means for Initial Classification of Pages

All visited pages are classified using the k-means

clustering technique in order to group individual

pages into k classes of pages. Due to its simplicity,

k-means clustering technique is widely used in

most hybrid clustering systems as an initial

classification technique [22,31]. This step is

normally followed by another clustering phase

using a more robust clustering technique.

As shown in Table I above, the data used for

clustering the pages has a significant number of

zero-valued entries, where zero indicates “no

visit”. Clustering this type of data using any

conventional distance function such as the

Euclidean distance, which treats all entries as

equally significant, will yield a distorted result.

This is because the significant information is,

mainly, embedded in the nonzero elements of the

dataset. Moreover, the zero-valued elements

imply, more or less, that there is no information

acquired. Therefore, the clustering results will not be

accurate if the zero and nonzero elements are

given the same significance.

As a result, the distance function selected for

the k-means clustering step was the Hamming

distance function. This function employs the

percentage of elements that differ in order to

measure the similarity between any pair of data

vectors. The Hamming distance function is only

suitable for binary data where it calculates the

frequency of occurrence of the patterns 01 and 10

within the two vectors in comparison, compared to

the overall length of the vectors. In addition, it is

important to decide on the most suitable number

of clusters (k) to use for clustering the dataset.

This is done by comparing the sum of the

silhouette values (SumSil) for all data points using

different values for k. The silhouette value for

each data point is a similarity measure of that

point to points in its assigned cluster compared to

points in the other clusters. The silhouette value

Sil(i) for a data point i is computed in [34] as

follows:

$$ \mathrm{Sil}(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \qquad (1) $$

where a(i) is the average distance of data point i from

all other points within its assigned cluster, and b(i) is

the average distance of data point i from all other points

in the cluster nearest to its own cluster. The lower

the value of a(i) and the higher the value of b(i), the

better i fits within its own cluster, and vice versa. As

deduced from equation (1) above, the silhouette value

for any data point i will range from +1, indicating points

that are very distant from neighbouring clusters, through

0, indicating points that are not distinctly in one cluster

or another, to -1, indicating points that are probably

assigned to the wrong cluster. From the above

description it is clear that:

$$ -1 \le \mathrm{Sil}(i) \le +1 \qquad (2) $$
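The Hamming distance and the silhouette value described above can be sketched in Python as follows. The sketch assumes the standard silhouette definition, Sil(i) = (b(i) - a(i)) / max(a(i), b(i)), which matches the stated range from -1 to +1. The function names, the toy page vectors, and the cluster labels are illustrative assumptions, and returning 0 for a singleton cluster is a common convention rather than something stated in the paper.

```python
import numpy as np


def hamming(u, v):
    """Fraction of positions at which two equal-length binary vectors differ."""
    u, v = np.asarray(u), np.asarray(v)
    return float(np.mean(u != v))


def silhouette(i, data, labels, dist=hamming):
    """Silhouette value of the point at index i, given one label per point."""
    labels = np.asarray(labels)
    own = labels[i]
    others = [c for c in set(labels.tolist()) if c != own]
    if not others:
        raise ValueError("the silhouette needs at least two clusters")
    same = [j for j in np.flatnonzero(labels == own) if j != i]
    if not same:
        return 0.0  # singleton cluster: a(i) is undefined, 0 by convention
    # a(i): average distance to the other points of the point's own cluster.
    a = np.mean([dist(data[i], data[j]) for j in same])
    # b(i): average distance to the points of the nearest other cluster.
    b = min(
        np.mean([dist(data[i], data[j]) for j in np.flatnonzero(labels == c)])
        for c in others
    )
    denom = max(a, b)
    return 0.0 if denom == 0 else float((b - a) / denom)


# Toy example: four pages described by the visits of four users.
pages = [[1, 1, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [0, 0, 1, 0]]
labels = [0, 0, 1, 1]
print([round(silhouette(i, pages, labels), 3) for i in range(len(pages))])
```

For the first page, a = 0.25 and b = 0.875, so its silhouette value is 0.625 / 0.875, roughly 0.714.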

By comparing the sums of the silhouette values of the

whole dataset for different numbers of clusters (k), we

can determine the best number of clusters (k)

for this particular dataset. The sum of silhouette

values for k clusters, SumSil(k), is defined as:

$$ \mathrm{SumSil}(k) = \sum_{i=1}^{D} \mathrm{Sil}(i) \qquad (3) $$

where Sil(i) is as defined in equation (1), and D is the

number of data points in the dataset. The higher the

SumSil the more suitable the number of clusters (k)

used. In theory, the highest positive value that the

SumSil can reach is equal to the number of data points

in the clustered dataset (+D), and this can only occur

when all data points score +1, and vice versa.

Considering the dataset used in this work where 195

pages are clustered, the highest value that SumSil can

reach is +195 (perfect clusters) and the lowest is -195

(completely wrong clusters). Table II shows the sum of

the silhouette values (SumSil) compared to different

number of k-means clusters (k). The SumSil values in

the table indicate that the most suitable number of

clusters, k, for this particular dataset is 3 clusters where

SumSil is the highest.

TABLE II
SUMSIL VS K

k         3       4       5       6       7
SumSil    72.1    41.5    40.9    47.5    37.2
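As a small worked example, the selection of k from Table II can be reproduced in Python. The SumSil values are copied from the table; dividing by the 195 clustered pages to obtain a mean silhouette per page is an added convenience for reading the numbers, not a step described in the paper.

```python
# SumSil values copied from Table II, keyed by the number of clusters k.
sumsil_by_k = {3: 72.1, 4: 41.5, 5: 40.9, 6: 47.5, 7: 37.2}
n_pages = 195  # number of clustered data points (D)

# The most suitable k is the one with the highest SumSil.
best_k = max(sumsil_by_k, key=sumsil_by_k.get)

# Mean silhouette per page, bounded by -1 and +1 like Sil(i) itself.
mean_sil = {k: s / n_pages for k, s in sumsil_by_k.items()}

print(best_k)
print({k: round(m, 3) for k, m in mean_sil.items()})
```

With k = 3 the mean silhouette is about 0.37, well below the theoretical maximum of +1 that corresponds to SumSil = +195.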

III.1.3 Data normalization to improve clustering

Data normalization is an important prerequisite that

prepares the data resulting from the k-means phase for

better and more efficient SOM

clustering. It consists of two main

steps: (i) converting the user/page matrix into a

user/group-of-pages matrix, and (ii) percentage-based

normalization of the resulting users’ navigational

vectors.

Recall that the initial k-means clustering phase has

classified all pages into three clusters (Table II) or

groups of pages (GoP). These groupings are used to

reduce the dimensionality of the user/page matrix and to

convert it to user/GoP matrix. This matrix is constructed

by calculating Vij, which is the total number of visits

generated by user i, to all pages that belong to the group

of pages j (GoPj).

$$ V_i = \sum_{j=1}^{k} V_{ij} \qquad (4) $$

where Vi is the total number of visits generated by user

i. Hence, the overall navigation of any user i can be

described in terms of the vector $\vec{V}_i$, where:

$$ \vec{V}_i = \langle V_{i1}, V_{i2}, V_{i3}, \ldots, V_{ik} \rangle \qquad (5) $$

As shown in Table III, the user/GoP matrix is the

concatenation of all vectors $\vec{V}_i$. This matrix has a number

of columns equal to the number of users and a

number of rows equal to the number of GoPs.

TABLE III
USER/GoP MATRIX

        U1    U2    U3    U4    U5    …    U438
GoP1    1     21    11    0     11    …    1
GoP2    4     7     0     0     2     …    0
GoP3    6     0     0     25    5     …    2
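The conversion from the binary user/page matrix to the user/GoP matrix can be sketched in Python as follows. The function name `user_gop_counts`, the toy matrix, and the page-to-group assignment are illustrative assumptions. The sketch keeps users as rows, so its result is the transpose of the layout shown in Table III.

```python
import numpy as np


def user_gop_counts(user_page, page_groups, k):
    """V[i, j] = number of pages visited by user i that belong to group j."""
    user_page = np.asarray(user_page)
    page_groups = np.asarray(page_groups)
    n_pages = user_page.shape[1]
    # One-hot membership matrix: membership[p, j] is 1 if page p is in group j.
    membership = np.zeros((n_pages, k), dtype=int)
    membership[np.arange(n_pages), page_groups] = 1
    return user_page @ membership


# Toy data: 2 users, 5 pages, and a hypothetical k-means result with k = 3.
user_page = [[1, 1, 0, 1, 0],
             [0, 0, 1, 1, 1]]
page_groups = [0, 0, 1, 2, 2]  # group index assigned to each page

V = user_gop_counts(user_page, page_groups, k=3)
print(V)              # visits per user and group
print(V.sum(axis=1))  # total visits per user
```

Because every page belongs to exactly one group, the row sums equal the total number of pages each user visited.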

The total number of visits made by a user does not

fully express the actual navigational pattern of this user.

Alternatively, it would be more appropriate to use the

total number of visits made by a user to a GoP

compared to the overall visits made by the same user to

all other GoPs in a single session. Therefore, using the

percentage of the user’s visits to a group of pages in

comparison with his/her visits to other groups is more

representative of the overall navigational pattern of this

particular user.

For instance, suppose that we have three GoPs

and that user A has generated 21, 7, and 0 visits to GoP1,

GoP2, and GoP3, respectively. Hence, the vector that

represents the user’s navigation is <21, 7, 0>. If

user B has generated the vector <3, 1, 0>,

then calculating the distance between these two vectors

with any standard distance function

would result in a comparatively high value, i.e. low

similarity between these two navigational patterns.

However, the percentage-normalized values of these two

vectors are the same vector, <0.75, 0.25,

0>. Applying any distance function to these two vectors

would yield a distance of zero. Hence, the two

navigational patterns are identical, which is the

case when considering navigational “patterns” rather

than “counts of visits”. Table IV illustrates the

normalized user/GoP matrix, which is a normalized

version of Table III, above. Every column in the table

represents the navigational pattern of an individual user

which is described using a 3-D vector of visits

percentage.

TABLE IV

NORMALISED USER/GoP MATRIX

        U1      U2      U3      U4      U5      ……    U438
GoP1    0.09    0.75    1       0       0.61    ……    0.33
GoP2    0.36    0.25    0       0       0.11    ……    0
GoP3    0.55    0       0       1       0.28    ……    0.67
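The effect of the percentage normalization on the example of users A and B can be checked with a minimal Python sketch. The function name `normalize_visits` and the handling of a user with no visits are assumptions made for illustration; they are not taken from the paper.

```python
from math import dist


def normalize_visits(counts):
    """Convert raw visit counts per GoP into fractions of the user's total visits."""
    total = sum(counts)
    if total == 0:
        # Assumption: a user with no visits keeps an all-zero vector.
        return [0.0] * len(counts)
    return [c / total for c in counts]


user_a = [21, 7, 0]  # visits of user A to GoP1, GoP2, GoP3
user_b = [3, 1, 0]   # visits of user B to GoP1, GoP2, GoP3

# Euclidean distance between the raw count vectors is large.
raw_distance = dist(user_a, user_b)

# After normalization both users share the same navigational pattern.
norm_a = normalize_visits(user_a)
norm_b = normalize_visits(user_b)
normalized_distance = dist(norm_a, norm_b)

print(raw_distance, norm_a, norm_b, normalized_distance)
```

The raw vectors are far apart only because user A generated more visits overall; once each vector is divided by its own total, the distance between them is zero.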

III.2 Classification of Users Using SOM

A SOM neural network consists mainly of two layers, namely the input layer and the output layer. Every neuron in the input layer is connected to all neurons in the output layer. Although the output layer can be a 1-D row of neurons or a 3-D mesh, it is normally a 2-D lattice or grid of output neurons. As illustrated in Fig. 2, the SOM’s input neurons are fully connected to the output neurons using weighted connections. This topology enables the projection of the input patterns onto a 2-D grid of output neurons. The location of the winning output neuron corresponds to a particular feature of the input pattern. The neurons in the pre-defined neighborhood of the winning neuron also adapt to the input pattern, although to a lesser degree.

Fig. 3 outlines the pseudo code of the SOM used in this study. The algorithm consists of two phases: (i) SOM initialization and (ii) SOM learning. After a pre-defined number of learning cycles, the topology of the output layer will eventually correspond with the principle of spatial autocorrelation, which implies that the closer the output neurons are to each other, the more similar they are and the more they resemble similar input patterns. In clustering terminology, the spatial autocorrelation of the SOM provides a multi-level ranking of the suitability of clusters for the corresponding input patterns.

The data set used to train the SOM is the one shown in Table IV. Unlike the first phase of clustering, in which k-means was used to classify pages, the SOM is used here to classify users. All users are projected onto a 2-D grid according to their navigation patterns.
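The two phases outlined in Fig. 3 can be illustrated with a small, self-contained Python sketch. It is not the implementation used in this study: the linear decay of the learning rate and of the neighborhood radius, the Gaussian neighborhood function, the function names `train_som` and `best_matching_unit`, and the toy data are all assumptions chosen for illustration.

```python
import math
import random


def best_matching_unit(weights, point):
    """Index of the neuron whose weight vector is closest (Euclidean) to the point."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], point))


def train_som(data, rows, cols, epochs, alpha_start=0.9, alpha_end=0.02, seed=0):
    """Train a rows x cols SOM on equal-length vectors and return the neuron weights."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Phase 1: initialization with small random weights in [0, 1].
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    # Start from the largest distance between two grid cells (at least 1).
    radius_start = max(1.0, math.dist((0, 0), (rows - 1, cols - 1)))
    total_steps = epochs * len(data)
    step = 0
    # Phase 2: learning.
    for _ in range(epochs):
        for point in data:
            frac = step / max(total_steps - 1, 1)
            # Assumed linear schedules: alpha -> alpha_end, radius -> 1.
            alpha = alpha_start + (alpha_end - alpha_start) * frac
            radius = radius_start + (1.0 - radius_start) * frac
            winner = best_matching_unit(weights, point)
            win_row, win_col = divmod(winner, cols)
            for idx, w in enumerate(weights):
                row, col = divmod(idx, cols)
                grid_d = math.dist((row, col), (win_row, win_col))
                if grid_d <= radius:
                    # The winner gets the full update, its neighbors a smaller one.
                    influence = math.exp(-(grid_d ** 2) / (2 * radius ** 2))
                    for k in range(dim):
                        w[k] += alpha * influence * (point[k] - w[k])
            step += 1
    return weights


# Toy input: a few normalized user vectors in the style of Table IV.
data = [[0.75, 0.25, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.09, 0.36, 0.55]]
weights = train_som(data, rows=3, cols=3, epochs=20)
cells = [best_matching_unit(weights, p) for p in data]
print(cells)
```

Each user is assigned to the grid cell of its winning neuron, so users that end up in the same or in adjacent cells have similar navigational patterns.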

III.3 Interpretation of the SOM Clusters

In general, the interpretation of clusters is a troublesome process that requires a good understanding of the nature of the original data. The method deployed here is based on sorting the original dataset, represented by the user/page matrix of Table I, according to the clusters resulting from the SOM classification. The following hypothesis, which outlines the concept of page usage similarity, is adopted for interpreting the resulting SOM clusters.

1. Initialization
1.1 Initialize the network
1.1.1 Set the number of input neurons
1.1.2 Set the number of output neurons
1.1.3 Decide on the topology of the output layer
1.2 Initialize neurons’ weights
1.2.1 Initialize neuron weights to small values (say, in the interval [0,1])
1.2.2 Set the learning rate parameter α
1.3 Decide on the appropriate distance function DF
1.4 Decide on the appropriate neighborhood function NF
1.5 Set the number of learning cycles (Epochs) Eps
2. Learning
2.1 epoch = 1
2.2 iteration = 1
2.3 Apply data point di
2.4 Calculate the distance (D) from di to all neurons using DF
2.5 Identify the winning neuron Nw where D is minimum
2.6 Strengthen the weights of Nw towards di using α
2.7 Strengthen (although not as much) the weights of all neighboring neurons NN using NF
2.8 While iteration < number of data points → increment iteration, go to 2.3
2.9 While epoch < Eps → increment epoch and go to 2.2

Fig. 2. SOM topology

Fig. 3. SOM pseudo code

Hypothesis: Pages are deemed similar, related,

associated and relevant to each other if they are

frequently visited together in a single session by

many users. The more users visit the same

collection of pages the stronger the similarity

relationship between these pages.

The cluster interpretation technique used is

explained using the following example.

Let us assume that users U1, U2, and U3 have been

assigned to the same cluster (C1) and their

navigation patterns over 8 pages (P1...P8) are

summarized as follows:

C1       P1   P2   P3   P4   P5   P6   P7   P8    Non-zero average
U1        1    0    1    1    0    0    1    1
U2        1    0    0    1    0    0    1    0
U3        1    1    0    0    0    0    1    1
Total     3    1    1    2    0    0    3    2    2

Users 1, 2 and 3 have similar navigational patterns because they belong to the same SOM cluster. Under the above hypothesis, P1 and P7 are strongly associated with each other because they were visited by all users of C1. To comment on the association of the remaining pages, a quantitative measure of the degree of association is needed.

As shown in the above example, the total number of visits made to every page by the users belonging to this cluster is calculated. A cluster average over all “nonzero” visit totals is also calculated; here it equals (3 + 1 + 1 + 2 + 3 + 2) / 6 = 2. To determine the degree of association between pages, it is suggested that all pages whose visit total is greater than or equal to this average be considered associated. For the users of this cluster, P1 and P7 are associated with an association score of 3, while P4 and P8 are also associated but with an association score of 2. The remaining pages are not considered associated because they score less than the cluster average.

On the other hand, it can be deduced that users of

this cluster are highly interested in Pages P1 and

P7, and although not as much, they are also

interested in pages P4 and P8. This method was

adopted for the analysis of the main clusters

generated by the SOM.
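The interpretation procedure of the worked example can be reproduced with a short Python sketch. The function name `associated_pages` and the data layout (one list of per-page visit counts per user) are assumptions made for illustration.

```python
def associated_pages(cluster_visits, page_names):
    """Return the pages whose visit total reaches the non-zero cluster average.

    cluster_visits holds one row of per-page visit counts for each user
    of a single cluster.
    """
    totals = [sum(column) for column in zip(*cluster_visits)]
    nonzero = [t for t in totals if t > 0]
    if not nonzero:
        return {}, 0.0
    average = sum(nonzero) / len(nonzero)
    # A page is considered associated if its total is >= the cluster average;
    # the total itself serves as the association score.
    scores = {page: t for page, t in zip(page_names, totals) if t >= average}
    return scores, average


pages = ["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"]
cluster_c1 = [
    [1, 0, 1, 1, 0, 0, 1, 1],  # U1
    [1, 0, 0, 1, 0, 0, 1, 0],  # U2
    [1, 1, 0, 0, 0, 0, 1, 1],  # U3
]
scores, average = associated_pages(cluster_c1, pages)
print(average, scores)
```

For cluster C1 this gives an average of 2 and the associated pages P1 and P7 (score 3) and P4 and P8 (score 2), in agreement with the example above.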

IV. Experimental Results

IV.1 SOM Validation

Different dimensions for the SOM’s grid topology were tested using different numbers of training cycles (epochs). The sum of the silhouette values (SumSil) was used to compare the results and to verify the suitability of different grid arrangements and training epochs. Table V shows the sum of the silhouette values generated using different dimensions of the output grid at different epochs. Since the number of data points is 438 and each silhouette value lies between -1 and +1, the maximum value for SumSil is +438 (very good) and the minimum is -438 (very bad). The oval-surrounded cells in the table represent the highest SumSil value within every grid arrangement (307.8 at 100 epochs for 10x10, 359.4 at 40 epochs for 15x15, and 375.9 at 60 epochs for 21x21). Overall, these values indicate that the best arrangement for the SOM’s output grid is 21x21 trained using 60 epochs.
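A sum of silhouette values can be computed as in the following Python sketch, which follows the standard silhouette definition with the Euclidean distance. Assigning a silhouette of 0 to points in single-member clusters is the usual convention and is an assumption here, since the paper does not state how such clusters were treated; the function name `silhouette_sum` and the toy data are likewise hypothetical.

```python
from math import dist


def silhouette_sum(points, labels):
    """Sum of the silhouette values of all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1 or len(clusters) == 1:
            continue  # silhouette taken as 0 in these cases
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_sum(points, [0, 0, 1, 1])  # compact, well separated clusters
bad = silhouette_sum(points, [0, 1, 0, 1])   # clusters that mix distant points
print(good, bad)
```

With four points the sum is bounded by -4 and +4; the well separated labelling scores close to the upper bound, while the mixed labelling scores below zero.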

As a result, the SOM used in this work consists of 3 input neurons and a 21x21 output grid with a total of 441 output neurons. It utilizes an adaptive training strategy in which the learning rate and the neighborhood distance are decreased from an initial maximum value towards a pre-set final minimum value in order to fine-tune the network. The learning rate α is initially set to 0.9 and is adapted towards 0.02 by the end of the learning phase. Similarly, the neighborhood distance is initially set to the maximum distance between two neurons and decreases towards 1. The distance function used is the well-known Euclidean distance.

Fig. 4 represents a 3-D plot of the 438 data points (red dots) overlaid with the 441 SOM neurons (line-connected blue stars) after training over 60 epochs. The figure shows the distribution of the SOM output neurons compared to the distribution of the original dataset. It can be noticed that after 60 training epochs the SOM neurons have rearranged themselves around the training data points and fully cover the data. The generated 21x21 grid of output neurons is depicted in Fig. 5, where spatial closeness resembles similarity. The cells in the grid are numbered from left to right, starting with cell 1 (top left corner) and finishing with cell 441 (bottom right corner).
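The cell numbering can be made concrete with a small Python sketch that converts a cell number into a grid position and lists its adjacent cells. It assumes row-by-row numbering with 21 cells per row and an 8-cell neighborhood; the function names are hypothetical.

```python
def cell_to_row_col(cell, cols=21):
    """Map a 1-based cell number to a 1-based (row, column) position."""
    row, col = divmod(cell - 1, cols)
    return row + 1, col + 1


def neighbor_cells(cell, rows=21, cols=21):
    """Cell numbers adjacent to the given cell (assumed 8-cell neighborhood)."""
    row, col = cell_to_row_col(cell, cols)
    result = []
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            r, c = row + d_row, col + d_col
            inside = 1 <= r <= rows and 1 <= c <= cols
            if (d_row, d_col) != (0, 0) and inside:
                result.append((r - 1) * cols + c)
    return result


print(cell_to_row_col(1), cell_to_row_col(21), cell_to_row_col(428))
print(neighbor_cells(1))
```

Under these assumptions cell 1 is the top left corner, cell 21 the top right corner, and cell 428 lies in the bottom row; the only cells adjacent to cell 1 are cells 2, 22, and 23.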

Fig. 4. SOM’s 441 output neurons after 60 epochs compared to training data

TABLE V

SUM OF SILHOUETTE VALUES OF DIFFERENT GRID DIMENSION AND EPOCHS

Grid \ Epochs    10       20       40       60       80       100      120
10x10            274.5    297.1    281.6    284.6    306.1    307.8    279.3
15x15            324.1    328.9    359.4    351.4    329.9    323.5    343.9
21x21            364.2    364.1    367.8    375.9    364.3    358.1    368.8
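The selection of the best configuration from Table V can be retraced with a few lines of Python. The mean silhouette per data point (SumSil divided by the 438 data points) is added here only as a convenient normalized reading of the same values; it is not reported in the paper, and the names used are hypothetical.

```python
sumsil = {
    "10x10": [274.5, 297.1, 281.6, 284.6, 306.1, 307.8, 279.3],
    "15x15": [324.1, 328.9, 359.4, 351.4, 329.9, 323.5, 343.9],
    "21x21": [364.2, 364.1, 367.8, 375.9, 364.3, 358.1, 368.8],
}
epochs = [10, 20, 40, 60, 80, 100, 120]
N_POINTS = 438  # number of users (data points)


def best_per_grid(table, epoch_values):
    """Highest SumSil and the matching epoch count for every grid size."""
    return {grid: max(zip(values, epoch_values)) for grid, values in table.items()}


best = best_per_grid(sumsil, epochs)
overall_grid = max(best, key=lambda grid: best[grid][0])
overall_value, overall_epochs = best[overall_grid]
mean_silhouette = overall_value / N_POINTS
print(best, overall_grid, overall_epochs, round(mean_silhouette, 3))
```

The overall maximum is the 21x21 grid at 60 epochs, which corresponds to a mean silhouette of about 0.858 per user.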

Fig. 5. The resulting 21x21 SOM output grid after 60 epochs

IV.2 Analysis of SOM clusters

The generated SOM clusters depicted in Fig. 5 are interpreted in two steps. Firstly, the 2-D output grid of clusters is inspected and analyzed visually. The grid shows that the most significant clusters are the three clusters marked with oval shapes, because they contain the highest numbers of students. These clusters are located in cells 1, 21, and 428 of the SOM grid and contain 15, 74, and 47 students, respectively (136 of the 438 students in total). These clusters represent the overall trend of the students’ navigation through the tutorial. In addition, each cluster represents a user prototype, and it can therefore be suggested that the users of the Java tutorial are mainly of three prototypes. The grid also has a substantial number of cells containing only one or two students. These can be interpreted as students generating random clicks over the tutorial and/or students with no defined interest in any particular tutorial topic. Perhaps these students did not take the exercise seriously, or they had an individual topic of interest that did not match the interests of the overall student population.

Secondly, the statistical procedure described in Section III.3 was used to interpret the resulting grid of clusters. The three main clusters of the SOM grid depicted in Fig. 5 were analyzed in more detail. Table VI shows all the pages frequently visited by students of cluster 1 and scoring above the average (average = 3.1). All of these pages are sections of chapters 8, 9, and 10 of the tutorial, which the navigation of the students of cluster 1 shows to be associated. These chapters are also believed to be related because the author of the tutorial organized them in succession. In addition, the main topic of this navigational pattern is the “Array Object”, as most pages are either directly relevant to this theme or indirectly support it.

TABLE VI

CLUSTER 1 - ALL PAGES SCORING ABOVE AVERAGE

Chapter Page Name Visits Total

'/chap10_05.html' 'Array length' 6

'/chap10_03.html' 'for loops' 5

'/chap09_06.html' 'Printing an object' 5

'/chap09_13.html' 'Generalization' 5

'/chap09_14.html' 'Algorithms' 5

'/chap09_05.html' 'Creating a new object' 4

'/chap08_12.html' 'Objects and primitives' 4

'/chap09_01.html' 'Class definitions and object types' 4

'/chap10_02.html' 'Copying arrays' 4

'/chap10_04.html' 'Arrays and objects' 4

'/chap10_01.html' 'Accessing elements' 4

'/chap09_09.html' 'Modifiers' 4

'/chap08_01.html' 'What's interesting?' 4

'/chap08_02.html' 'Packages' 4

'/chap08_04.html' 'Instance variables' 4

'/chap09_03.html' 'Constructors' 4

'/chap09_04.html' 'More constructors' 4

'/chap08_11.html' 'Garbage collection' 4

'/chap10_06.html' 'Random numbers' 4

Table VII shows the ten highest ranked pages

out of twenty pages frequently visited by students

belonging to cluster 21 and scoring above the

average (average=21.5). All of these pages are

sections of author-generated chapters 1, 2, and 3

of the tutorial. The main theme of this navigational

pattern can be categorized as “Introduction to Java

programming”.

TABLE VII

CLUSTER 21– HIGHEST 10 RANKED PAGES

Chapter Page Name Visits Total

'/chap01_01.html' 'What is a programming language' 39

'/chap01_03.html' 'What is debugging' 37

'/chap02_02.html' 'Variables' 35

'/chap01_05.html' 'The first program' 34

'/chap02_05.html' 'Keywords' 31

'/chap02_03.html' 'Assignment' 31

'/chap02_04.html' 'Printing variables' 31

'/chap02_01.html' 'More printing' 30

'/chap01_02.html' 'What is a program' 30

'/chap02_06.html' 'Operators' 30

Table VIII shows the fifteen highest-ranked pages out of 56 pages visited by students of cluster 428 and scoring above the average (average=5.8). All of these pages are sections of chapters 14 to 19 of the tutorial. The main theme of this navigation pattern can be categorized as “Advanced data structures”.

TABLE VIII

CLUSTER 428 – HIGHEST 15 RANKED PAGES

Chapter Page Name Visits Total

'/chap17_01.html' 'A tree node' 12

'/chap14_02.html' 'The Node class' 11

'/chap15_03.html' 'The Java Stack Object' 11

'/chap15_05.html' 'Creating wrapper objects' 11

'/chap14_03.html' 'Lists as collections' 11

'/chap19_09.html' 'Performance of resizing' 11

'/chap17_02.html' 'Building trees' 11

'/chap15_02.html' 'The Stack ADT' 10

'/chap18_06.html' 'Definition of a Heap' 10

'/chap14_04.html' 'Lists and recursion' 9

'/chap14_10.html' 'The LinkedList class' 9

'/chap18_01.html' 'The Heap' 9

'/chap15_07.html' 'Getting the values out' 9

'/chap15_01.html' 'Abstract data types' 9

'/chap18_10.html' 'Heap sort' 9

The statistical analysis of the main clusters also uncovered three main prototypes of students. These prototypes are novice, intermediate, and advanced, which correspond to clusters 21, 1, and 428, respectively. Students belonging to cluster 21 were interested in the first three chapters, which are introductory chapters that contain basic information. Their navigation pattern reflects their knowledge of Java, which can be classified as novice. On the other hand, students of cluster 428 were interested in the last six chapters, which contain advanced information and cannot be understood without prior knowledge of Java. Thus, their knowledge of Java can be described as advanced. The navigation pattern of students of cluster 1 showed that they were interested in intermediate tutorial chapters.

A cluster neighborhood analysis shows that clusters 21 and 428 have no direct neighbors. Cluster 1 has three direct neighbors, of which cluster 22 has the largest number of students. Cluster 22 has 31 pages scoring above the average, which include all of the 19 pages of the neighboring cluster 1 (Table VI). The rest of the pages are related to the main topic of cluster 1, which is the “Array Object”. Table IX outlines a sample of pages that are in cluster 22 and not in the neighboring cluster 1. The pages of cluster 22 complement and support the topic of the pages of neighboring cluster 1.

Finally, the group of clusters highlighted by the dark rectangle in Fig. 5 contains a total of 50 students. The main theme of the navigation of this group is a combination of “Java classes and objects” and “Java data structures”. These results are compatible with the themes found for cluster 1 at the top of the grid and cluster 428 at the bottom of the grid.

TABLE IX

PAGES IN CLUSTER 22 AND NOT IN CLUSTER 1

Chapter Page Name Visits Total

'/chap10_08.html' 'Array of random numbers' 4

'/chap08_03.html' 'Point objects' 3

'/chap08_05.html' 'Objects as parameters' 3

'/chap08_06.html' 'Rectangles' 3

'/chap08_07.html' 'Objects as return types' 3

'/chap08_08.html' 'Objects are mutable' 3

'/chap08_09.html' 'Aliasing' 3

'/chap08_10.html' 'null object' 3

V. Conclusion and Future Work

In this paper, we have proposed a hybrid k-means

and SOM clustering technique for classifying

users and extracting the association between

navigated pages. In addition, the paper suggested

techniques for the validation and interpretation of

clusters. Our results show that users were

classified with regard to their navigational patterns

and they were successfully prototyped. The results

also showed that pages can be classified and

associated with each other by grouping similar

user navigation paths and averaging the total hits

within each group. We noticed that data visualization is very useful for the interpretation of the clustering results, as it highlights the major clusters and their neighborhoods and helps to detect users with personal interests. Moreover, the

application of the silhouette measure to compare

and to validate the clustering results was effective.

In future work, we are interested in trying different measurements to evaluate the interest of users and thus the association of pages. The measurement used in this paper is the hit count, whereas other factors may better indicate the interest of a user in a certain page or topic. These may include the time spent reading (TSR) a page, the scrolling time on a page, the link usage within a page, etc. A comparative study is intended in the near future in order to compare results obtained with different measurements of user interest. With regard to user modelling, further research can investigate the application of implicit user models, automatically detected by the system, in providing adaptive information. The application of a real-time recommendation system based on real-time identification of the interests of the end-users can also be further investigated.

References

[1] S. Aikat and D. Aikat, Shared Techniques between Print and Online Documentation, Proc. of the 14th Annual International Conference on Computer Documentation SIGDOC’96, pp. 125-129, Research Triangle Park, NC, USA, 1996.

[2] J. Price, Introduction: Special Issue on

Structuring Complex Information for Electronic

Publication, IEEE Transaction on Professional

Communication, Vol. 40, n. 2, pp. 69-77, 1997.

[3] A. Csinger, K. S. Booth and D. Poole, AI Meets

Authoring: User Models for Intelligent

Multimedia, Artificial Intelligence Review, Vol.

8, n. 5-6, pp. 447-468, 1995.

[4] J. Thibeau, Making Information Work on the World Wide Web, Proceedings of the 43rd Annual Conference of the Society for Technical Communication, pp. 374-378, 1996.

[5] R. Z. Cabada, M. L. B. Estrada, C. A. R. García, EDUCA: A web 2.0 authoring tool for developing adaptive and intelligent tutoring systems using a Kohonen network, Expert Systems with Applications, Vol. 38, n. 8, pp. 9522-9529, August 2011.

[6] A. M. Huneiti, Data Models for Retrieving Task-

Specific and Technicians-Adaptive Hypermedia,

WSEAS Transactions on Computers, Vol. 7, n. 9,

pp. 1495-1504, Sept. 2008.

[7] J.W. Coffey, A. J. Canas, G. Hill, R. Carff, T.

Reichherzer and N. Suri, Knowledge Modelling

and the Creation of El-Tech: a Performance

Support and Training System for Electronic

Technicians, Expert Systems with Applications,

Vol. 25, pp. 483-492, 2003.

[8] D. T. Pham, and R. M. Setchi, Case-Based

Generation of Adaptive Product Manuals,

Proceedings of the Institution of Mechanical

Engineers IMech, Vol. 217 (B), pp. 313-322,

2003.

[9] L. Francisco-Revilla, and F.M. Shipman,

Adaptive Medical Information Delivery

Combining User, Task, and Situation Models,

Int. Conference on Intelligent User Interfaces,

ACM Press, pp. 94-97, 2000.

[10] P. Brusilovsky and D. W. Cooper, ADAPTS: Adaptive Hypermedia for a Web-based Performance Support System, Proceedings of the 2nd Workshop on Adaptive Systems and User Modeling on the WWW, pp. 41-47, May 11-14, 1999.

[11] O. Mustapaşa, A. Karahoca, D. Karahoca, and

H. Uzunboylu, Hello World, Web Mining for E-

Learning, Procedia Computer Science, Vol. 3,

pp. 1381-1387, 2011.

[12] C. Dimopoulos, C. Makris, Y. Panagis, E.

Theodoridis, and A. Tsakalidis, A web page

usage prediction scheme using sequence

indexing and clustering techniques, Data &

Knowledge Engineering, Vol. 69, n. 4, pp. 371-

382, April 2010.

[13] Z. Shen, C. Miao, R. Gay, and C. P. Low,

Personalized e-Learning – a Goal Oriented

Approach, Proceedings of the 7th WSEAS

International Conference on Distance Learning

and Web Engineering (DIWEB '07), pp. 304 –

309, 2007.

[14] P. De Bra, Design Issues in Adaptive Web-Site Development, Proceedings of the 2nd Workshop on Adaptive Systems and User Modelling on the Web, pp. 29-39, 1999.

[15] E. Knutov, P. De Bra, and M. Pechenizkiy, AH

12 years later: a comprehensive survey of

adaptive hypermedia methods and techniques,

New review of hypermedia and multimedia, Vol.

15, n. 1, pp. 5-38, 2009.

[16] P. Brusilovsky, Methods and Techniques of

Adaptive Hypermedia, User Modeling and User-

Adapted Interaction, Vol. 6, pp. 87-129, 1996.

[17] D.W. Cooper, F. P. Veitch, M. M. Anderson and

M. J. Clifford, Adaptive Diagnostic and

Personalised Technical Support (ADAPTS),

Proceedings of the IEEE Aerospace Conference,

Vol.3, pp. 139-149,1999.

[18] X. He, H. Zha, C. H. Q. Ding, and H. D. Simon,

Web Document Clustering using Hyperlink

Structures, Computational Statistics & Data

Analysis, Vol. 41, n. 1, pp. 19-45, 2002.

[19] M. Perkowitz, and O. Etzioni, Towards Adaptive

Web Sites: Conceptual Framework and Case

Study, Artificial Intelligence, Vol. 118, n. 1-2,

pp. 245-275, 2000.

[20] M. H. Chehreghani, H. Abolhassani, M. H.

Chehreghani, Density link-based methods for

clustering web pages, Decision Support Systems,

Vol. 47, n. 4, pp. 374-382, November 2009.

[21] A. Romero, S. Ventura, A. Zafra, and P. De Bra,

Applying Web usage mining for personalizing

hyperlinks in Web-based adaptive educational

systems, Computers & Education, Vol. 53, n. 3,

pp. 828-840, November 2009.

[22] S. Park, N.C. Suresh, and B. K. Jeong, Sequence-

based clustering for Web usage mining: A new

experimental framework and ANN-enhanced K-

means algorithm, Data & Knowledge

Engineering, Vol. 65, n. 3, pp. 512-543, June

2008.

[23] R. Farzan and P. Brusilovsky, Social Navigation

Support in E-Learning: What are the Real

Footprints?, Proceedings of the 3rd Workshop on

Intelligent Techniques for Web Personalization

(ITWP’05), pp. 49-56, 2005.

[24] J. Zhu, J. Hong, and J. G. Hughes, PageCluster: Mining Conceptual Link Hierarchies from Web Log Files for Adaptive Web Site Navigation, ACM Transactions on Internet Technology, Vol. 4, n. 2, pp. 185-208, 2004.

[25] L. Chen, S. S. Bhowmick, and W. Nejdl, COWES: Web user clustering based on evolutionary web sessions, Data & Knowledge Engineering, Vol. 68, n. 10, pp. 867-885, October 2009.

[26] B. Mobasher, Data Mining for Web

Personalization, The Adaptive Web, Springer

Lecture Notes in Computer Science, Vol. 4321,

pp. 90-135, 2007.

[27] C. Romero, and S. Ventura, Educational data

mining: A survey from 1995 to 2005, Expert

Systems with Applications, Vol. 33, n. 1, pp. 135-

146, 2007.

[28] T. Kohonen, Self-organized formation of topologically correct feature maps, Biological Cybernetics, Vol. 43, pp. 59-69, 1982.

[29] K. Etminani, A.R. Delui, N.R. Yanehsari, and M.

Rouhani, Web usage mining: Discovery of the

users' navigational patterns using SOM,

Networked Digital Technologies, NDT '09, pp.

224 – 249, 2009.

[30] C. Wei, W. Sen, Z. Yuan, and C. Lian-Chang,

Algorithm of mining sequential patterns for web

personalization services, SIGMIS Database, Vol.

40, n. 2, pp. 57-66, 2009.

[31] K. A. Smith and A. Ng, Web page clustering using a self-organizing map of user navigation patterns, Decision Support Systems, Vol. 35, pp. 245-256, 2003.

[32] G. Pallis, L. Angelis, and A. Vakali, Validation

and interpretation of Web users’ sessions

clusters, Information Processing &

Management, Vol. 43, n. 5, pp. 1348-1367,

September 2007.

[33] M.G.R. Sause, A. Gribov, A.R. Unwin, and S.

Horn, Pattern recognition approach to identify

natural clusters of acoustic emission signals,

Pattern Recognition Letters, Vol. 33, n. 1, pp.

17-23, January 2012.

[34] P. J. Rousseeuw, Silhouettes: a graphical aid to

the interpretation and validation of cluster

analysis, Journal of computational and applied

mathematics, Vol. 20, pp. 53-65, 1987.

[35] C. Shahabi, A.M. Zarkesh, J. Adibi, and V. Shah, Knowledge discovery from users Web page navigation, Proceedings of the 7th international workshop on research issues in data engineering, pp. 20-29, 1997.

Authors Information

1Computer Information Systems Department, King Abdullah II School of Information Technology, Jordan University, Amman 11942, Jordan.

Ammar M. Huneiti received his BSc, MSc and PhD degrees from Cardiff University, UK. His BSc is in Computer Science (1991), his MSc is in Information Systems Technologies (1992) and his PhD is in Systems Engineering (2004). Between 1992 and 2000 he worked for several private and public sector organizations supervising the design and implementation of IT-related projects. Since 2005 he has been an assistant professor at the Department of Computer Information Systems, King Abdullah II School of Information Technology, the University of Jordan. His research interests include Intelligent Web Information Systems, Web Data Mining, Adaptive Hypermedia, and User Modelling.