Personalizing the Web Directories
TRANSCRIPT
-
8/4/2019 Personalizing the Web Directories
1/40
Personalizing Web Directories with the Aid of Web Usage Data
Literature Survey:
Computational intelligence models for Personalization
CI has been defined as the study of adaptive mechanisms that enable or facilitate intelligent
behavior in complex and changing environments. It is an ongoing and evolving area of
research with roots in artificial intelligence, a term coined by John McCarthy in 1956.
Different CI models related to personalization are given in Figure 1.
Fuzzy Systems (FS) and Fuzzy Logic (FL) mimic the way people think, that is, with
approximate rather than precise reasoning. Fuzzy methods were found to be instrumental in web-based
personalization when used with WUM data. User profiles are processed using fuzzy approximate
reasoning to recommend personalized URLs. Handling of user profiles with fuzzy concepts has
been used by IR systems to provide users with personalized search engine results. Based on users'
web usage history, fuzzy methods have been used to categorize or cluster web objects for
web personalization. Fuzzy logic has also been combined with collective or collaborative data mining techniques to improve the quality of intelligent agents that provide personalized services to users.
Evolutionary Algorithms (EA) use mechanisms inspired by biological evolution, such as
reproduction, mutation, recombination and selection. One of the most popular EAs is the Genetic
Algorithm (GA), which mimics the gene structure of living organisms in line with evolutionary theory. GA
has been used to address some of the flaws of WUM and to tackle different problems such as
personalized search, IR, query optimization and document representation. GA was applied with
user log mining techniques to get a better understanding of user preferences and discover
associations between different URL addresses. GA was also used to introduce randomness into content
filtering rather than strict adherence to predefined user profiles; this is known as the element of
serendipity in IR. A modified GA was introduced for the optimal design of a website based on
multiple optimization criteria, taking download time, visualization and product association level
into consideration. Artificial Neural Networks (ANN), or simply Neural Networks (NN), mimic
the biological processes of the human brain. A NN can be trained to group users into specified
categories or into clusters.
This is useful in personalization, as each user group may possess similar preferences, and
hence the content of a web interface can be adapted to each group. NNs can also be trained to
learn the behavior of website users. Inputs for this learning can be derived from WUM data and
CF techniques. The learning ability of neural networks can also be used for real-time adaptive
interaction instead of only static, content-based personalization. A NN was used to
construct user profiles, and a NN was implemented to categorize e-mail folders. Swarm Intelligence
(SI) is based on the collective behavior of animals in nature such as birds, ants, bees and wasps.
Particle Swarm Optimization (PSO) models the convergence behavior of a flock of birds. PSO
was used to analyze the unique behavior of web users by manipulating web access log data and
user profile data. Personalized recommendation based on individual user preferences or CF data
has also been explored using PSO. This was done by building up profiles of users and then using
an algorithm to find profiles similar to the current user by supervised learning. Personalized and
automatic content sequencing of learning objects was implemented using PSO. Research has also
been done using PSO as a clustering algorithm, but no use of this approach to clustering was
found in relation to website personalization.
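To make the PSO mechanics concrete, the following is a minimal sketch of the algorithm as commonly formulated, not an implementation from any surveyed system; the objective function, search bounds and coefficient values are illustrative assumptions.

```python
import random

def pso_minimize(f, dim, n_particles=20, iters=100, seed=0):
    """Minimize f over [-5, 5]^dim with a basic global-best PSO."""
    rng = random.Random(seed)
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration coefficients (assumed values)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]            # each particle's best position so far
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # best position of the whole swarm
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # velocity pulled toward personal best and global best
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Sphere function: global minimum 0 at the origin.
best, val = pso_minimize(lambda x: sum(xi * xi for xi in x), dim=2)
```

The same convergence-toward-best mechanism is what the profile-matching and content-sequencing applications above exploit, with a fitness function defined over user-profile data instead of a toy objective.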
Another SI technique is Ant Colony Optimization (ACO), which models the behavior of
ants that leave the nest to wander randomly in search of food and, when it is found, leave a
trail of pheromone on the way back to the colony. ACO resulted in the development of
shortest-path optimization algorithms and has applications in routing optimization. ACO has
been used to classify web users in WUM (the cAnt-WUM algorithm), allowing personalization of the
web system to each user class. Bees Colony Optimization (BCO) is built on basic principles of
collective bee intelligence. It has been applied to web-based systems to improve the IR systems
of search engines by incorporating WUM data; however, the issue of personalization is not yet
known to have been directly addressed. Wasp Colony Optimization (WCO), or Wasp Swarm
Optimization (WSO), has not yet been exploited to the same extent as the other SI methods. It models
the behavior of wasps in nature. WCO has been applied to the NP-hard optimization
problem known as the Multiple Recommendations Problem (MRP), which occurs when several
personalized recommenders run simultaneously and results in churning, where a user
is presented with uninteresting recommendations. Further research has to be done, however, using
WCO on real, scalable and dynamic data sets. Artificial Immune Systems (AIS) mimic the
functioning of the human immune system, in which the body learns to handle antigens by producing
antibodies based on previous experience. Applications of AIS include pattern
recognition, classification, data clustering and anomaly detection. AIS has already been
applied to the personalization of web-based systems: the human body is represented by a
website, incoming web requests are antigens, and learning is paralleled to the immune system
learning to produce the right antibodies to combat each antigen. Using this analogy, an
AIS based on WUM was used as a learning system for a website. It is common practice to
combine CI techniques to create a hybrid which seeks to overcome the weakness of one
technique with the strength of another. Several hybrids have been applied to the personalization of
web-based systems. NN was combined with FL to give a hybrid Neuro-Fuzzy strategy for web
personalization. The topology and parameters of the NN were used to obtain the structure and
parameters of fuzzy rules.
The learning ability of NN was then applied to this set of rules. The ability of
evolutionary techniques such as GA to extract implicit information from user logs was
combined with fuzzy techniques to include vagueness in decision making. This FL-GA hybrid
allows more accurate and flexible modeling of user preferences. User data obtained from web
usage data is the input for a NN. The weights and fitness functions derived from NN training are optimized using GA to derive classification rules that govern personalized decision making in e-Business.
A fuzzy-PSO approach was introduced to personalize Content Based Image Retrieval
(CBIR). User logs were analyzed and used as the PSO input. Fuzzy principles were applied to
the PSO velocity, position and weight parameters.
Personalization of web-based systems using CI models
Based on the eight major CI methods described above, it can be seen that WUM is the common
input for all models; data mining in a sense provides the fuel for personalization using CI
methods. CI methods are comparable to a taxonomy of intelligent agents for personalization.
Building on ideas from this approach, a taxonomy for personalization of web-based systems was
proposed (cf. Fig. 2). Two main uses are identified for CI methods when applied to
personalization: profile generation and profile exploitation. User profiles can further be used to
personalize either the navigation or the content of web-based systems.
Profile generation
Profile generation is the creation of user profiles based on both implicit WUM data and explicit
user preferences. User profiles can be generated either per individual user or for groups of users
that appear to have similar previous web usage habits, using CF techniques. Five CI methods found in
previous work to have been applied to user profile generation for web-based systems are FL, NN,
PSO, ACO and AIS. FL models are constructed to capture ambiguity in user preferences;
however, there are many ways of interpreting fuzzy rules, and translating human knowledge into
formal controls can be challenging. NN was trained to identify similarities in user behavior;
however, for proper training the sample size must be large, and the NN can become complex due to
overfitting. Both PSO and GA were used to match users' behavior by profile-matching, but PSO
was found to outperform GA in terms of speed, execution and accuracy. ACO was used to model
users with relative accuracy and simplicity; however, its computational complexity leads to long
computing times. The PSO approach was found to be faster when compared to ACO. AIS was used to
dynamically adapt profiles to changing and new behaviors. The theoretical concept of AIS is not
fully sound, however, since in reality other human systems support the functioning of the immune
system and these are not modeled. The artificial cells in AIS do not work autonomously, so
the success or failure of one part of the system may determine the performance of the following
step.
A hybrid method uses GA to optimize the input values of a NN, to maximize the output. In this
way the slow learning process of NN is aided by the optimization ability of GA.
Profile exploitation
Profile exploitation personalizes various aspects of a web-based system using predefined user
profiles. Two main approaches to personalizing web-based systems were identified:
personalization of navigation and personalization of content (cf. Fig. 2).
Personalized navigation
Personalized navigation includes WUM for personalized IR, such as search engine results, and
URL recommendations. FL, BCO and GA were the three main CI methods found for navigation
personalization (cf. Fig. 2). FL was used for offline processing to recommend URLs to users. It is
relatively fast, deals with the natural overlap in user interests and is suitable for real-time
recommendations. Testing of various FL approaches, however, showed slightly lower precision,
and the fuzzy component is harder to program. GA was applied for search and retrieval, but it is known to be more
general and abstract than other optimization methods and does not always provide the optimal
solution. BCO was used for IR, but it is not a widely covered area of research and there is
currently a better theoretical than experimental understanding of it. ACO is similar to BCO and has seen
more successful applications. A hybrid between GA and FL was applied to this area: fuzzy set
techniques were used for better document modeling and genetic algorithms for query
optimization, to give personalized search engine results. A Neuro-Fuzzy method combined the
learning ability of NN with the representation of vagueness in Fuzzy Systems to overcome the
NN black-box behavior and present more meaningful results than FL alone.
Personalized content
Personalized content refers to WUM for personalizing the web objects on each web page and the
sequence of content. FL, NN, GA, PSO and WCO were the main CI techniques found with
applications in this area (cf. Fig. 2). FL was used for a web search algorithm and to automate
recommendations to e-commerce customers; it was found to be flexible and able to support
e-commerce applications. NN was used to group users into clusters for content recommendations;
however, the overfitting problem still exists today. GA was applied to devise the best arrangement of
web objects. It was found to be scalable; however, it is suggested that it be used in collaboration with
other data mining tools. PSO was used to sequence Learning Objects and was chosen because of its
relatively small number of parameters compared with other techniques such as GA; PSO parameter
selection is also a well-researched area. Using a modified PSO for data clustering was found to
give accurate results. WCO was applied to the churning problem of uninteresting content
recommendations to users. This is mostly a theoretical concept, not well tested on real data, and
other biologically inspired algorithms, such as ACO, have found more success. Fuzzy-PSO was
created to help improve the effectiveness of standard PSO particle movement in a content-based
system.
PROBABILISTIC LATENT SEMANTIC MODELS OF WEB USER NAVIGATIONS
The overall process of Web usage mining consists of three phases: data preparation and
transformation, pattern discovery, and pattern analysis. The data preparation phase transforms
raw Web log data into transaction data that can be processed by various data mining tasks. In the
pattern discovery phase, a variety of data mining techniques, such as clustering, association rule
mining, and sequential pattern discovery, can be applied to the transaction data. The discovered
patterns may then be analyzed and interpreted for use in applications such as Web personalization.
The usage data preprocessing phase [8, 32] results in a set of n page views, P =
{p1, p2, . . . , pn}, and a set of m user sessions, U = {u1, u2, . . . , um}. A page view is an
aggregate representation of a collection of Web objects (e.g. pages) contributing to the display on
a user's browser resulting from a single user action (such as a click-through, product purchase, or
database query). The Web session data can be conceptually viewed as an m × n session-page
view matrix UP = [w(ui, pj)]m×n, where w(ui, pj) represents the weight of page view pj
in user session ui. The weights can be binary, representing the existence or non-existence of the
page view in the session, or they may be a function of the occurrence or duration of the page
view in that session. PLSA is a latent variable model which associates a hidden (unobserved)
factor variable Z = {z1, z2, ..., zl} with the observations in the co-occurrence data. In our context,
each observation corresponds to an access by a user to a Web resource in a particular session,
represented as an entry of the m × n co-occurrence matrix UP. The probabilistic latent
factor model can be described as the following generative model:
1. select a user session ui from U with probability Pr(ui),
2. pick a latent factor zk with probability Pr(zk|ui),
3. generate a page view pj from P with probability Pr(pj|zk).
As a result we obtain an observed pair (ui, pj), while the latent factor variable zk is
discarded. Translating this process into a joint probability model results in the following:

Pr(ui, pj) = Pr(ui) Σk Pr(zk|ui) Pr(pj|zk),

summing over all possible choices of zk from which the observation could have been generated.
Using Bayes' rule, it is straightforward to transform the joint probability into:

Pr(ui, pj) = Σk Pr(zk) Pr(ui|zk) Pr(pj|zk).

Now, in order to explain a set of observations (U, P), we need to estimate the parameters Pr(zk),
Pr(ui|zk), Pr(pj|zk), while maximizing the following likelihood L(U, P) of the observations:

L(U, P) = Σi Σj w(ui, pj) log Pr(ui, pj).
The Expectation-Maximization (EM) algorithm is a well-known approach to performing maximum
likelihood parameter estimation in latent variable models. It alternates two steps:
(1) an expectation (E) step, where posterior probabilities are computed for the latent variables based
on the current estimates of the parameters,
(2) a maximization (M) step, where the parameters are re-estimated in order to maximize the expectation of
the complete data likelihood.
The EM algorithm begins with some initial values of Pr(zk), Pr(ui|zk), and Pr(pj|zk).
In the expectation step we compute:

Pr(zk|ui, pj) = Pr(zk) Pr(ui|zk) Pr(pj|zk) / Σk' Pr(zk') Pr(ui|zk') Pr(pj|zk').

In the maximization step, we aim at maximizing the expectation of the complete data likelihood,
E(LC):

E(LC) = Σi Σj w(ui, pj) Σk Pr(zk|ui, pj) log [Pr(zk) Pr(ui|zk) Pr(pj|zk)],

while taking into account the constraint Σk=1..l Pr(zk) = 1 on the factor probabilities, as well as
the following constraints on the two conditional probabilities: Σi Pr(ui|zk) = 1 and Σj Pr(pj|zk) = 1.
Through the use of Lagrange multipliers, we can solve the constrained
maximization problem to get the following equations for the re-estimated parameters:

Pr(zk) = Σi Σj w(ui, pj) Pr(zk|ui, pj) / Σi Σj w(ui, pj),
Pr(ui|zk) = Σj w(ui, pj) Pr(zk|ui, pj) / Σi' Σj w(ui', pj) Pr(zk|ui', pj),
Pr(pj|zk) = Σi w(ui, pj) Pr(zk|ui, pj) / Σi Σj' w(ui, pj') Pr(zk|ui, pj').

Iterating the above expectation and maximization steps monotonically increases
the total likelihood of the observed data, L(U, P), until a locally optimal solution is reached. The
computational complexity of this algorithm is O(mnl), where m is the number of user sessions, n
is the number of page views, and l is the number of factors. Since the usage observation matrix
is, in general, very sparse, the memory requirements can be dramatically reduced using an efficient
sparse matrix representation of the data.
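The EM updates above can be sketched as a dense-matrix NumPy implementation. This is an illustrative sketch, not the original system: real implementations would use the sparse representation and tempering mentioned in the text, and the toy matrix and factor count are assumptions.

```python
import numpy as np

def plsa_em(W, n_factors, n_iter=50, seed=0):
    """Fit the PLSA model Pr(u,p) = sum_k Pr(z_k) Pr(u|z_k) Pr(p|z_k) by EM.
    W is an (m, n) session-pageview weight matrix."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    Pz = np.full(n_factors, 1.0 / n_factors)                  # Pr(z_k)
    Pu_z = rng.random((m, n_factors)); Pu_z /= Pu_z.sum(0)    # Pr(u_i|z_k)
    Pp_z = rng.random((n, n_factors)); Pp_z /= Pp_z.sum(0)    # Pr(p_j|z_k)
    for _ in range(n_iter):
        # E-step: posterior Pr(z_k|u_i, p_j) for every pair (dense O(mnl))
        post = Pz[None, None, :] * Pu_z[:, None, :] * Pp_z[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate from expected counts w(u_i,p_j) * Pr(z_k|u_i,p_j)
        wc = W[:, :, None] * post
        Pu_z = wc.sum(axis=1); Pu_z /= Pu_z.sum(axis=0, keepdims=True) + 1e-12
        Pp_z = wc.sum(axis=0); Pp_z /= Pp_z.sum(axis=0, keepdims=True) + 1e-12
        Pz = wc.sum(axis=(0, 1)); Pz /= Pz.sum()
    return Pz, Pu_z, Pp_z

# Toy session-pageview matrix: two clear groups of sessions and pages.
W = np.array([[3, 2, 0, 0],
              [2, 3, 0, 0],
              [0, 0, 3, 2],
              [0, 0, 2, 3]], dtype=float)
Pz, Pu_z, Pp_z = plsa_em(W, n_factors=2)
```

Each iteration costs O(mnl) time, matching the complexity stated above; a sparse-matrix version would iterate only over the nonzero entries of W.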
DISCOVERY AND ANALYSIS OF USAGE PATTERN WITH PLSA
One of the main advantages of the PLSA model in Web usage mining is that it generates
probabilities which quantify the relationships between Web users and tasks, as well as between Web pages
and tasks. From these basic probabilities, using probabilistic inference, we can derive
relationships among users, among pages, and between users and pages. Thus this framework
provides a flexible approach to modeling a variety of types of usage patterns. In this section, we
describe various usage patterns that can be derived using the PLSA model. As noted before, the
PLSA model generates the probabilities Pr(zk), which measure the probability that a certain task is
chosen; Pr(ui|zk), the probability of observing a user session given a certain task; and Pr(pj|zk),
the probability of a page being visited given a certain task. Applying Bayes' rule to these
probabilities, we can generate the probability that a certain task is chosen given an observed user
session:

Pr(zk|ui) = Pr(zk) Pr(ui|zk) / Σk' Pr(zk') Pr(ui|zk')
and the probability that a certain task is chosen given an observed page view:

Pr(zk|pj) = Pr(zk) Pr(pj|zk) / Σk' Pr(zk') Pr(pj|zk').
In the following, we discuss how these models can be used to derive different kinds of usage
patterns. We will provide several illustrative examples of such patterns, from real Web usage
data, in Section 4.
Characterizing Tasks by Page views or by User Sessions
Capturing the tasks or objectives of Web users can help the analyst to better understand these
users' preferences and interests. Our goal is to characterize each task, represented by a latent
factor, in a way that is easy to interpret. One possible approach is to find the prototypical pages
that are strongly associated with a given task but that are not commonly identified as part of
other tasks. We call each such page a characteristic page for the task, denoted by pch. This
definition of prototypical has two consequences: first, given a task, a page which is seldom
visited cannot be a good characteristic page for that task; secondly, if a page is frequently visited
as part of a certain task but is also commonly visited in other tasks, the page is not a good
characteristic page. So we define the characteristic pages for a task zk as the set of all pages pch
which satisfy:

Pr(pch|zk) · Pr(zk|pch) ≥ μ,

where μ is a predefined threshold. By examining the characteristic pages of each task, we can
obtain a better understanding of the nature of these tasks. Characterizing tasks in this way can
lead to several applications. For example, most Web sites allow users to search for relevant
pages using keywords. If we also allow users to explicitly express their intended task(s) (via
inputting task descriptions or choosing from a task list), we can return the characteristic pages for
the specified task(s), which are likely to lead users directly to their objectives. A similar
approach can be used to identify prototypical user sessions for each task. We believe that a
user session involving only one task can be considered a characteristic session for that task.
So, we define the characteristic user sessions uch for a task zk as the sessions which satisfy:

Pr(zk|uch) ≥ μ',

where μ' is a predefined threshold. When a user selects a task, returning such exemplar sessions
can provide a guide to the user for accomplishing the task more efficiently. This approach can
also be used in the context of collaborative filtering to identify the closest neighbors to a user
based on the tasks performed by that user during an active session.
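The characteristic-page selection can be sketched as follows, assuming the factor-conditional page probabilities come from a fitted PLSA model; the parameter values and the threshold are hypothetical.

```python
import numpy as np

def characteristic_pages(Pz, Pp_z, k, mu=0.01):
    """Indices of pages p_ch with Pr(p_ch|z_k) * Pr(z_k|p_ch) >= mu.
    Pz: (l,) factor priors.  Pp_z: (n, l) with Pp_z[j, k] = Pr(p_j|z_k)."""
    joint = Pp_z * Pz[None, :]                 # proportional to Pr(z_k|p_j) by Bayes' rule
    Pz_p = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
    score = Pp_z[:, k] * Pz_p[:, k]            # Pr(p|z_k) * Pr(z_k|p)
    return np.flatnonzero(score >= mu)

# Hypothetical parameters for 4 pages and 2 tasks.
Pz = np.array([0.5, 0.5])
Pp_z = np.array([[0.6, 0.0],   # page 0: strongly tied to task 0
                 [0.3, 0.3],   # page 1: shared (e.g. navigational) page
                 [0.1, 0.2],
                 [0.0, 0.5]])
pages = characteristic_pages(Pz, Pp_z, k=0, mu=0.2)
```

Note how the product suppresses page 1: it is visited often under task 0 but is equally common under task 1, so Pr(z0|p1) is low, which is exactly the second consequence discussed above. Characteristic sessions can be selected the same way by thresholding Pr(zk|u).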
User Segments Identification
Identifying Web user groups or segments is an important problem in Web usage mining. It helps
Web site owners to understand and capture users' common interests and preferences. We can
identify user segments in which users perform common or similar tasks by making inferences
based on the estimated conditional probabilities obtained in the learning phase. For each task zk,
we choose all user sessions with probability Pr(ui|zk) exceeding a certain threshold μ to get a
session set C. Since each user session can also be represented as a page-view vector, we can
further aggregate these user sessions into a single page-view vector to facilitate interpretation.
The algorithm for generating user segments is as follows:
1. Input: Pr(ui|zk), the user session-page matrix UP and a threshold μ.
2. For each zk, choose all the sessions with Pr(ui|zk) ≥ μ to get a candidate session set C.
3. For each zk, compute the weighted average of all the chosen sessions in set C to get a page
vector, defined as (with ui denoting the page-view vector of session ui):

v(zk) = Σui∈C Pr(ui|zk) · ui / Σui∈C Pr(ui|zk).
4. For each factor zk, output the resulting page vector. This page vector consists of a set of weights, one for
each page view in P, representing the relative visit frequency of each page view for this user
segment. We can sort the weights so that the top items in the list correspond to the most
frequently visited pages for the user segment. These user segments provide an aggregate
representation of the navigational activities of all individual users in a particular group. In addition
to their usefulness in Web analytics, user segments also provide the basis for automatically
generating item recommendations. Given an active user, we compare her activity to all user
segments and find the most similar one. Then, we can recommend items (e.g., pages) with
relatively high weights in the aggregate representation of the segment. In Section 4, we conduct an
experimental evaluation of the user segments generated from two real Web sites.
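The segment-generation steps above can be sketched as follows; the session-page matrix, the session-factor probabilities and the threshold are all hypothetical stand-ins for the quantities produced by a fitted model.

```python
import numpy as np

def user_segment(W, Pu_z, k, mu=0.1):
    """Aggregate page vector for factor z_k: a weighted average of the
    sessions whose Pr(u_i|z_k) meets the threshold mu."""
    sel = Pu_z[:, k] >= mu                     # candidate session set C
    weights = Pu_z[sel, k]
    vec = weights @ W[sel] / (weights.sum() + 1e-12)
    return vec, np.argsort(-vec)               # vector + pages by visit weight

# Hypothetical 3 sessions x 4 pages and session-factor probabilities.
W = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 0., 1., 2.]])
Pu_z = np.array([[0.6, 0.0],
                 [0.4, 0.1],
                 [0.0, 0.9]])
vec, order = user_segment(W, Pu_z, k=0)
```

The returned ordering puts the segment's most frequently visited pages first, which is the sorted list used both for analytics and for top-weight recommendations.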
Identifying the Underlying Tasks of a User Session
To better understand the preferences and interests of a single user, it is necessary to identify the
underlying tasks performed by the user. The PLSA model provides a straightforward way to
identify the underlying tasks in a given user session. This is done by examining Pr(task|session),
the probability of a task being performed given the observation of a certain user
session. For a user session u, we select the top tasks zk with the highest Pr(zk|u) values as the
primary task(s) performed by this user. For a new user session, unew, not appearing in the
historical navigational data, we can adopt a folding-in method to generate Pr(task|session) via the
EM algorithm, keeping Pr(pj|zk) fixed. In the E-step, we compute:

Pr(zk|unew, p) = Pr(zk|unew) Pr(p|zk) / Σk' Pr(zk'|unew) Pr(p|zk'),

and in the M-step we re-estimate:

Pr(zk|unew) = Σp w(unew, p) Pr(zk|unew, p) / Σp w(unew, p).

Here, w(unew, p) represents the new user's visit frequency for the specified page p. After we
generate these probabilities, we can use the same method to identify the primary tasks for the
new user session. The identification of the primary tasks contained in user sessions can lead to
further analysis. For example, after identifying the tasks in all user sessions, each session u can
be transformed into a higher-level representation:

u → {(z1, w1), (z2, w2), . . . , (zl, wl)},
where zi denotes task i and wi denotes Pr(zi|u). This, in turn, would allow the discovery and
analysis of task-level usage patterns, such as determining which tasks are likely to be performed
together, or which tasks are most (or least) popular. Such higher-level patterns can help site
owners better evaluate the Web site organization.
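The folding-in procedure can be sketched as follows, holding Pr(p|z) fixed and re-estimating only the session-specific mixture; the fitted parameters and the new session are hypothetical.

```python
import numpy as np

def fold_in(w_new, Pz, Pp_z, n_iter=30):
    """Estimate Pr(z_k|u_new) for an unseen session by folding-in:
    Pr(p|z) is held fixed, only the session mixture is updated by EM."""
    Pz_u = Pz.astype(float).copy()             # initialise from the factor priors
    for _ in range(n_iter):
        post = Pz_u[None, :] * Pp_z            # E: Pr(z_k|u_new, p_j)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        Pz_u = w_new @ post                    # M: expected counts per factor
        Pz_u /= Pz_u.sum() + 1e-12
    return Pz_u

# Hypothetical fitted parameters (2 tasks, 4 pages) and a new session
# that visited only the first two pages.
Pz = np.array([0.5, 0.5])
Pp_z = np.array([[0.6, 0.0],
                 [0.3, 0.1],
                 [0.1, 0.4],
                 [0.0, 0.5]])
w_new = np.array([2.0, 1.0, 0.0, 0.0])
task_probs = fold_in(w_new, Pz, Pp_z)
```

The primary task of the new session is then simply the argmax of the returned mixture, and the full vector gives the higher-level (task, weight) representation described above.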
Integration of Usage Patterns with Web Content Information
Recent studies have emphasized the benefits of integrating semantic knowledge about the
domain (e.g., from page content features, relational structure, or domain ontologies) in the Web
usage mining process. The integration of content information about Web objects with usage
patterns involving those objects provides two primary advantages. First, the semantic
information provides additional clues about the underlying reasons for which a user may or may
not be interested in particular items. Secondly, in cases where little or no rating or usage
information is available (such as in the case of newly added items, or in very sparse data sets),
the system can still use the semantic information to draw reasonable conclusions about user
interests. The PLSA model described here also provides an ideal and uniform framework for
integrating content and usage information. Each page view contains certain semantic knowledge
represented by the content information associated with that page view.
By applying text mining and information retrieval techniques, we can represent each page
view as an attribute vector. Attributes may be keywords extracted from the page views, or
structured semantic attributes of the Web objects contained in the page views. As before, we
assume there exists a set of hidden factors z ∈ Z = {z1, z2, ..., zl}, each of which represents a
semantic group of pages. They can be a group of pages which have similar functionalities for
users performing a certain task, or a group of pages which contain similar content information or
semantic attributes. However, now, in addition to the set of page views, P, and the set of user
sessions, U, we also specify a set of t semantic attributes, A = {a1, a2, . . . , at}. To model the
user-page and attribute-page observations, we use:

Pr(ui, pj) = Σk Pr(zk) Pr(ui|zk) Pr(pj|zk) and Pr(aq, pj) = Σk Pr(zk) Pr(aq|zk) Pr(pj|zk).
These models can then be combined based on the common component Pr(pj|zk). This can be
achieved by maximizing the following log-likelihood function with a predefined weight α:

L = α Σi Σj w(ui, pj) log Pr(ui, pj) + (1 − α) Σq Σj w(aq, pj) log Pr(aq, pj),

where α is used to adjust the relative weights of the two types of observations. The EM algorithm can again
be used to generate estimates for Pr(zk), Pr(ui|zk), Pr(pj|zk), and Pr(aq|zk). By applying
probabilistic inference, we can measure the relationships among users, pages, and attributes,
and thus answer questions such as "What are the most important attributes for a group
of users?" or "Given a Web page with a specified set of attributes, will it be of interest to a
given user?", and so on.
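Assuming the two factor models above share Pr(p|z), the α-weighted objective can be computed as follows; the parameter matrices in the toy check are hypothetical, perfectly separated values chosen only to exercise the formula.

```python
import numpy as np

def combined_log_likelihood(W_up, W_ap, Pz, Pu_z, Pp_z, Pa_z, alpha=0.5):
    """alpha-weighted log-likelihood of the usage (U,P) and content (A,P)
    observations, coupled through the shared component Pr(p|z)."""
    Pup = (Pu_z * Pz) @ Pp_z.T                 # Pr(u_i, p_j)
    Pap = (Pa_z * Pz) @ Pp_z.T                 # Pr(a_q, p_j)
    L_usage = (W_up * np.log(Pup + 1e-12)).sum()
    L_content = (W_ap * np.log(Pap + 1e-12)).sum()
    return alpha * L_usage + (1 - alpha) * L_content

# Toy check with perfectly separated factors: each observed pair has
# probability 0.5, so the combined objective is 2 * log(0.5).
Pz = np.array([0.5, 0.5])
Pu_z = Pp_z = Pa_z = np.eye(2)
ll = combined_log_likelihood(np.eye(2), np.eye(2), Pz, Pu_z, Pp_z, Pa_z)
```

An EM fit of this objective alternates the same kind of E and M updates as before, with Pr(pj|zk) receiving expected counts from both the usage and the attribute observations.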
EXPERIMENTS WITH PLSA MODEL
In this section, we use two real data sets to perform experiments with our PLSA-based Web
usage mining framework. We first provide several illustrative examples of characterizing users'
tasks, as introduced in the previous section, and of identifying the primary tasks in an individual
user session. We then perform two types of evaluations based on the generated user segments.
First, we evaluate individual user segments to determine the degree to which they represent the
activities of similar users. Secondly, we evaluate the effectiveness of these user segments in the
context of generating automatic recommendations. In each case, we compare our approach with
the standard clustering approach for the discovery of Web user segments. In order to compare the
clustering approach to the PLSA-based model, we adopt a previously presented algorithm for creating
aggregate profiles based on session clusters. In the latter approach, first, we apply a
multivariate clustering technique, such as k-means, to the user-session data in order to obtain a set of
user clusters TC = {c1, c2, ..., ck}; then, an aggregate representation, prc, is generated for each
cluster c as a set of page view-weight pairs:

prc = {⟨p, weight(p, prc)⟩ | p ∈ P},

where the significance weight is given by weight(p, prc) = (1/|c|) Σu∈c w(p, u), and
w(p, u) is the weight of page view p in user session u ∈ c. Thus, each segment is represented
as a vector in the page-view space. In the following discussion, by a user segment we mean its
aggregate representation as a page-view vector.
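The aggregate-profile computation of this baseline can be sketched as follows; the cluster labels stand in for the output of a clustering step such as k-means, and the session-page weights are hypothetical.

```python
import numpy as np

def aggregate_profiles(W, labels):
    """Aggregate representation pr_c per session cluster:
    weight(p, pr_c) = (1/|c|) * sum over sessions u in c of w(p, u)."""
    return {c: W[labels == c].mean(axis=0) for c in np.unique(labels)}

# Hypothetical cluster assignment for 4 sessions over 3 pages.
W = np.array([[1., 1., 0.],
              [1., 0., 0.],
              [0., 0., 1.],
              [0., 1., 1.]])
labels = np.array([0, 0, 1, 1])
profiles = aggregate_profiles(W, labels)
```

Each resulting profile is exactly the page-view vector used as a user segment in the evaluation that follows.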
Data Sets
In our experiments, we use Web server log data from two Web sites. The first data set is based
on the server log data from the host Computer Science department. This Web site provides
various functionalities to different types of Web users. For example, prospective students can
obtain program and admissions information or submit online applications. Current students can
browse course information, register for courses, make appointments with faculty advisors, and
log into the Intranet to do degree audits. Faculty can perform student advising functions online or
interact with the faculty Intranet. After data preprocessing, we identified 21,299 user sessions
(U) and 692 pageviews (P), with each user session consisting of at least 6 pageviews. This data
set is referred to as the CTI data. The second data set is from the server logs of a local affiliate
of a national real estate company. The primary function of the Web site is to allow prospective
buyers to visit various pages and view information related to some 300 residential properties. The
portion of the Web usage data during the period of analysis contained approximately 24,000 user
sessions from 3,800 unique users. During preprocessing, we recorded each user-property pair
and the corresponding visit frequency. Finally, the data was filtered to limit the final data set to
those users that had visited at least 3 properties. In our final data matrix, each row represented a
user vector with properties as dimensions and visit frequencies as the corresponding dimension
values. We refer to this data set as the Realty data. Each data set was randomly divided into
multiple training and test sets to use with 10-fold cross-validation. By conducting sensitivity
analysis, we chose 30 factors in the case of CTI data and 15 factors for the Realty data. To avoid
overtraining, we implemented the Tempered EM algorithm to train the PLSA model.
Example Usage Patterns Based on the PLSA Models
Figure 1 depicts an example of the characteristic pages for a specific discovered task in the CTI
data. The first 6 pages have the highest Pr(page|task)·Pr(task|page) values and are thus considered
the characteristic pages of this task. Observing these characteristic pages, we may infer that this
task corresponds to prospective students who are completing an online admissions application.
Here, "characteristic" has two implications. First, if a user wants to perform this task, he/she must
visit these pages to accomplish his/her goal. Secondly, if we find that a user session contains these
pages, we can claim the user must have performed an online application. Some pages may not be
characteristic pages for the task but may still be useful for the purpose of analysis. An example
of such a page is the /news/ page, which has a relatively high Pr(page|task) value and a low
Pr(task|page) value. Indeed, by examining the site structure, we found that this page serves
as a navigational page that can lead users to different sections of the site to perform different
tasks (including the online application). This kind of discovery can help Web site designers
identify the functionalities of pages and reorganize Web pages to facilitate users' navigation.
Figure 2 identifies three tasks in the Realty data. In contrast to the CTI data, in this data set the
tasks represent common real estate properties visited by users, thus reflecting user interest in
similar properties. The similarities are clearly observed when property attributes are shown for
each characteristic page. From the characteristic pages of each task, we infer that Task 4
represents users' interest in newer and more expensive properties, while Task 0 reflects interest
in older and very low-priced properties. Task 5 represents interest in properties with midrange prices.
We can also identify prototypical users corresponding to specific tasks. An example of such a
user session is depicted in Figure 3, corresponding to yet another task in the Realty data which
reflects interest in very high-priced and large properties (task not shown here).
Our final example in this section shows how the prominent tasks contained in a given user
session can be identified. Figure 4 depicts a random user session from the CTI data. Here we only
show the task IDs which have the highest probabilities Pr(task|session). As indicated, the
dominant tasks for this user session are Tasks 3 and 25. The former is, in fact, the online
application task discussed earlier, and the latter is a task that represents international students
who are considering applying for admissions. It can be easily observed that, indeed, this session
seems to identify an international student who, after checking admission and visa requirements,
has applied for admissions online.
Evaluation of User Segments and Recommendations
We used two metrics to evaluate the discovered user segments. The first is called the Weighted
Average Visit Percentage (WAVP). WAVP allows us to evaluate each segment individually
according to the likelihood that a user who visits any page in the segment will visit the rest of the
pages in that segment during the same session. Specifically, let T be the set of transactions in the
evaluation set, and for a segment s, let Ts denote a subset of T whose elements contain at least
one page from s. The weighted average similarity to the segment s over all transactions is then
computed (taking both the transactions and the segments as vectors).
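The WAVP equation itself did not survive this transcript. A reconstruction following the standard definition of this metric in the web-mining literature (an assumption, since the original equation is missing), treating each transaction t and the segment s as pageview weight vectors:

```latex
\mathrm{WAVP}(s) \;=\; \frac{\sum_{t \in T_s} \, t \cdot s \,/\, |T_s|}
                          {\sum_{p \in s} w(p, s)}
```

where w(p, s) denotes the weight of page p in segment s; the denominator normalizes by the total weight of the segment, so a higher WAVP indicates that transactions touching the segment tend to cover it fully.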
Note that a higher WAVP value implies better quality of a segment in the sense that the segment
represents the actual behavior of users based on their similar activities. For evaluating the
recommendation effectiveness, we use a metric called Hit Ratio in the context of top-N
recommendation.
For each user session in the test set, we took the first K pages as a representation of an
active session to generate a top-N recommendation set. We then compared the recommendations
with pageview (K+1) in the test session, with a match being considered a hit. We define the
Hit Ratio as the total number of hits divided by the total number of user sessions in the test set.
Note that the Hit Ratio increases as the value of N (number of recommendations) increases.
Thus, in our experiments, we pay special attention to smaller numbers of recommendations
(between 1 and 20) that result in good hit ratios.
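The Hit Ratio evaluation loop described above can be sketched as follows. This is a minimal illustration, not the paper's code: `recommend` stands in for whichever model generates the top-N set, and `toy_recommend` and the sample sessions are invented purely to make the sketch runnable.

```python
def hit_ratio(test_sessions, recommend, k, n):
    """Fraction of test sessions whose (k+1)-th pageview appears in the
    top-n recommendations generated from the session's first k pageviews."""
    hits = 0
    evaluated = 0
    for session in test_sessions:
        if len(session) <= k:
            continue  # too short to hold out pageview k+1
        active, target = session[:k], session[k]
        if target in recommend(active, n):
            hits += 1
        evaluated += 1
    return hits / evaluated if evaluated else 0.0

# Hypothetical stand-in recommender: always suggests the same popular pages.
def toy_recommend(active, n):
    return ["/news/", "/apply/", "/courses/"][:n]

sessions = [["/a", "/news/", "/b"], ["/c", "/d", "/apply/"]]
print(hit_ratio(sessions, toy_recommend, k=2, n=3))  # 0.5
```

Note that, exactly as the text observes, enlarging `n` can only add hits, so the Hit Ratio is non-decreasing in the number of recommendations.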
In the first set of experiments we compare the WAVP values for the segments generated using
the PLSA model with those generated by the clustering approach. Figures 5 and 6 depict these
results for the CTI and Realty data sets, respectively. In each case, the segments are ranked in
decreasing order of WAVP. The results show clearly that the probabilistic segments based on the
latent factors provide a significant advantage over the clustering approach. In the second
set of experiments we compared the recommendation accuracy of the PLSA model with that of
the k-means clustering segments. In each case, the recommendations are generated according to the
recommendation algorithm presented in Section 3.2. The recommendation accuracy is measured
based on the hit ratio for different numbers of generated recommendations. These results are depicted
in Figures 7 and 8 for the CTI and Realty data sets, respectively. Again, the results show a clear
advantage for the PLSA model. In most realistic situations, we are interested in a small, but
accurate, set of recommendations. Generally, a reasonable recommendation set might contain 5
to 10 recommendations. Indeed, this range of values seems to represent the largest improvements
of the PLSA model over the clustering approach.
ODP: The Open Directory Project
Description. The DMOZ Open Directory Project (ODP) [20] is the largest, most
comprehensive human-edited web page catalog currently available. It covers 4 million sites filed
into more than 590,000 categories (16 wide-spread top categories, such as Arts, Computers,
News, Sports, etc.). Currently, there are more than 65,000 volunteer editors maintaining it.
ODP's data structure is organized as a tree, where the categories are internal nodes and pages are
leaf nodes. By using symbolic links, nodes can appear to have several parent nodes. Since ODP
truly is free and open, everybody can contribute to or re-use the dataset, which is available in RDF
(structure and content are available separately). Google, for example, uses ODP as the basis for its
Google Directory service.
Applications
Besides its re-use in other directory services, the ODP taxonomy is used as a basis for various
other research projects. In Persona, ODP is applied to enhance HITS with dynamic user profiles
using a tree-coloring technique (by keeping track of the number of times a user has visited
pages of a specific category). Users can rate a page as being good or unrelated with regard to
their interests. This data is then used to promote interesting results and omit unwanted ones. While
Persona asks users for feedback, we rely only on user profiles, i.e., a one-time user interaction.
Moreover, we do not develop our search algorithm on top of HITS, but on top of any search
algorithm, as a refinement. A similar approach using the ODP taxonomy has been applied to a
recommender system for research papers. The Open Directory can also be used as a reference
source of good pages to fight web spam through white-listing of URLs, as a web
corpus for comparisons of ranking algorithms, as well as for focused crawling towards special-interest
pages. Unfortunately, the free availability of ODP also has its downside. A clone of the
directory, modified to contain some spam pages, could trick people into linking to this fake
directory, which results in an increased ranking not only for this directory clone, but also for the injected
spam pages.
Page Rank and Personalized Page Rank
Page Rank computes Web page scores based on the graph inferred from the link structure of the
Web. It is based on the idea that a page has high rank if the sum of the ranks of its back links is
high. Given a page p, with I(p) and O(p) denoting its sets of incoming and outgoing links, the Page Rank formula is:
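The formula itself was lost in this transcript; the following reconstruction is assumed from the surrounding description (c the damping factor, E the teleportation vector, I(p) and O(p) the in- and out-link sets):

```latex
r(p) \;=\; (1 - c) \sum_{q \in I(p)} \frac{r(q)}{|O(q)|} \;+\; c \, E(p)
```

Each term matches the text below: a surfer follows an outgoing link with probability (1 - c), splitting a page's rank evenly over its |O(q)| out-links, and jumps to a random page drawn from E with probability c.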
The damping factor c < 1 (usually 0.15) is necessary to guarantee convergence and to limit the
effect of rank sinks [2]. Intuitively, a random surfer will follow an outgoing link from the current
page with probability (1 - c), and will get bored and select a random page with probability c (i.e.,
the E vector has all entries equal to 1/N, where N is the number of pages in the Web graph).
Initial steps towards personalized page ranking have already been described in the literature,
proposing a slight modification of the above algorithm to redirect the random surfer towards
preferred pages using the E vector. Several distributions for this vector have been proposed since.
Topic-sensitive Page Rank
Haveliwala builds a topic-oriented Page Rank, starting by computing off-line a set of 16 Page
Rank vectors, each biased towards one of the 16 main topics of the Open Directory Project. Then, the
similarity between a user query and each of these topics is computed, and the 16 vectors are
combined using appropriate weights. Personalized Page Rank. A more recent investigation uses
a different approach: it focuses on user profiles. One Personalized Page Rank Vector (PPV) is
computed for each user. The personalization aspect of this algorithm stems from a set of hubs
(H),1 each user having to select her preferred pages from it. PPVs can be expressed as a linear
combination of PPVs for preference vectors with a single non-zero entry corresponding to each
of the pages from the preference set (called basis vectors). The advantage of this approach is that
for a hub set of N pages, one can compute 2^N Personalized Page Rank vectors without having to
run the algorithm again; in the topic-sensitive approach, by contrast, the whole computation must be
performed for each biasing set. The disadvantages are forcing the users to select their preference
set only from within a given group of pages (common to all users), as well as the relatively high
computation time for large-scale graphs.
USING ODP METADATA FOR PERSONALIZED SEARCH
Motivation. We presented in Section 2.2 the most popular approaches to personalizing Web
search. Even though they are the best so far, they all have some important drawbacks. With basic
personalized page ranking, we need to run the entire algorithm for each preference set (or biasing
set), which is practically impossible in a large-scale system. At the other end, topic-sensitive Page
Rank computes biased PageRank vectors limited only to the broad 16 top-level categories of the
ODP, because of the same problem. The hub-based approach improves this somewhat, allowing
the algorithm to bias on any subset of a given set of pages (H). Although work has been done in
the direction of improving the quality of this latter set [4], one limitation is still that the preference
set is restricted to a subset of this given set H (if H = {CNN, FOX News}, we cannot bias on
MSNBC, for example). More importantly, the bigger H is, the more time is needed to run the
algorithm. Thus, finding a simpler and faster algorithm with at least similar personalization
granularity is still a worthy goal to pursue. In the following we make another step towards this goal.
1Note that hubs were defined here as pages with high Page Rank, differently from the more popular definition.
Introduction. Our first
step was to evaluate how ODP search compares with Google search, specifically exploiting the
fact that all ODP entries are categorized into the ODP topic hierarchy. We started with the
following two observations: 1. Given the fact that ODP includes just 4 million entries, while the
Google database includes 8 billion, does ODP-based search stand a chance of being comparable
to Google? 2. ODP advanced search offers a rudimentary personalized search feature by
restricting the search to the entries of just one of the 16 main categories. Google Directory offers
a related feature, allowing search to be restricted to a specific category or subcategory. Can we
improve this personalized search feature, taking the user profile into account in a more
sophisticated way, and how does such an enhanced personalized search on the ODP or Google
entries compare to ordinary Google results? Most people would probably answer (1) "No, not
yet", and (2) "Yes". In the following Section we will prove the correctness of the second answer
by introducing a new personalized search algorithm, and then we will concentrate on the first
answer in the experiments Section.
Algorithm
Our algorithm exploits the annotations accumulated in generic large-scale taxonomies such
as the Open Directory. Even though we concentrate our forthcoming discussion on ODP,
practically any similar taxonomy can be used. These annotations can easily be used to achieve
personalization, and can also be combined with the initial Page Rank algorithm. We define user
profiles using a simple approach: each user has to select several topics from the ODP which best
fit her interests. For example, a user profile could look like this:
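The example profile itself did not survive this transcript. For illustration only (these topics are hypothetical, not the paper's original example), a profile of the described shape is simply a small set of ODP topic paths:

```text
/Arts/Architecture
/Computers/Software
/Science/Biology/Ecology
```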
Then, at run-time, the output given by a search service (Google, ODP Search, etc.) is re-sorted
using a calculated distance from the user profile to each output URL. The execution is
also depicted in Algorithm 3.1.
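Algorithm 3.1 itself is not reproduced in this transcript. A minimal sketch of the described re-sorting step, using the minimum tree distance defined later in this Section as the metric (function and variable names are ours, not the paper's):

```python
def tree_distance(a, b):
    """Tree edges from topic a to the subsumer (deepest common ancestor)
    plus tree edges from topic b to the subsumer."""
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)

def personalize(results, profile):
    """Re-sort (url, topics) search results by the minimum distance between
    the profile topics (set A) and the URL's topics (set B)."""
    def min_dist(entry):
        _, topics = entry
        if not topics:                      # uncategorized URL: sinks to the end
            return float("inf")
        return min(tree_distance(a, b) for a in profile for b in topics)
    return sorted(results, key=min_dist)    # stable: ties keep engine order

# The worked distance example from the text:
print(tree_distance("/Arts/Architecture",
                    "/Arts/Design/Interior Design/Events/Competitions"))  # 5
```

Because Python's sort is stable, URLs at equal profile distance retain the original engine ranking, which is a natural tie-breaking choice for a re-ranking refinement.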
Distance Metrics. When performing search on Open Directory, each resulting URL comes with
an associated ODP topic. Similarly, a good proportion of the URLs output by Google are connected
to one or more topics within the Google Directory (almost 50%, as discussed in Section 3.2).
Therefore, in both cases, for each output URL we are dealing with two sets of nodes from the
topic tree: (1) those representing the user profile (set A), and (2) those associated with the URL
(set B). The distance between these sets can then be defined as the minimum distance over
all pairs of nodes given by the Cartesian product A × B. Finally, there are quite a few
possibilities to define the distance between two nodes. Even though, as we will see from the
experiments, the simplest approaches already provide very good results, we are now performing
an optimality study2 to determine which metric best fits this kind of search. In the following, we
will present our best solutions so far. Naïve Distances. The simplest solution is the minimum
tree distance, which, given two nodes a and b, returns the sum of the minimum number of tree
edges between a and the subsumer (the deepest node common to both a and b) plus the
minimum number of tree edges between b and the subsumer (i.e., the shortest path between a and
b). In the example from Figure 1, the distance between /Arts/Architecture and
/Arts/Design/Interior Design/Events/Competitions is 5, and the subsumer is /Arts. If we also
consider the inter-topic links from the Open Directory, the simplest distance becomes the graph
shortest path between a and b. For example, if there is a link between Interior Design and
Architecture in Figure 1, then the distance between Competitions and Architecture is 3. This
solution implies loading either the entire topic graph or all the inter-topic links into memory.
Furthermore, its utility is subjective from user to user: the existence of a link between
Architecture and Interior Design does not always imply that a famous architect (one level below
in the tree) is very close to the area of interior design. We can consider these links in our metric
in three ways: 1. Consider the graph containing all inter-topic links and output the shortest path
between a and b. 2. Consider the graph containing only the inter-topic links directly connected to a
and b and output the shortest path. 3. If there is an inter-topic link between a and b, output 1.
Otherwise, ignore all inter-topic links and output the tree distance between a and b.
2We refer the reader to our accompanying study for an in-depth view of the approach we took.
Complex Distances. The
main drawback of the above metrics comes from the fact that they ignore the depth of the
subsumer. The bigger this depth is, the more related are the nodes (i.e., the concepts represented
by them). This problem is solved by prior work which investigates ten intuitive strategies for
measuring semantic similarity between words using hierarchical semantic knowledge bases such
as WordNet [18]. Each strategy was evaluated experimentally on a group of testers, the best one
having a 0.9015 correlation between human judgment and the following formula:
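The formula did not survive the transcript. The measure described (two parameters set to 0.2 and 0.6, subsumer depth h, path length l, 0.9015 correlation) matches the exponential similarity of Li et al., so the following reconstruction is assumed:

```latex
\mathrm{sim}(a, b) \;=\; e^{-\alpha l} \cdot
    \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}
```

The first factor decays with the path length l between the two concepts, while the second (a hyperbolic tangent in the subsumer depth h) rewards deeper, hence more specific, common ancestors.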
The parameters are as follows:
α and β were defined as 0.2 and 0.6 respectively, h is the tree depth of the subsumer, and l is the
semantic path length between the two words. Considering that we have several words attached to
each concept and sub-concept, l is 0 if the two words are in the same concept, 1 if they are
in different concepts but the two concepts have at least one common word, or the tree shortest
path if the words are in different concepts which do not contain common words. Although this
measure is very good for words, it is not perfect when we apply it to the Open Directory topical
tree, because it does not differentiate between the distance from a (the profile node) to the
subsumer and the distance from b (the output URL) to the subsumer. Consider node a to be
/Top/Games and b to be /Top/Computers/Hardware/Components/Processors/x86. A teenager
interested in computer games (level 2 in the ODP tree) could be very satisfied receiving a page
about new processors (level 6 in the tree) which might improve his gaming experience. On the other
hand, the opposite scenario (profile on level 6 and output URL on level 2) does not hold any
more, at least not to the same extent: a processor manufacturer will generally be less interested in
the games existing on the market. This leads to our following extension of the above formula:
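The extended formula is likewise missing from the transcript. One plausible reconstruction, under the assumption that the asymmetry is introduced by weighting the two path components (γ being the parameter in [0, 1] described below, l1 and l2 the two subsumer paths):

```latex
\mathrm{sim}(a, b) \;=\; e^{-\alpha\,\left(\gamma\, l_1 + (1-\gamma)\, l_2\right)} \cdot
    \frac{e^{\beta h} - e^{-\beta h}}{e^{\beta h} + e^{-\beta h}}
```

Setting γ above 0.5 penalizes the profile-to-subsumer path more than the URL-to-subsumer path, capturing the games-versus-processors asymmetry in the example.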
with l1 being the shortest path from the profile to the subsumer, l2 the shortest path from the
URL to the subsumer, and γ a parameter in [0, 1]. Combining the Distance Function with Google
Page Rank. And yet something is still missing. If we use Google to do the search and then sort
the URLs according to the Google Directory taxonomy, some high quality pages might be
missed (i.e., those which are top ranked, but which are not in the directory). In order to integrate
that, the above formula could be combined with the Google Page Rank. We propose the
following approach:
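The proposed combination is not reproduced in the transcript. One plausible form, hedged as an assumption (δ a hypothetical mixing weight in [0, 1], and PR_norm the Page Rank score normalized to [0, 1]):

```latex
\mathrm{score}(p) \;=\; \delta \cdot \mathrm{PR_{norm}}(p)
    \;+\; (1 - \delta) \cdot \mathrm{sim}(\mathrm{profile}, p)
```

Such a convex combination lets top-ranked pages absent from the directory still surface, while directory-annotated pages close to the profile are promoted.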
Conclusion. Human judgment is a non-linear process over information sources, and therefore it
is very difficult (if not impossible) to propose a metric that correlates perfectly with it. A
thorough experimental analysis of all these metrics (which we are currently performing, but
which is outside the scope of this paper) could give us a good enough approximation. In the next
Section we will present some experiments using the simple metric presented first, and show that
it already yields quite reasonable improvements.
Experimental Results
To evaluate the benefits of our personalization algorithm, we interviewed 17 of our colleagues
(researchers in different computer science areas, psychologists, pedagogues and designers),
asking each of them to define a user profile according to the Open Directory topics (see Section
3.1 for an example profile), as well as to choose three queries of the following types: one clear
query, which they knew to have one or at most two meanings;3 one relatively ambiguous query,
which they knew to have two or three meanings; and one ambiguous query, which they knew to
have at least three meanings, preferably more. We then compared test results using the following
four types of Web search: 1. Plain Open Directory Search; 2. Personalized Open Directory
Search, using our algorithm from Section 3.1 to reorder the top 1000 results returned by the ODP
Search; 3. Google Search, as returned by the Google API [8]; 4. Personalized Google Search,
using our algorithm from Section 3.1 to reorder the top 100 URLs returned by the Google API,4
having as input the Google Directory topics returned by the API for each resulting URL. For each
algorithm, each tester received the top 5 URLs with respect to each type of query, 15 URLs in
total. All test data was shuffled, such that testers were neither aware of the algorithm, nor of the
ranking of each assessed URL. We then asked the subjects to rate each URL from 1 to 5, with 1
defining a very poor result with respect to their profile and expectations (e.g., topic of the result,
content, etc.) and 5 a very good one.5 Finally, for each sub-set of 5 URLs we took the average
grade as a measure of importance attributed to that pair.
The average values for all users and for each of these pairs can be found in Table 1,
together with the averages over all types of queries for each algorithm. We of course expected
the plain ODP search to be significantly worse than the Google search, and that was the case:
an average of 2.41 points for ODP versus the 2.76 average received by Google. Also predictable
was the dependence of the grading on the query type. If we average the values on the three
columns representing each query type, we get 2.54 points for ambiguous queries, 2.91 for semi-ambiguous
ones and 3.25 for clear ones; thus, the clearer the query, the better rated were
the returned URLs. Personalized Search using ODP. But the same Table 1 also provides us with a
more surprising result: the personalized search algorithm is clearly better than Google search,
regardless of whether we use the Open Directory or the Google Directory as taxonomy. Therefore, a
personalized search on a well-selected set of 4 million pages often provides better results than a
non-personalized one over an 8 billion page set. This is a clear indicator that taxonomy-based result
sorting is indeed very useful. For the ODP experiments, only the clear queries did not receive a
big improvement, mainly because for some of these queries ODP contains fewer than 5 URLs
matching both the query and the topics expressed in the user profile. Personalized Search using
Google. Similarly, personalized search using the Google Directory was far better than the usual
Google search. We would have expected it to be even better than the ODP-based personalized
search, but the results were probably negatively influenced by the fact that the ODP experiments
were run on 1000 results, whereas the Google Directory ones ran on only 100, due to the limited
number of Google API licenses we had. The grading results are summarized in Figure 2.
Generally, we can conclude that personalization significantly increases output quality for
ambiguous and semi-ambiguous queries. For clear queries, one should prefer Google to Open
Directory search, but also Google Directory search to
the plain Google search. Also, the answers we sketched in the beginning of this Section proved
to be true: Google search is still better than Open Directory search, but we provided a
personalized search algorithm which outperforms the existing Google and Open Directory search
capabilities. Another interesting result is that 40.98% of the top 100 Google pages were also
contained in the Google Directory. More specifically, for the ambiguous queries 48.35% of the
top pages were in the directory, for the semi-ambiguous ones 41.35%, and for the clear ones
33.23%.6 Finally, let us add that we performed statistical significance tests7 on our experiments,
obtaining the following results: statistical significance with an error rate below 1% for the
algorithm criterion, i.e., there is a significant difference between the gradings of the algorithms;
an error rate below 25% for the query type criterion, i.e., the difference between the average
grades with respect to query types is less statistically significant; and statistical significance with
an error rate below 5% for the interrelation between query type and algorithm.
EXTENDING ODP ANNOTATIONS TO THE WEB
In the last Section we have shown that using ODP entries and their categorization directly for
personalized search works amazingly well. Can this huge annotation effort invested in the ODP
project (with 65,000 volunteers participating in building and maintaining the ODP database) be
extended to the rest of the Web? This would be useful if we want to find less highly rated pages
not contained in the directory. Just extending the ODP effort does not scale: first, significantly
increasing the number of volunteers seems improbable, and second, extending the selection of
ODP entries to a larger percentage of the Web obviously becomes harder and less rewarding
once we try to include more than just the most important pages for a specific topic. We start
with the following questions: Given that Page Rank for a large collection of Web pages can be
biased towards a smaller subset, can this be done with sets of ODP entries corresponding to
given categories/subcategories as well? Specifically, ODP entries include many of the most
important pages in a given category. Do we have enough entries for each topic such that biasing
on these entries makes a difference?
When does biasing make a difference?
One of the most important works investigating Page Rank biasing is the topic-sensitive approach
discussed above. It first uses the 16 top levels of the ODP to bias Page Rank on, and then provides
a method to combine these 16 resulting vectors into a more query-dependent ranking. But what if
we would like to use one or several ODP (sub-)topics to compute a Personalized Page Rank
vector? More generally, what if we would like to achieve such a personalization by biasing Page
Rank towards some generic subset of pages from the current Web crawl we have? Many authors
have used such biasing in their algorithms. Yet none have studied the boundaries of this
personalization, i.e., the characteristics the biasing set has to exhibit in order to obtain relevant
results (rankings which are different enough from the non-biased Page Rank). We will investigate
this in the current Section. Once these boundaries are defined, we will use them to evaluate (some
of) the biasing sets available from ODP in Section 4.2. First, let us establish a characteristic
function for biasing sets, which we will use as the parameter determining the effectiveness of
biasing. Pages in the World Wide Web can be characterized in quite a few ways.
The simplest of them is the out-degree (i.e., the total number of outgoing links), based on
the observation that if biasing is targeted at such a page, the newly achieved increase in Page
Rank score will be passed forward to all its out-neighbors (the pages to which it points). A more
sophisticated version of this measure is the hub value of pages. Hubs were originally defined as
pages pointing to many other high quality pages; reciprocally, high quality pages pointed to by
many hubs are called authorities. There are several algorithms for calculating this measure, the
most common ones being HITS and its more stable improvements, SALSA and Randomized
HITS. Yet biasing on better hub pages will have less influence on the rankings, because the vote
a page gives is propagated to its out-neighbors divided by its out-degree. Moreover, there is also
an intuitive reason against this measure: Page Rank biasing is usually performed to achieve some
degree of personalization, and people tend to prefer highly valued
authorities to highly valued hubs. Therefore, a more natural measure is an authority-based one,
such as the non-biased Page Rank score of a page. Even though most of the biasing sets consist
of high Page Rank pages, in order to make this analysis complete we have run our experiments
on different choices for these sets, each of which must be tested with different sizes. For
comparison to Page Rank, we used two degrees of similarity between the non-biased Page Rank
and each resulting biased vector of ranks. They are defined as follows: 1. OSim indicates the
degree of overlap between the top n elements of two ranked lists τ1 and τ2. Denoting by A and B
the sets of the top n items of τ1 and τ2 respectively, it is defined as OSim(τ1, τ2) = |A ∩ B| / n.
2. KSim is a variant of Kendall's τ distance measure. Unlike OSim, it measures the degree of
agreement between the orderings of the two ranked lists. If U is the union of the items in τ1 and
τ2, and δ1 is U \ τ1, then let τ1' be the extension of τ1 containing the items of δ1 appearing after
all items in τ1. Similarly, τ2' is defined as an extension of τ2. Using these notations, KSim is
defined as the fraction of distinct item pairs on whose relative order the two extended lists agree:
KSim(τ1, τ2) = |{(u, v) : τ1' and τ2' agree on the order of (u, v), u ≠ v}| / (|U| · (|U| − 1)).
Even though earlier work used n = 20, we chose n to be 100, after experimenting with both
values and obtaining more stable results with the latter. A general study of different similarity
measures for ranked lists can be found in the literature. Let us start by analyzing the biasing on high quality
pages (i.e., those with a high Page Rank). We consider the most common set to contain pages in
the range [0, 10]% of the sorted list of Page Rank scores. We varied the sum of scores within this
set between 0.00005% and 10% of the total sum over all pages (for simplicity, we will call this
value TOT hereafter). For very small sets, the biasing produced an output only somewhat
different: about 38% Kendall similarity (see Figure 3). The same happened for large sets,
especially those above 1% of TOT. Finally, the graph also makes clear where we would get the
most different rankings from the non-biased ones.
Someone could wish to bias only on the best pages (the top [0, 2]%, as in Figure 4). In this
case, the above results would only be shifted a little bit to the right on the x-axis of the graph,
i.e., the highest differences would be achieved for a set size from 0.02% to 0.75%. This was to be
expected, as all the pages in the biasing set were already top ranked, and it would therefore
take a little more effort to produce a different output with such a set. Another possible input
set consists of randomly selected pages (Figure 5). Such a set most probably contains many low
Page Rank pages. This is why, although the biased ranks are very different for low TOT values,
they start to become extremely similar (up to almost identical) after TOT exceeds 0.01%
(it would take a lot of low Page Rank pages to accumulate a TOT value of 1% of the
overall sum of scores, for example). The extreme case is to bias only on low Page Rank pages
(Figure 6). In this case, the biasing set will contain too many pages even sooner, around TOT =
0.001%. The last experiment is mostly theoretical. One would expect to obtain the smallest
similarities to the non-biased rankings when using a biasing set from [2, 5]% (because these
pages are already close to the top, and biasing on them would have the best chances to overturn the
list). Experimental results support this intuition (Fig
The graphs above were initially generated based on a crawl of 3 million pages. Once all of them
had been finalized, we selectively re-ran similar experiments on the Stanford WebBase crawl,
obtaining similar results. For example, a biasing set of size TOT = 1% containing randomly
selected pages produced rankings with a 0.622 Kendall similarity to the non-biased ones,
whereas a set of TOT = 0.0005% produced a similarity of only 0.137. This was necessary in
order to show that the graphs discussed above are not influenced by the crawl size. Even so, the
limits they establish are not totally accurate, because of the random or targeted-random selection
(e.g., towards the top [0, 2]% pages) of our experimental biasing sets.
Is biasing possible in the ODP context?
The URLs collected in the Open Directory are manually added Web pages supposed to
(1) cover the specific topic of the ODP tree leaf they belong to and
(2) be of high quality. Neither requirement is fully satisfied. Sometimes (though rarely) the pages do not really represent the topic under which they were added. More importantly for PageRank biasing, they usually cover a large interval of page ranks, which led us to adopt the random biasing model. We are aware, however, that in this case the human editors chose far more high-quality pages than low-quality ones, and thus the conclusions of the analysis are susceptible to error. Generally, according to the random model of biasing, every set with TOT below 0.015% is good for biasing. On this basis, all possible biasing sets analyzed in Tables 3, 4 and 5 would generate a sufficiently different PageRank vector. We can therefore conclude that biasing is (most probably) possible on all subsets of the Stanford Open Directory crawl.
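The biasing discussed above can be sketched as a PageRank power iteration whose teleportation vector is concentrated on the biasing set. This is a minimal, assumed formulation: the toy graph, damping factor and iteration count are illustrative and not taken from the survey:

```python
def biased_pagerank(links, biasing_set, d=0.85, iters=50):
    """Power-iteration PageRank whose teleportation jumps only to the
    biasing set, so pages in (and linked from) that set are promoted.

    links: dict mapping each page to its list of outgoing links.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    teleport = {p: (1.0 / len(biasing_set) if p in biasing_set else 0.0)
                for p in pages}
    for _ in range(iters):
        # (1 - d) of the mass teleports straight into the biasing set.
        new = {p: (1.0 - d) * teleport[p] for p in pages}
        for p in pages:
            out = links[p]
            if out:
                share = d * rank[p] / len(out)
                for q in out:
                    new[q] += share
            else:
                # Dangling page: redistribute its mass to the biasing set.
                for q in pages:
                    new[q] += d * rank[p] * teleport[q]
        rank = new
    return rank

# Hypothetical 4-page graph; biasing on "c" lifts it to the top.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = biased_pagerank(graph, biasing_set={"c"})
```

Because the teleportation mass falls entirely on the biasing set, even a tiny set (like the TOT < 0.015% sets above) shifts the stationary distribution enough to produce a visibly different ranking.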
Web usage mining has been extensively used to analyze web log data. There exist various methods based on data mining algorithms and probabilistic models. The related literature is very extensive, and many of these approaches fall outside the scope of this paper. For more information, the reader may refer to. There exist many approaches for discovering sequences of visits in a web site. Some of them are based on data mining techniques, whereas others use probabilistic models, such as Markov models, in order to model the users' visits. Such approaches aim at identifying representative trends and browsing patterns describing the activity in a web site, and can assist web site administrators in redesigning or customizing the web site, or in improving the performance of their systems. They do not, however, propose any methods for personalizing the web sites. There exist some approaches that use the aforementioned techniques in order to personalize a web site. Contrary to our approach, these approaches do not distinguish between different users or user groups in order to perform the personalization. Thus, the methods that seem most relevant to ours, in terms of identifying different interest groups and personalizing the web site based on these profiles, are those based on collaborative filtering.
Collaborative filtering systems are used for generating recommendations and have been broadly
used in e-commerce. Such systems are based on the assumption that users with common interests exhibit similar searching/browsing behavior. Thus, the identification of similar
user profiles enables the filtering of relevant information and the generation of
recommendations.
Similar to such approaches, we also identify users with common interests and use this information to personalize the topic directory. In our work, however, we do not model the user profiles as vectors in order to find similar users. Instead, we use clustering to group users into interest groups. Moreover, we propose the use of sequential pattern mining in order to generate recommendations. Thus, we also capture the sequential dependencies within users' visits, whereas this is not the case with collaborative filtering systems. All of the aforementioned approaches aim at personalizing generic web sites. Our approach focuses on the personalization of a specific type of web site, that of topic directories. Since topic directories organize web content into meaningful categories, we can regard them as a form of digital library or portal. In this context, we also overview here some approaches for personalizing digital libraries and web
portals. Some early approaches were based on explicit user input, and the personalization services provided are limited to simplified search functionalities or alerting services. Other approaches propose the semi-automatic generation of user recommendations based on implicit user input. In those approaches, information is extracted from user accesses to the DL resources, and is then used for further retrieval or filtering. As already mentioned, our approach does not limit its personalization services to identifying the preferences of each individual user alone. Rather, we identify user groups with common interests and behavior, expressed by visits to certain categories and information resources. This is also enabled by approaches based on collaborative filtering. Those approaches, however, fail to capture the sequential dependencies between the users' visits, as discussed previously.
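A minimal sketch of the two steps just described, grouping users with common interests and mining sequential patterns within a group, might look as follows. The greedy Jaccard-based grouping and the length-2 patterns are simplifying assumptions standing in for the actual clustering and sequential pattern mining algorithms of our approach:

```python
from collections import Counter

def group_users(user_sessions, threshold=0.5):
    """Greedily group users whose visited-category sets overlap.

    user_sessions: dict mapping a user to a list of sessions, each
    session being a list of category names in visit order. This is a
    stand-in for the clustering step; any clustering algorithm could
    be substituted here.
    """
    visited = {u: {c for s in sessions for c in s}
               for u, sessions in user_sessions.items()}
    groups = []  # each group: (set of users, union of their categories)
    for user, cats in visited.items():
        for members, group_cats in groups:
            jaccard = len(cats & group_cats) / len(cats | group_cats)
            if jaccard >= threshold:
                members.add(user)
                group_cats |= cats
                break
        else:
            groups.append(({user}, set(cats)))
    return [members for members, _ in groups]

def mine_transitions(sessions, min_support=2):
    """Frequent length-2 sequential patterns (a visited right before b)."""
    counts = Counter()
    for session in sessions:
        for i in range(len(session) - 1):
            counts[(session[i], session[i + 1])] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

def recommend(current_category, patterns):
    """Categories most often visited right after the current one."""
    followers = [(n, b) for (a, b), n in patterns.items()
                 if a == current_category]
    return [b for n, b in sorted(followers, reverse=True)]

# Hypothetical sessions of one interest group:
sessions = [["Arts", "Music"], ["Arts", "Music"], ["Arts", "Movies"]]
patterns = mine_transitions(sessions)
print(recommend("Arts", patterns))  # ['Music']
```

Unlike a vector-based collaborative filtering scheme, the mined patterns keep the order of visits, so the recommendation depends on where in the directory the user currently is.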
MODELLING TOPIC DIRECTORIES
A topic directory is a hierarchical organization of thematic categories. Each category contains resources (i.e., links to web pages). A category may have subcategories and/or related categories. Subcategories narrow the content of broad categories. Related categories contain similar resources, but they may exist in different places of the directory. Note that the related relationship is bidirectional, that is, if category N is related to M, then M is also related to N. A resource cannot belong to more than one category. We consider a graph representation of topic directories.
Definition 3.1. A topic directory D is a labelled graph G(V,E), where V is the set of nodes and E the set of edges, such that: (a) each node in V corresponds to a category of D, and is labelled by the category name; (b) for each pair of nodes (n,m) that corresponds to categories (N,M), where N is a subcategory of M in D, there is a directed edge from m to n; and (c) for each pair of nodes (n,m) that corresponds to categories (N,M), where N and M are related categories in D, there is a bidirected edge between n and m.
The graph G(V,E) may also have shortcuts, which are directed edges connecting nodes in V. Examples of such graphs are illustrated in Figure 4. The role of shortcuts as a means for personalizing the directory will be further discussed in Section 5.
The case study of the Open Directory Project. In our work, we use the Open Directory Project (ODP) as a case study. Figure 1 illustrates a part of the ODP directory. In
ODP, there are three types of categories: (a) subcategories (to narrow the content of broad categories), (b) relevant categories (i.e., the ones appearing inside the "see also" section), and (c) symbolic categories (i.e., denoted by the @ character after the category's name). Symbolic categories are subcategories that exist in different places of the directory. We consider relevant categories as related categories, according to Definition 3.1.
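Definition 3.1 translates directly into a small data structure. The class name, method names, and ODP category names below are illustrative assumptions, not part of the paper:

```python
class TopicDirectory:
    """Labelled-graph model of a topic directory (Definition 3.1).

    Nodes are category names; edges are stored as a set of directed
    (source, target) pairs, so a related ("see also") link is simply
    a pair of opposite directed edges.
    """
    def __init__(self):
        self.nodes = set()
        self.edges = set()

    def add_subcategory(self, parent, child):
        # Rule (b): a directed edge from the broad category to the
        # narrower one.
        self.nodes |= {parent, child}
        self.edges.add((parent, child))

    def add_related(self, a, b):
        # Rule (c): related categories are linked in both directions.
        self.nodes |= {a, b}
        self.edges.add((a, b))
        self.edges.add((b, a))

    def add_shortcut(self, source, target):
        # A personalization shortcut: an extra directed edge in V.
        self.nodes |= {source, target}
        self.edges.add((source, target))

# Hypothetical fragment of the ODP hierarchy:
d = TopicDirectory()
d.add_subcategory("Arts", "Arts/Music")
d.add_subcategory("Arts/Music", "Arts/Music/Jazz")
d.add_related("Arts/Music", "Shopping/Music")  # a "see also" link
d.add_shortcut("Arts", "Arts/Music/Jazz")      # personalized shortcut
```

Storing related links as two opposite directed edges keeps the bidirectional "related" relationship of the definition without a second edge type.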
Navigation patterns.
To represent the navigation behaviour of users when browsing the directory, we use the notion of navigation patterns. A navigation pattern is the sequence of categories visited by a user during a session. We note that such patterns may include multiple occurrences of the same categories. This might be the result of users going back and forth within a path in the directory. Finally, we also underline that during a session, a user may pursue more than one topic of interest.
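A navigation pattern as defined above can be extracted from a user's timestamped visits by splitting on periods of inactivity. The 30-minute session gap and the log entries below are assumptions for illustration; the paper does not fix a particular sessionization rule:

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity timeout

def navigation_patterns(log):
    """Split one user's category visits into per-session sequences.

    log: list of (timestamp, category) tuples sorted by time.
    Repeated categories are kept, since going back and forth within
    a path in the directory is part of the pattern.
    """
    patterns, current, last_time = [], [], None
    for ts, category in log:
        if last_time is not None and ts - last_time > SESSION_GAP:
            patterns.append(current)  # inactivity gap: close the session
            current = []
        current.append(category)
        last_time = ts
    if current:
        patterns.append(current)
    return patterns

# Hypothetical log of a single user:
log = [
    (datetime(2019, 4, 8, 10, 0), "Arts"),
    (datetime(2019, 4, 8, 10, 2), "Arts/Music"),
    (datetime(2019, 4, 8, 10, 3), "Arts"),    # back and forth in a path
    (datetime(2019, 4, 8, 11, 0), "Sports"),  # new session, new topic
]
print(navigation_patterns(log))
# [['Arts', 'Arts/Music', 'Arts'], ['Sports']]
```

The second session also illustrates that, across (and even within) sessions, a user may pursue more than one topic of interest.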