Personalizing the Web Directories



    Personalizing Web Directories with the Aid of Web Usage Data

    Literature Survey:

    Computational intelligence models for Personalization

CI has been defined as the study of adaptive mechanisms to enable or facilitate intelligent behavior in complex and changing environments. It is an ongoing and evolving area of research whose roots go back to the founding of artificial intelligence, the term coined by John McCarthy in 1956. Different CI models related to personalization are given in Figure 1.

Fuzzy Systems (FS) and Fuzzy Logic (FL) mimic the way people think, that is, with approximate reasoning rather than precise values. Fuzzy methods were found to be instrumental in web-based personalization when used with Web Usage Mining (WUM) data. User profiles are processed using fuzzy approximate reasoning to recommend personalized URLs. Handling of user profiles with fuzzy concepts has been used by IR systems to provide users with personalized search engine results. Based on users' web usage history data, fuzzy methods have been used to categorize or cluster web objects for web personalization. Fuzzy logic was used with collective or collaborative data mining techniques to improve the quality of intelligent agents that provide personalized services to users.

    Evolutionary Algorithms (EA) use mechanisms inspired by biological evolution such as

reproduction, mutation, recombination and selection. One of the most popular EAs is the Genetic Algorithm (GA), which mimics the gene structure in humans based on evolutionary theory. GA

    has been used to address some of the flaws of WUM and to tackle different problems such as


    personalized search, IR, query optimization and document representation. GA was applied with

    user log mining techniques to get a better understanding of user preferences and discover

associations between different URL addresses. GA was also used to include randomness in content filtering rather than strict adherence to predefined user profiles; this is known as the element of serendipity in IR. A modified GA was introduced for the optimal design of a website based on multiple optimization criteria, taking download time, visualization and product association level

    into consideration. Artificial Neural Networks (ANN) or simply Neural Networks (NN) mimic

    the biological process of the human brain. A NN can be trained to group users into specified

    categories or into clusters.

    This is useful in personalization as each user group may possess similar preferences and

    hence the content of a web interface can be adapted to each group. NNs can also be trained to

learn the behavior of website users. Inputs for this learning can be derived from WUM data and Collaborative Filtering (CF) techniques. The learning ability of neural networks can also be used for real-time adaptive interaction instead of only common, static content-based personalization. A NN was used to construct user profiles, and a NN was implemented to categorize e-mail into folders. Swarm Intelligence

    (SI) is based on the collective behavior of animals in nature such as birds, ants, bees and wasps.

    Particle Swarm Optimization (PSO) models the convergence behavior of a flock of birds. PSO

was used to analyze the unique behavior of web users through manipulation of web access log data and

    user profile data. Personalized recommendation based on individual user preferences or CF data

    has also been explored using PSO. This was done by building up profiles of users and then using

    an algorithm to find profiles similar to the current user by supervised learning. Personalized and

    automatic content sequencing of learning objects was implemented using PSO. Research has also

    been done using PSO as a clustering algorithm but no use of this approach to clustering was

    found in relation to website personalization.

    Another SI technique is Ant Colony Optimization (ACO) which models the behavior of

ants that leave the nest to wander randomly in search of food; when food is found, they leave a trail of pheromone when returning to the colony. ACO led to the development of shortest-path optimization algorithms and has applications in routing optimization. ACO has

    been used to classify web users in WUM (cAnt-WUM algorithm) allowing personalization of the

    web system to each user class. Bees Colony Optimization (BCO) is built on basic principles of


    collective bee intelligence. It has been applied to web-based systems to improve the IR systems

of search engines incorporating WUM data; however, the issue of personalization is not yet known to have been directly addressed. Wasp Colony Optimization (WCO) or Wasp Swarm Optimization (WSO) has not been exploited as much as the other SI methods. It models

    the behavior of insect wasps in nature. WCO has also been applied to the NP-hard optimization

    problem known as the Multiple Recommendations Problem (MRP). It occurs when several

personalized recommenders are running simultaneously, which results in churning, where a user

    is presented with uninteresting recommendations. Further research has to be done however, using

    WCO on real, scalable and dynamic data sets. Artificial Immune Systems (AIS) mimic the

    functioning of the human immune system as the body learns to handle antigens by producing

antibodies based on previous experience. Applications of AIS include pattern recognition, classification tasks, data clustering and anomaly detection. AIS has already been applied to personalization of web-based systems: the human body is represented by a website, incoming web requests are antigens, and learning is paralleled to the immune system's learning to produce the right antibodies to combat each antigen. Using this analogy, an AIS based on WUM was used as a learning system for a website. It is common practice to

    combine CI techniques to create a hybrid which seeks to overcome the weakness of one

    technique with the strength of another. Several hybrids were applied to personalization of web

based systems. NN was combined with FL to give a hybrid Neuro-Fuzzy strategy for Web

    personalization. The topology and parameters of NN were used to obtain the structure and

    parameters of fuzzy rules.

    The learning ability of NN was then applied to this set of rules. The ability of

    evolutionary techniques such as GA, to extract implicit information from user logs was

    combined with fuzzy techniques to include vagueness in decision making. This FL-GA hybrid

allows more accurate and flexible modeling of user preferences. User data obtained from web usage logs is the input for a NN. The weights and fitness functions derived from NN training are optimized using GA to derive classification rules that govern personalized decision making in e-

    Business. A fuzzy-PSO approach was introduced to personalize Content Based Image Retrieval

    (CBIR). User logs were analyzed and used as the PSO input. Fuzzy principles were applied to

    the PSO velocity, position and weight parameters.


    Personalization of web-based systems using CI models

    Based on the eight major CI methods described above, it is noticed that WUM is the common

    input for all models. Data mining in a sense provides the fuel for personalization using CI

methods. CI methods are comparable to a taxonomy of intelligent agents for personalization. Building on ideas from this approach, a taxonomy for personalization of web-based systems was proposed (cf. Fig. 2). Two main uses are identified for CI methods when applied to

    personalization: profile generation and profile exploitation. User profiles can further be used to

    personalize either the navigation or content of web based systems.

    Profile generation

    Profile generation is the creation of user profiles based on both implicit WUM data and explicit

user preferences. User profiles can be generated either per individual user or for groups of users that appear to have similar previous web usage habits, using CF techniques. Five CI methods found in

    previous work which were applied to user profile generation of web based systems are: FL, NN,

PSO, ACO and AIS. FL models are constructed to identify ambiguity in user preferences; however, there are many ways of interpreting fuzzy rules, and translating human knowledge into formal controls can be challenging. NN was trained to identify similarities in user behavior; however, for proper training the sample size must be large, and the NN can become complex due to overfitting. Both PSO and GA were used to link users' behavior by profile matching, but PSO

    was found to outperform GA in terms of speed, execution and accuracy. ACO was used to model

users with relative accuracy and simplicity; however, its computational complexity leads to long computing times. The PSO approach was found to be faster when compared to ACO. AIS was used to dynamically adapt profiles to changing and new behaviors. The theoretical concept of AIS is not fully sound, however, since in reality other human systems support the functioning of the immune system and these are not modeled. The artificial cells in AIS do not work autonomously; therefore the success or failure of one part of the system may determine the performance of the following

    step.


    A hybrid method uses GA to optimize the input values of a NN, to maximize the output. In this

    way the slow learning process of NN is helped with the optimization ability of GA.

    Profile exploitation

Profile exploitation personalizes various aspects of a web-based system based on predefined user

    profiles. Two main approaches to personalize web based systems were identified as

    personalization of navigation and personalization of content (cf. fig.2).

    Personalized navigation

    Personalized navigation includes WUM for personalized IR, such as search engine results, and

    URL recommendations. FL, BCO and GA were three main CI methods found for navigation

    personalization (cf. fig.2). FL was used for offline processing to recommend URLs to users. It is

relatively fast, deals with the natural overlap in user interests and is suitable for real-time recommendations. Testing of various FL approaches, however, showed slightly lower precision, and the fuzzy component is harder to program. GA was applied for search and retrieval, but it is known to be more

    general and abstract than other optimization methods and does not always provide the optimal

    solution. BCO was used for IR but it is not a widely covered area of research and currently there

    is a better theoretical than experimental understanding. ACO is similar to BCO and has seen

    more successful applications. A hybrid between GA and FL was applied to this area. Fuzzy set

    techniques were used for better document modeling and genetic algorithms for query

    optimization to give personalized search engine results. A Neuro-Fuzzy method combined the


    learning ability of NN with the representation of vagueness in Fuzzy Systems to overcome the

    NN black-box behavior and present more meaningful results than FL alone.

    Personalized content

    Personalized content refers to WUM for personalized web objects on each web page and

    sequence of content. FL, NN, GA, PSO and WCO were the main CI techniques found with

    applications in this area (cf. fig.2). FL was used for a web search algorithm and to automate

recommendations to e-commerce customers. It was found to be flexible and able to support e-commerce applications. NN was used to group users into clusters for content recommendations; however, the overfitting problem still exists. GA was applied to devise the best arrangement of web objects. It was found to be scalable; however, it is suggested to be used in collaboration with

    other data mining tools. PSO was used to sequence Learning Objects and was chosen because of

its relatively small number of parameters compared with other techniques such as GA. PSO parameter selection is also a well-researched area. Using a modified PSO for data clustering was found to give accurate results. WCO was applied to the churning problem of uninteresting content recommendations to users. This is mostly a theoretical concept, not well tested on real data, and other biologically inspired algorithms such as ACO have found more success. Fuzzy-PSO was

    created to help improve the effectiveness of standard PSO particle movement in a content based

    system.

    PROBABILISTIC LATENT SEMANTIC MODELS OF WEB USER NAVIGATIONS

The overall process of Web usage mining consists of three phases: data preparation and transformation, pattern discovery, and pattern analysis. The data preparation phase transforms raw Web log data into transaction data that can be processed by various data mining tasks. In the pattern discovery phase, a variety of data mining techniques, such as clustering, association rule mining, and sequential pattern discovery, can be applied to the transaction data. The discovered patterns may then be analyzed and interpreted for use in such applications as Web personalization. The usage data preprocessing phase [8, 32] results in a set of n page views, P = {p1, p2, . . . , pn}, and a set of m user sessions, U = {u1, u2, . . . , um}. A page view is an aggregate representation of a collection of Web objects (e.g. pages) contributing to the display on a user's browser resulting from a single user action (such as a click-through, product purchase, or database query). The Web session data can be conceptually viewed as an m × n session-pageview


matrix UP = [w(ui, pj)]m×n, where w(ui, pj) represents the weight of page view pj

    in a user session ui. The weights can be binary, representing the existence or non-existence of the

    page view in the session, or they may be a function of the occurrence or duration of the page

view in that session. PLSA is a latent variable model which associates a hidden (unobserved) factor variable Z = {z1, z2, ..., zl} with the observations in the co-occurrence data. In our context, each observation corresponds to an access by a user to a Web resource in a particular session, which is represented as an entry of the m × n co-occurrence matrix UP. The probabilistic latent

    factor model can be described as the following generative model:

1. Select a user session ui from U with probability Pr(ui);

2. Pick a latent factor zk with probability Pr(zk|ui);

3. Generate a page view pj from P with probability Pr(pj|zk).

    As a result we obtain an observed pair (ui, pj ), while the latent factor variable zk is

    discarded. Translating this process into a joint probability model results in the following:
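The displayed equation is not reproduced here; in the notation just introduced, the standard PLSA joint probability is presumably:

\[
\Pr(u_i, p_j) \;=\; \Pr(u_i)\sum_{k=1}^{l}\Pr(z_k \mid u_i)\,\Pr(p_j \mid z_k),
\]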

where the sum runs over all possible choices of zk from which the observation could have been generated.

    Using Bayes rule, it is straightforward to transform the joint probability into:
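Again in the standard symmetric PLSA form (reconstructed, not quoted verbatim):

\[
\Pr(u_i, p_j) \;=\; \sum_{k=1}^{l}\Pr(z_k)\,\Pr(u_i \mid z_k)\,\Pr(p_j \mid z_k).
\]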

    Now, in order to explain a set of observations (U, P), we need to estimate the parameters Pr(zk),

    Pr(ui|zk), Pr(pj |zk), while maximizing the following likelihood L(U, P) of the observations,
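In its usual form, reconstructed from the definitions above with w(ui, pj) the entries of UP, this likelihood is:

\[
\mathcal{L}(U, P) \;=\; \sum_{i=1}^{m}\sum_{j=1}^{n} w(u_i, p_j)\,\log \Pr(u_i, p_j).
\]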


The Expectation-Maximization (EM) algorithm is a well-known approach to performing maximum

    likelihood parameter estimation in latent variable models. It alternates two steps:

    (1) an expectation (E) step where posterior probabilities are computed for latent variables, based

    on the current estimates of the parameters,

(2) a maximization (M) step, where the parameters are re-estimated in order to maximize the expectation of

    the complete data likelihood. The EM algorithm begins with some initial values of Pr(zk),

    Pr(ui|zk), and Pr(pj |zk). In the expectation step we compute:
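That is, the posterior probability of each latent factor given an observed session-pageview pair; in the standard formulation this is:

\[
\Pr(z_k \mid u_i, p_j) \;=\; \frac{\Pr(z_k)\,\Pr(u_i \mid z_k)\,\Pr(p_j \mid z_k)}
{\sum_{k'=1}^{l}\Pr(z_{k'})\,\Pr(u_i \mid z_{k'})\,\Pr(p_j \mid z_{k'})}.
\]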

In the maximization step, we aim at maximizing the expectation of the complete data likelihood E(LC), while taking into account the constraint that the factor probabilities sum to one, Σ_{k=1..l} Pr(zk) = 1, as well as the analogous normalization constraints Σ_i Pr(ui|zk) = 1 and Σ_j Pr(pj|zk) = 1 on the two conditional probabilities.


Through the use of Lagrange multipliers (see for details), we can solve the constrained maximization problem to get the following equations for the re-estimated parameters:
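The re-estimation equations, written here in the usual weighted-PLSA form rather than quoted from the original, are:

\[
\Pr(u_i \mid z_k) = \frac{\sum_{j} w(u_i, p_j)\,\Pr(z_k \mid u_i, p_j)}
{\sum_{i'}\sum_{j} w(u_{i'}, p_j)\,\Pr(z_k \mid u_{i'}, p_j)},\qquad
\Pr(p_j \mid z_k) = \frac{\sum_{i} w(u_i, p_j)\,\Pr(z_k \mid u_i, p_j)}
{\sum_{i}\sum_{j'} w(u_i, p_{j'})\,\Pr(z_k \mid u_i, p_{j'})},
\]
\[
\Pr(z_k) = \frac{\sum_{i}\sum_{j} w(u_i, p_j)\,\Pr(z_k \mid u_i, p_j)}
{\sum_{i}\sum_{j} w(u_i, p_j)}.
\]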

    Iterating the above computation of expectation and maximization steps monotonically increases

    the total likelihood of the observed data L(U, P) until a local optimal solution is reached. The

    computational complexity of this algorithm is O(mnl), where m is the number of user sessions, n

    is the number of page views, and l is the number of factors. Since the usage observation matrix

    is, in general, very sparse, the memory requirements can be dramatically reduced using efficient

    sparse matrix representation of the data.
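As a concrete illustration of the procedure described above, the sketch below runs the weighted-PLSA EM iteration on a small dense session-pageview matrix; a production implementation would use the sparse representation just mentioned, and all function and variable names here are ours, not from the paper:

```python
import numpy as np

def plsa_em(UP, n_factors, n_iter=50, seed=0):
    """Fit PLSA to an m x n session-pageview weight matrix UP via EM (dense sketch)."""
    rng = np.random.default_rng(seed)
    m, n = UP.shape
    # Random, normalized initialization of Pr(z), Pr(u|z) and Pr(p|z).
    pz = rng.random(n_factors); pz /= pz.sum()
    pu_z = rng.random((m, n_factors)); pu_z /= pu_z.sum(axis=0)
    pp_z = rng.random((n, n_factors)); pp_z /= pp_z.sum(axis=0)
    for _ in range(n_iter):
        # E-step: posterior Pr(z | u, p) for every (session, pageview) pair.
        joint = np.einsum('k,ik,jk->ijk', pz, pu_z, pp_z)           # m x n x l
        post = joint / np.clip(joint.sum(axis=2, keepdims=True), 1e-12, None)
        # M-step: re-estimate the parameters from the weighted posteriors.
        resp = UP[:, :, None] * post                                # w(u,p) * Pr(z|u,p)
        pu_z = resp.sum(axis=1); pu_z /= np.clip(pu_z.sum(axis=0), 1e-12, None)
        pp_z = resp.sum(axis=0); pp_z /= np.clip(pp_z.sum(axis=0), 1e-12, None)
        pz = resp.sum(axis=(0, 1)); pz /= pz.sum()
    return pz, pu_z, pp_z
```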

    DISCOVERY AND ANALYSIS OF USAGE PATTERN WITH PLSA

    One of the main advantages of PLSA model in Web usage mining is that it generates

    probabilities which quantify relationships between Web users and tasks, as well as Web pages

    and tasks. From these basic probabilities, using probabilistic inference, we can derive

    relationships among users, among pages, and between users and pages. Thus this framework

    provides a flexible approach to model a variety of types of usage patterns. In this section, we will

    describe various usage patterns that can be derived using the PLSA model. As noted before, the

PLSA model generates probabilities Pr(zk), which measure the probability that a certain task is

    chosen; Pr(ui|zk), the probability of observing a user session given a certain task; and Pr(pj |zk),

    the probability of a page being visited given a certain task. Applying Bayes rule to these

    probabilities, we can generate the probability that a certain task is chosen given an observed user

    session:
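Presumably, via Bayes' rule:

\[
\Pr(z_k \mid u_i) \;=\; \frac{\Pr(z_k)\,\Pr(u_i \mid z_k)}{\sum_{k'=1}^{l}\Pr(z_{k'})\,\Pr(u_i \mid z_{k'})},
\]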


    and the probability that a certain task is chosen given an observed page view:
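which, analogously (again reconstructed from the notation above), would be:

\[
\Pr(z_k \mid p_j) \;=\; \frac{\Pr(z_k)\,\Pr(p_j \mid z_k)}{\sum_{k'=1}^{l}\Pr(z_{k'})\,\Pr(p_j \mid z_{k'})}.
\]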

    In the following, we discuss how these models can be used to derive different kinds of usage

    patterns. We will provide several illustrative examples of such patterns, from real Web usage

    data, in Section 4.

    Characterizing Tasks by Page views or by User Sessions

    Capturing the tasks or objectives of Web users can help the analyst to better understand these

users' preferences and interests. Our goal is to characterize each task, represented by a latent

    factor, in a way that is easy to interpret. One possible approach is to find the prototypical pages

    that are strongly associated with a given task, but that are not commonly identified as part of

    other tasks. We call each such page a characteristic page for the task, denoted by pch. This

    definition of prototypical has two consequences; first, given a task, a page which is seldom

    visited cannot be a good characteristic page for that task. Secondly, if a page is frequently visited

    as part of a certain task, but is also commonly visited in other tasks, the page is not a good

    characteristic page. So we define characteristic pages for a task zk as the set of all pages, pch,

    which satisfy:
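Judging from the description above and from the use of Pr(page|task) · Pr(task|page) in the experiments later on, the criterion is presumably of the form:

\[
\Pr(p^{ch} \mid z_k)\cdot\Pr(z_k \mid p^{ch}) \;\ge\; \mu,
\]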

where μ is a predefined threshold. By examining the characteristic pages of each task, we can

    obtain a better understanding of the nature of these tasks. Characterizing tasks in this way can

    lead to several applications. For example, most Web sites allow users to search for relevant

    pages using keywords. If we also allow users to explicitly express their intended task(s) (via

    inputting task descriptions or choosing from a task list), we can return the characteristic pages for

    the specified task(s), which are likely to lead users directly to their objectives. A similar


    approach can be used to identify prototypical user sessions for each task. We believe that a

    user session involving only one task can be considered as the characteristic session for the task.

    So, we define the characteristic user sessions, uch, for a task, zk, as sessions which satisfy
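By analogy with the characteristic pages, one plausible form of this criterion is:

\[
\Pr(u^{ch} \mid z_k)\cdot\Pr(z_k \mid u^{ch}) \;\ge\; \mu',
\]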

where μ′ is a predefined threshold. When a user selects a task, returning such exemplar sessions

    can provide a guide to the user for accomplishing the task more efficiently. This approach can

    also be used in the context of collaborative filtering to identify the closest neighbors to a user

    based on the tasks performed by that user during an active session.

    User Segments Identification

    Identifying Web user groups or segments is an important problem in Web usage mining. It helps

Web site owners to understand and capture users' common interests and preferences. We can identify user segments in which users perform common or similar tasks, by making inferences

    based on the estimated conditional probabilities obtained in the learning phase. For each task zk,

    we choose all user sessions with probability Pr(ui|zk) exceeding a certain threshold to get a

session set C. Since each user session can also be represented as a page view vector, we can further aggregate these user sessions into a single page view vector to facilitate interpretation. The algorithm for generating user segments is as follows:

1. Input: Pr(ui|zk), the user session-page matrix UP, and a threshold μ.

2. For each zk, choose all the sessions with Pr(ui|zk) ≥ μ to get a candidate session set C.

3. For each zk, compute the weighted average of all the chosen sessions in set C to get a page

    vector defined as:


4. For each factor zk, output the resulting page vector. This page vector consists of a set of weights, one for each page view in P, which represent the relative visit frequency of each page view for this user

    segment. We can sort the weights so that the top items in the list correspond to the most

    frequently visited pages for the user segment. These user segments provide an aggregate

representation of all individual users' navigational activities in a particular group. In addition

    to their usefulness in Web analytics, user segments also provide the basis for automatically

    generating item recommendations. Given an active user, we compare her activity to all user

    segments and find the most similar one. Then, we can recommend items (e.g., pages) with

    relatively high weights in the aggregate representation of the segment. In Section 4, we conduct

    experimental evaluation of the user segments generated from two real Web sites.
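A short sketch of how the segment page vectors and the recommendation step described above could be realized is given below; weighting each chosen session by Pr(u|z) is one plausible reading of step 3, and all identifiers are ours:

```python
import numpy as np

def user_segments(UP, pu_z, threshold=0.01):
    """One aggregate page-weight vector per latent factor, built from the sessions
    whose Pr(u|z) exceeds the threshold (the candidate set C above)."""
    segments = []
    for k in range(pu_z.shape[1]):
        chosen = np.where(pu_z[:, k] >= threshold)[0]
        w = pu_z[chosen, k]
        segments.append((w[:, None] * UP[chosen]).sum(axis=0) / max(w.sum(), 1e-12))
    return np.array(segments)                       # shape: n_factors x n_pageviews

def recommend(active_session, segments, top_n=5):
    """Match the active session to the most similar segment (cosine similarity)
    and return the indices of its highest-weight pages not yet visited."""
    sims = segments @ active_session / (
        np.linalg.norm(segments, axis=1) * np.linalg.norm(active_session) + 1e-12)
    best = segments[int(np.argmax(sims))]
    return [j for j in np.argsort(-best) if active_session[j] == 0][:top_n]
```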

    Identifying the Underlying Tasks of a User Session

    To better understand the preferences and interests of a single user, it is necessary to identify the

    underlying tasks performed by the user. The PLSA model provides a straightforward way to

    identify the underlying tasks in a given user session. This is done by examining Pr(task|session),

    which is the probability of a task being performed, given the observation of a certain user

    session. For a user session u, we select the top tasks zk with the highest Pr(zk|u) values, as the

    primary task(s) performed by this user. For a new user session, unew, not appearing in the

historical navigational data, we can adopt a folding-in method, as introduced in the original PLSA work, to generate Pr(task|session) via the EM algorithm. In the E-step, we compute
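In the usual folding-in scheme, which keeps Pr(zk) and Pr(pj|zk) fixed and only fits the mixing weights of the new session, the updates are presumably:

\[
\Pr(z_k \mid u_{new}, p_j) = \frac{\Pr(z_k \mid u_{new})\,\Pr(p_j \mid z_k)}{\sum_{k'}\Pr(z_{k'} \mid u_{new})\,\Pr(p_j \mid z_{k'})},
\qquad
\Pr(z_k \mid u_{new}) \;\propto\; \sum_{j} w(u_{new}, p_j)\,\Pr(z_k \mid u_{new}, p_j).
\]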

Here, w(unew, p) represents the new user's visit frequency on the specified page p. After we

    generate these probabilities, we can use the same method to identify the primary tasks for the

    new user session. The identification of the primary tasks contained in user sessions can lead to

    further analysis. For example, after identifying the tasks in all user sessions, each session u can

    be transformed into a higher-level representation,


    where zi denotes task i and wi denotes Pr(zi|u). This, in turn, would allow the discovery and

    analysis of task-level usage patterns, such as determining which tasks are likely to be visited

    together, or which tasks are most (least) popular, etc. Such higher-level patterns can help site

    owners better evaluate the Web site organization.

    Integration of Usage Patterns with Web Content Information

    Recent studies have emphasized the benefits of integrating semantic knowledge about the

    domain (e.g., from page content features, relational structure, or domain ontologies) in the Web

    usage mining process. The integration of content information about Web objects with usage

    patterns involving those objects provides two primary advantages. First, the semantic

    information provides additional clues about the underlying reasons for which a user may or may

    not be interested in particular items. Secondly, in cases where little or no rating or usage

    information is available (such as in the case of newly added items, or in very sparse data sets),

    the system can still use the semantic information to draw reasonable conclusions about user

    interests. The PLSA model described here also provides an ideal and uniform framework for

    integrating content and usage information. Each page view contains certain semantic knowledge

    represented by the content information associated with that page view.

    By applying text mining and information retrieval techniques, we can represent each page

    view as an attribute vector. Attributes may be the keywords extracted from the page views, or

    structured semantic attributes of the Web objects contained in the page views. As before, we

assume there exists a set of hidden factors z ∈ Z = {z1, z2, ..., zl}, each of which represents a

    semantic group of pages. They can be a group of pages which have similar functionalities for

    users performing a certain task, or a group of pages which contain similar content information or

    semantic attributes. However, now, in addition to the set of page views, P, and the set of usersessions, U, we also specify a set of t semantic attributes, A = {a1, a2, . . . , at}. To model the

    user-page observations, we use


    These models can then be combined based on the common component Pr(pj |zk). This can be

achieved by maximizing the following log-likelihood function with a predefined weight α.
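One common way to write such a combined objective, with w′(aq, pj) denoting attribute-pageview weights analogous to w(ui, pj) (a reconstruction, not a verbatim quote), is:

\[
\mathcal{L} \;=\; \alpha\sum_{i}\sum_{j} w(u_i, p_j)\log\Pr(u_i, p_j)
\;+\;(1-\alpha)\sum_{q}\sum_{j} w'(a_q, p_j)\log\Pr(a_q, p_j),
\]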

where α is used to adjust the relative weights of the two sets of observations. The EM algorithm can again

    be used to generate estimates for Pr(zk), Pr(ui|zk), Pr(pj |zk), and Pr(aq|zk). By applying

    probabilistic inferences, we can measure the relationships among users, pages, and attributes,

thus we are able to answer questions such as "What are the most important attributes for a group of users?" or "Given a Web page with a specified set of attributes, will it be of interest to a given user?", and so on.

    EXPERIMENTS WITH PLSA MODEL

    In this section, we use two real data sets to perform experiments with our PLSA-based Web

usage mining framework. We first provide several illustrative examples of characterizing users'

    tasks, as introduced in the previous section, and of identifying the primary tasks in an individual

    user session. We then perform two types of evaluations based on the generated user segments.

    First we evaluate individual user segments to determine the degree to which they represent

activities of similar users. Secondly, we evaluate the effectiveness of these user segments in the

    context of generating automatic recommendations. In each case, we compare our approach with

    the standard clustering approach for the discovery of Web user segments. In order to compare the

clustering approach to the PLSA-based model, we adopt the algorithm presented in prior work for creating

    aggregate profiles based on session clusters. In the latter approach, first, we apply a


    multivariate clustering technique such as k-means to user-session data in order to obtain a set of

    user clusters TC = {c1, c2, ..., ck}; then, an aggregate representation, prc, is generated for each

    cluster c as a set of page view-weight pairs:

where the significance weight, weight(p, prc), is given by weight(p, prc) = (1/|c|) Σ_{u∈c} w(p, u), and w(p, u) is the weight of page view p in the user session u ∈ c. Thus, each segment is represented

    as a vector in the page view space. In the following discussion, by a user segment, we mean its

    aggregate representation as a page view vector.

    Data Sets

    In our experiments, we use Web server log data from two Web sites. The first data set is based

on the server log data from the host Computer Science department. This Web site provides

    various functionalities to different types of Web users. For example, prospective students can

    obtain program and admissions information or submit online applications. Current students can

    browse course information, register for courses, make appointments with faculty advisors, and

    log into the Intranet to do degree audits. Faculty can perform student advising functions online or

    interact with the faculty Intranet. After data preprocessing, we identified 21,299 user sessions

    (U) and 692 pageviews (P), with each user session consisting of at least 6 pageviews. This data

    set is referred to as the CTI data. The second data set is from the server logs of a local affiliate

    of a national real estate company. The primary function of the Web site is to allow prospective

buyers to visit various pages containing information related to some 300 residential properties. The

    portion of the Web usage data during the period of analysis contained approximately 24,000 user

    sessions from 3,800 unique users. During preprocessing, we recorded each user-property pair

    and the corresponding visit frequency. Finally, the data was filtered to limit the final data set to

    those users that had visited at least 3 properties. In our final data matrix, each row represented a

    user vector with properties as dimensions and visit frequencies as the corresponding dimension

    values. We refer to this data set as the Realty data. Each data set was randomly divided into

    multiple training and test sets to use with 10-fold cross-validation. By conducting sensitivity

    analysis, we chose 30 factors in the case of CTI data and 15 factors for the Realty data. To avoid

    overtraining, we implemented the Tempered EM algorithm to train the PLSA model.


Example Usage Patterns Based on the PLSA Model

    Figure 1 depicts an example of the characteristic pages for a specific discovered task in the CTI

data. The first 6 pages have the highest Pr(page|task) · Pr(task|page) values, and thus are considered as

    the characteristic pages of this task. Observing these characteristic pages, we may infer that this

    task corresponds to prospective students who are completing an online admissions application.

    Here characteristic has two implications. First, if a user wants to perform this task, he/she must

    visit these pages to accomplish his/her goal. Secondly, if we find a user session contains these

pages, we can claim the user must have performed the online application task. Some pages may not be

    characteristic pages for the task, but may still be useful for the purpose of analysis. An example

    of such a page is the /news/ page which has a relatively high Pr(page|task) value, and a low


Pr(task|page) value. Indeed, by examining the site structure, we found that this page serves

    as a navigational page, and it can lead users to different sections of the site to perform different

tasks (including the online application). This kind of discovery can help Web site designers to identify the functionalities of pages and reorganize Web pages to facilitate users' navigation.

    Figure 2 identifies three tasks in the Realty data. In contrast to the CTI data, in this data set the

    tasks represent common real estate properties visited by users, thus reflecting user interests in

    similar properties. The similarities are clearly observed when property attributes are shown for

    each characteristic page. From the characteristic pages of each task, we infer that Task 4

represents users' interest in newer and more expensive properties, while Task 0 reflects interest in older and very low priced properties. Task 5 represents interest in properties with midrange prices.

    We can also identify prototypical users corresponding to specific tasks. An example of such a

    user session is depicted in Figure 3 corresponding to yet another task in the realty data which

    reflects interest in very high priced and large properties (task not shown here).


Our final example in this section shows how the prominent tasks contained in a given user session can be identified. Figure 4 depicts a random user session from the CTI data. Here we only show the task IDs which have the highest probabilities Pr(task|session). As indicated, the

    dominant tasks for this user session are Tasks 3 and 25. The former is, in fact, the online

    application task discussed earlier, and the latter is a task that represents international students

    who are considering applying for admissions. It can be easily observed that, indeed, this session

    seems to identify an international student who, after checking admission and visa requirements,

    has applied for admissions online.


    Evaluation of User Segments and Recommendations

    We used two metrics to evaluate the discovered user segments. The first is called the Weighted

    Average Visit Percentage (WAVP). WAVP allows us to evaluate each segment individually

    according to the likelihood that a user who visits any page in the segment will visit the rest of the

    pages in that segment during the same session. Specifically, let T be the set of transactions in the

    evaluation set, and for a segment s, let Ts denote a subset of T whose elements contain at least

one page from s. The weighted average similarity to the segment s over these transactions is then computed, taking both the transactions and the segment as pageview vectors:
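One standard way of writing this metric, with t · s the dot product of a transaction vector and the segment vector (reconstructed here, since the displayed formula is not reproduced), is:

\[
\mathrm{WAVP}(s) \;=\; \left(\frac{\sum_{t \in T_s} t \cdot s}{|T_s|}\right) \Big/ \sum_{p \in s}\mathrm{weight}(p, s).
\]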


    Note that a higher WAVP value implies better quality of a segment in the sense that the segment

    represents the actual behavior of users based on their similar activities. For evaluating the

    recommendation effectiveness, we use a metric called Hit Ratio in the context of top-N

    recommendation.

    For each user session in the test set, we took the first K pages as a representation of an

    active session to generate a top-N recommendation set. We then compared the recommendations

    with the pageview (K +1) in the test session, with a match being considered a hit. We define the

    Hit Ratio as the total number of hits divided by the total number of user sessions in the test set.

    Note that the Hit Ratio increases as the value of N (number of recommendations) increases.

Thus, in our experiments, we pay special attention to smaller numbers of recommendations (between

1 and 20) that result in good hit ratios.
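The evaluation protocol just described can be sketched as follows; the recommender argument stands for any top-N recommendation function, for example the segment-based one sketched earlier, and the names are ours:

```python
def hit_ratio(test_sessions, recommender, k=3, top_n=10):
    """Fraction of test sessions whose pageview K+1 appears among the top-N
    recommendations generated from the session's first K pageviews."""
    hits, evaluated = 0, 0
    for session in test_sessions:        # each session: an ordered list of page ids
        if len(session) <= k:
            continue
        recs = recommender(session[:k], top_n)
        hits += int(session[k] in recs)
        evaluated += 1
    return hits / max(evaluated, 1)
```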


    In the first set of experiments we compare the WAVP values for the generated segments using

the PLSA model and those generated by the clustering approach. Figures 5 and 6 depict these results for the CTI and Realty data sets, respectively. In each case, the segments are ranked in decreasing order of WAVP. The results show clearly that the probabilistic segments based on the latent factors provide a significant advantage over the clustering approach. In the second

    set of experiments we compared the recommendation accuracy of the PLSA model with that of

k-means clustering segments. In each case, the recommendations are generated according to the


    recommendation algorithm presented in Section 3.2. The recommendation accuracy is measured

based on the hit ratio for different numbers of generated recommendations. These results are depicted

    in Figures 7 and 8 for the CTI and Realty data sets, respectively. Again, the results show a clear

    advantage for the PLSA model. In most realistic situations, we are interested in a small, but

    accurate, set of recommendations. Generally, a reasonable recommendation set might contain 5

to 10 recommendations. Indeed, this range of values seems to represent the largest improvements

    of the PLSA model over the clustering approach.

ODP: The Open Directory Project

    Description. The DMOZ Open Directory Project (ODP) [20] is the largest, most

    comprehensive human-edited web page catalog currently available. It covers 4 million sites filed

    into more than 590,000 categories (16 wide-spread top-categories, such as Arts, Computers,

News, Sports, etc.). Currently, there are more than 65,000 volunteer editors maintaining it. ODP's data structure is organized as a tree, where the categories are internal nodes and pages are

    leaf nodes. By using symbolic links, nodes can appear to have several parent nodes. Since ODP

    truly is free and open, everybody can contribute or re-use the dataset, which is available in RDF

(structure and content are available separately). Google, for example, uses ODP as the basis for its

    Google Directory service.

    Applications

    Besides its re-use in other directory services, the ODP taxonomy is used as a basis for various

    other research projects. In Persona, ODP is applied to enhance HITS with dynamic user profiles

    using a tree coloring technique (by keeping track of the number of times a user has visited

    pages of a specific category). Users can rate a page as being good or unrelated regarding

their interest. This data is then used to rank interesting results and omit unwanted ones. While Persona asks users for feedback, we only rely on user profiles, i.e., a one-time user interaction. Moreover, we do not develop our search algorithm on top of HITS, but on top of any search algorithm, as a refinement. A similar approach using the ODP taxonomy has also been applied to a recommender system for research papers. The Open Directory can also be used as a reference source containing

    good pages, to fight web spam containing uninteresting URLs through white listing, as a web

    corpus for comparisons of rank algorithms, as well as for focused crawling towards special-


    interest pages. Unfortunately, the free availability of ODP also has its downside. A clone of the

directory modified to contain some spam pages could trap people into linking to this fake directory,

    which results in an increased ranking not only for this directory clone, but also for the injected

    spam pages.

    Page Rank and Personalized Page Rank

    Page Rank computes Web page scores based on the graph inferred from the link structure of the

    Web. It is based on the idea that a page has high rank if the sum of the ranks of its back links is

    high. Given a page p, its input I(p) and output O(p) sets of links, the Page Rank formula is:
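Consistent with the random-surfer description that follows, the standard form of this formula is:

\[
\mathrm{PR}(p) \;=\; (1-c)\sum_{q \in I(p)}\frac{\mathrm{PR}(q)}{|O(q)|} \;+\; c\,E(p),
\]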

    The dampening factor c < 1 (usually 0.15) is necessary to guarantee convergence and to limit the

    effect of rank sinks [2]. Intuitively, a random surfer will follow an outgoing link from the current

page with probability (1 − c) and will get bored and select a random page with probability c (i.e.,

    the E vector has all entries equal to 1/N, where N is the number of pages in the Web graph).

Initial steps towards personalized page ranking were already described in early work, which proposed a slight modification of the above algorithm to redirect the random surfer towards preferred

    pages using the E vector. Several distributions for this vector have been proposed since.

    Topic-sensitive Page Rank

    Haveliwala builds a topic oriented Page Rank, starting by computing off-line a set of 16 Page-

    Rank vectors biased on each of the 16 main topics of the Open Directory Project. Then, the

    similarity between a user query and each of these topics is computed, and the 16 vectors are

combined using appropriate weights. Personalized Page Rank. A more recent investigation uses

    a different approach: it focuses on user profiles. One Personalized Page Rank Vector (PPV) is

    computed for each user. The personalization aspect of this algorithm stems from a set of hubs

(H), each user having to select her preferred pages from it. PPVs can be expressed as a linear

    combination of PPVs for preference vectors with a single non-zero entry corresponding to each


    of the pages from the preference set (called basis vectors). The advantage of this approach is that

for a hub set of N pages, one can compute 2^N Personalized Page Rank vectors without having to run the algorithm again, unlike the earlier approaches, where the whole computation must be performed for each

    biasing set. The disadvantages are forcing the users to select their preference set only from

    within a given group of pages (common to all users), as well as the relatively high computation

    time for large scale graphs.

    USING ODP METADATA FOR PERSONALIZED SEARCH

    Motivation. We presented in Section 2.2 the most popular approaches to personalizing Web

search. Even though they are the best so far, they all have some important drawbacks. In the basic personalized Page Rank approach, we need to run the entire algorithm for each preference set (or biasing set), which is practically impossible in a large-scale system. At the other end, topic-sensitive Page Rank computes biased PageRank vectors limited only to the broad 16 top-level categories of the ODP, because of the same problem. The hub-based Personalized Page Rank improves this somewhat, allowing the algorithm to bias on any subset of a given set of pages (H).

    Although work has been done in the direction of improving the quality of this latter set [4], one

    limitation is still that the preference set is restricted to a subset of this given set H (if H = {CNN,

    FOX News} we cannot bias on MSNBC for example). More importantly, the bigger H is, the

more time is needed to run the algorithm. (Note that hubs were defined here as pages with high Page Rank, differently from the more popular definition.) Thus, finding a simpler and faster algorithm with at least similar personalization granularity is still a worthy

    goal to pursue. In the following we make another step towards this goal. Introduction. Our first

    step was to evaluate how ODP search compares with Google search, specifically exploiting the

    fact that all ODP entries are categorized into the ODP topic hierarchy. We started with the

following two questions: 1. Given the fact that ODP includes just 4 million entries, and the

    Google database includes 8 billion, does ODP-based search stand a chance of being comparable

    to Google? 2. ODP advanced search offers a rudimentary personalized search feature by

    restricting the search to the entries of just one of the 16 main categories. Google directory offers

    a related feature, by offering to restrict search to a specific category or subcategory. Can we

    improve this personalized search feature, taking the user profile into account in a more


    sophisticated way, and how does such an enhanced personalized search on the ODP or Google

    entries compare to ordinary Google results? Most people would probably answer (1) No, not

    yet, and (2) Yes. In the following Section we will prove the correctness of the second answer

    by introducing a new personalized search algorithm, and then we will concentrate on the first

    answer in the experiments Section.

    Algorithm

Our algorithm exploits the annotations accumulated in generic large-scale taxonomies such

    as the Open Directory. Even though we concentrate our forthcoming discussion on ODP,

    practically any similar taxonomy can be used. These annotations can be easily used to achieve

    personalization, and can also be combined with the initial Page Rank algorithm. We define user

    profiles using a simple approach: each user has to select several topics from the ODP, which best

    fit her interests. For example, a user profile could look like this:

    Then, at run-time, the output given by a search service (from Google, ODP Search, etc.) is re-

    sorted using a calculated distance from the user profile to each output URL. The execution is

    also depicted in Algorithm 3.1.
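A minimal sketch of this re-sorting step is given below, assuming each result URL carries zero or more directory topics (as ODP and the Google Directory provide) and using the naive minimum tree distance discussed in the next subsection; the helper names and the tie-handling policy are ours:

```python
def tree_distance(topic_a, topic_b):
    """Naive ODP tree distance: edges from each topic up to the deepest common
    ancestor (the subsumer), e.g. '/Arts/Architecture' vs
    '/Arts/Design/Interior_Design/Events/Competitions' -> 1 + 4 = 5."""
    a, b = topic_a.strip('/').split('/'), topic_b.strip('/').split('/')
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)

def profile_distance(profile_topics, url_topics):
    """Minimum distance over the Cartesian product of the two topic sets."""
    return min(tree_distance(a, b) for a in profile_topics for b in url_topics)

def personalize(results, profile_topics):
    """Re-sort search-engine output by distance to the user profile.
    `results` is a list of (url, [topics]); un-annotated URLs keep their original
    order behind the annotated ones (one simple policy, not from the paper)."""
    annotated = [(u, t) for u, t in results if t]
    rest = [u for u, t in results if not t]
    annotated.sort(key=lambda r: profile_distance(profile_topics, r[1]))
    return [u for u, _ in annotated] + rest
```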


Distance Metrics. When performing search on Open Directory, each resulting URL comes with

    an associated ODP topic. Similarly, a good amount of the URLs output by Google is connected

    to one or more topics within the Google Directory (almost 50%, as discussed in Section 3.2).

    Therefore, in both cases, for each output URL we are dealing with two sets of nodes from the

    topic tree: (1) Those representing the user profile (set A), and (2) those associated with the URL

    (set B). The distance between these sets can then be defined as the minimum distance between

all pairs of nodes given by the Cartesian product A × B. Finally, there are quite a few

    possibilities to define the distance between two nodes. Even though, as we will see from the

    experiments, the simplest approaches already provide very good results, we are now performing

an optimality study to determine which metric best fits this kind of search. In the following, we will present our best solutions so far. Naive Distances. The simplest solution is the minimum

    tree distance, which, given two nodes a and b, returns the sum of the minimum number of tree

edges between a and the subsumer (the deepest node common to both a and b) plus the

    minimum number of tree edges between b and the subsumer (i.e., the shortest path between a and

    b). On the example from Figure 1, the distance between /Arts/Architecture and

    /Arts/Design/Interior Design/Events/Competitions is 5, and the subsumer is /Arts. If we also

    consider the inter-topic links from the Open Directory, the simplest distance becomes the graph


    shortest path between a and b. For example, if there is a link between Interior Design and

    Architecture in Figure 1, then the distance between Competitions and Architecture is 3. This

solution requires loading either the entire topic graph or all the inter-topic links into memory.

    Furthermore, its utility is subjective from user to user: the existence of a link between

    Architecture and Interior Design does not always imply that a famous architect (one level below

    in the tree) is very close to the area of interior design. We can consider these links in our metric

in three ways: 1. Consider the graph containing all inter-topic links and output the shortest path between a and b. 2. Consider the graph containing only the inter-topic links directly connected to a and b and output the shortest path. 3. If there is an inter-topic link between a and b, output 1; otherwise, ignore all inter-topic links and output the tree distance between a and b. (We refer the reader to the above-mentioned optimality study for an in-depth view of the approach we took.) Complex Distances. The

    main drawback of the above metrics comes from the fact that they ignore the depth of the

    subsumer. The bigger this depth is, the more related are the nodes (i.e., the concepts represented

by them). This problem is addressed by work which investigates ten intuitive strategies for measuring semantic similarity between words using hierarchical semantic knowledge bases such as WordNet [18]. Each of them was evaluated experimentally on a group of testers, the best one having a

    0.9015 correlation between the human judgment and the following formula:
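This appears to be the well-known WordNet-based word-similarity measure of Li, Bandar and McLean; reconstructed here, with α and β the two tuning parameters mentioned next, it reads:

\[
\mathrm{sim}(w_1, w_2) \;=\; e^{-\alpha l}\cdot\frac{e^{\beta h}-e^{-\beta h}}{e^{\beta h}+e^{-\beta h}},
\]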

    The parameters are as follows:

α and β were defined as 0.2 and 0.6 respectively, h is the tree depth of the subsumer, and l is the

    semantic path length between the two words. Considering we have several words attached to

    each concept and sub-concept, then l is 0 if the two words are in the same concept, 1 if they are

    in different concepts, but the two concepts have at least one common word, or the tree shortest

path if the words are in different concepts which do not contain common words. Although this measure is very good for words, it is not perfect when we apply it to the Open Directory topical tree, because it does not distinguish between the distance from a (the profile node) to the

    subsumer, and the distance from b (the output URL) to the subsumer. Consider node a to be

    /Top/Games and b to be /Top/Computers/Hardware/Components/Processors/x86. A teenager

    interested in computer games (level 2 in the ODP tree) could be very satisfied receiving a page


    about new processors (level 6 in the tree) which might increase his gaming quality. On the other

    hand, the opposite scenario (profile on level 6 and output URL on level 2) does not hold any

    more, at least not to the same extent: a processor manufacturer will generally be less interested in

    the games existing on the market. This leads to our following extension of the above formula:

with l1 being the shortest path from the profile to the subsumer, l2 the shortest path from the URL to the subsumer, and a weighting parameter in [0, 1]. Combining the Distance Function with Google

    Page Rank. And yet something is still missing. If we use Google to do the search and then sort

    the URLs according to the Google Directory taxonomy, some high quality pages might be

    missed (i.e., those which are top ranked, but which are not in the directory). In order to integrate

    that, the above formula could be combined with the Google Page Rank. We propose the

    following approach:

    Conclusion. Human judgment is a non-linear process over information sources, and therefore it

    is very difficult (if not impossible) to propose a metric which is in perfect correlation to it. A

    thorough experimental analysis of all these metrics (which we are currently performing, but

    which is outside the scope of this paper) could give us a good enough approximation. In the next

    Section we will present some experiments using the simple metric presented first, and show that

    it already yields quite reasonable improvements.


    Experimental Results

    To evaluate the benefits of our personalization algorithm, we interviewed 17 of our colleagues

    (researchers in different computer science areas, psychologists, pedagogues and designers),

    asking each of them to define a user profile according to the Open Directory topics (see Section

3.1 for an example profile), as well as to choose three queries of the following types:

One clear query, which they knew to have one or at most two meanings;

One relatively ambiguous query, which they knew to have two or three meanings;

One ambiguous query, which they knew to have at least three meanings, preferably more.

We then compared test results using the following four types of Web search:

1. Plain Open Directory Search;

2. Personalized Open Directory Search, using our algorithm from Section 3.1 to reorder the top 1000 results returned by the ODP Search;

3. Google Search, as returned by the Google API [8];

4. Personalized Google Search, using our algorithm from Section 3.1 to reorder the top 100 URLs returned by the Google API, and having as input the Google Directory topics returned by the API for each resulting URL.

For each algorithm, each tester received the top 5 URLs with

    respect to each type of query, 15 URLs in total. All test data was shuffled, such that testers were

    neither aware of the algorithm, nor of the ranking of each assessed URL. We then asked the

    subjects to rate each URL from 1 to 5, 1 defining a very poor result with respect to their profile

and expectations (e.g., topic of the result, content, etc.) and 5 a very good one. Finally, for each sub-set of 5 URLs we took the average grade as a measure of importance attributed to that algorithm/query-type pair.

The average values for all users and for each of these pairs can be found in Table 1,

    together with the averages over all types of queries for each algorithm. We of course expected

    the plain ODP search to be significantly worse than the Google search, and that was the case:

    an average of 2.41 points for ODP versus the 2.76 average received by Google. Also predictable

was the dependence of the grading on the query type. If we average the values on the three columns representing each query type, we get 2.54 points for ambiguous queries, 2.91 for semi-ambiguous ones and 3.25 for clear ones; thus, the clearer the query, the better rated the returned URLs. Personalized Search using ODP. But the same Table 1 also provides us with a more surprising result: the personalized search algorithm is clearly better than Google search, regardless of whether we use the Open Directory or the Google Directory as taxonomy. Therefore, a


    personalized search on a well-selected set of 4 million pages often provides better results than a

non-personalized one over an 8-billion-page set. This is a clear indicator that taxonomy-based result

    sorting is indeed very useful. For the ODP experiments, only our clear queries did not receive a

    big improvement, mainly because for some of

    these queries ODP contains less than 5 URLs matching both the query and the topics expressed

    in the user profile. Personalized Search using Google. Similarly, personalized search using

    Google Directory was far better than the usual Google search. We would have expected it to be

    even better than the ODP based personalized search, but results were probably negatively

    influenced by the fact that the ODP experiments were run on 1000 results, whereas the Google

    Directory ones only on 100, due to the limited number of Google API licenses we had. The

    grading results are summarized in Figure 2. Generally, we can conclude that personalization

    significantly increases output quality for ambiguous and semi-ambiguous queries. For clear

    queries, one should prefer Google to Open Directory search, but also Google Directory search to


the plain Google search. Also, the answers we sketched in the beginning of this Section proved to be true: Google search is still better than Open Directory search, but we have provided a personalized search algorithm which outperforms the existing Google and Open Directory search capabilities. Another interesting result is that 40.98% of the top 100 Google pages were also contained in the Google Directory. More specifically, for the ambiguous queries 48.35% of the top pages were in the directory, for the semi-ambiguous ones 41.35%, and for the clear ones 33.23%.

Finally, let us add that we performed statistical significance tests on our experiments, obtaining the following results: statistical significance with an error rate below 1% for the algorithm criterion, i.e., there is a significant difference between the gradings of the different algorithms; an error rate below 25% for the query type criterion, i.e., the difference between the average grades with respect to query types is less statistically significant; and statistical significance with an error rate below 5% for the interrelation between query type and algorithm.

    EXTENDING ODP ANNOTATIONS TO THE WEB

In the last Section we have shown that using ODP entries and their categorization directly for personalized search works surprisingly well. Can this huge annotation effort invested in the ODP project (with 65,000 volunteers participating in building and maintaining the ODP database) be extended to the rest of the Web? This would be useful if we want to find less highly rated pages that are not contained in the directory. Simply extending the ODP effort does not scale: first, significantly increasing the number of volunteers seems improbable, and second, extending the selection of ODP entries to a larger percentage of the Web obviously becomes harder and less rewarding once we try to include more than just the most important pages for a specific topic. We start


with the following questions: Given that Page Rank for a large collection of Web pages can be biased towards a smaller subset, can this be done with sets of ODP entries corresponding to given categories or subcategories as well? More specifically, the ODP entries for a category include many of the most important pages on that topic; do we have enough entries for each topic such that biasing on these entries makes a difference?

    When does biasing make a difference?

One of the most important works investigating Page Rank biasing first uses the 16 top levels of the ODP to bias Page Rank on, and then provides a method to combine these 16 resulting vectors into a more query-dependent ranking. But what if we would like to use one or several ODP (sub-)topics to compute a personalized Page Rank vector? More generally, what if we would like to achieve such a personalization by biasing Page Rank towards some generic subset of pages from the current Web crawl we have? Many authors have used such biasing in their algorithms, yet none have studied the boundaries of this personalization, i.e., the characteristics the biasing set has to exhibit in order to obtain relevant results (rankings which are different enough from the non-biased Page Rank). We will investigate this in the current Section. Once these boundaries are defined, we will use them to evaluate (some of) the biasing sets available from ODP in Section 4.2.

First, let us establish a characteristic function for biasing sets, which we will use as the parameter determining the effectiveness of biasing. Pages in the World Wide Web can be characterized in quite a few ways.

The simplest of them is the out-degree (i.e., the total number of out-going links), based on the observation that if biasing is targeted at such a page, the newly achieved increase in Page Rank score will be passed forward to all its out-neighbors (the pages to which it points). A more sophisticated version of this measure is the hub value of pages. Hubs were initially defined in the context of the HITS algorithm and are pages pointing to many other high quality pages. Reciprocally, high quality pages pointed to by many hubs are called authorities. There are several algorithms for calculating this measure, the most common ones being HITS and its more stable improvements, SALSA and Randomized HITS. Yet biasing on better hub pages will have less influence on the rankings, because the vote a page gives is propagated to its out-neighbors divided by its out-degree. Moreover, there is also an intuitive reason against this measure: Page Rank biasing is usually performed to achieve some degree of personalization, and people tend to prefer highly valued


    authorities to highly valued hubs. Therefore, a more natural measure is an authority-based one,

    such as the non-biased Page Rank score of a page. Even though most of the biasing sets consist

    of high Page Rank pages, in order to make this analysis complete we have run our experiments

    on different choices for these sets, each of which must be tested with different sizes. For

comparison to Page Rank, we used two measures of similarity between the non-biased Page Rank and each resulting biased vector of ranks. They are defined as follows:

1. OSim indicates the degree of overlap between the top n elements of two ranked lists $\tau_1$ and $\tau_2$. It is defined as

$$OSim(\tau_1, \tau_2) = \frac{|Top_n(\tau_1) \cap Top_n(\tau_2)|}{n}$$

2. KSim is a variant of Kendall's $\tau$ distance measure. Unlike OSim, it measures the degree of agreement between the two ranked lists. If $U$ is the union of the items in $Top_n(\tau_1)$ and $Top_n(\tau_2)$, and $\delta_1$ is $U \setminus Top_n(\tau_1)$, then let $\tau_1'$ be the extension of $\tau_1$ containing the items of $\delta_1$ appearing after all items in $\tau_1$. Similarly, $\tau_2'$ is defined as an extension of $\tau_2$. Using these notations, KSim is defined as

$$KSim(\tau_1, \tau_2) = \frac{|\{(u,v) : \tau_1' \text{ and } \tau_2' \text{ agree on the order of } (u,v),\ u \neq v\}|}{|U| \cdot (|U| - 1)}$$

Even though prior work used n = 20, we chose n = 100, after experimenting with both values and obtaining more stable results with the latter. A general study of different similarity measures for ranked lists can be found in the literature.
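The two measures can be computed directly from these definitions. Below is a minimal sketch; the handling of pairs for which one list expresses no preference (both items fell outside its top n) is our own convention, since the definition leaves their mutual order unspecified.

```python
from itertools import combinations

def osim(tau1, tau2, n=100):
    """OSim: overlap between the top-n entries of two rankings, as a fraction of n."""
    return len(set(tau1[:n]) & set(tau2[:n])) / n

def ksim(tau1, tau2, n=100):
    """KSim: Kendall-style agreement over the union of the two top-n lists.

    Items of the union missing from one list are treated as appended after all
    of that list's ranked items. Pairs on which one list has no preference
    cannot disagree and are counted as agreements (a convention of this sketch).
    """
    top1, top2 = tau1[:n], tau2[:n]
    union = set(top1) | set(top2)
    rank1 = {item: i for i, item in enumerate(top1)}
    rank2 = {item: i for i, item in enumerate(top2)}
    agree = total = 0
    for u, v in combinations(union, 2):
        d1 = rank1.get(u, n) - rank1.get(v, n)
        d2 = rank2.get(u, n) - rank2.get(v, n)
        total += 1
        if d1 * d2 >= 0:  # same relative order, or no preference in one list
            agree += 1
    return agree / total if total else 1.0

# Example: two rankings of page ids, compared on their top 4 entries
print(osim([1, 2, 3, 4], [2, 1, 5, 6], n=4))  # 0.5
print(ksim([1, 2, 3, 4], [2, 1, 5, 6], n=4))
```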

Let us start by analyzing biasing on high quality pages (i.e., pages with a high Page Rank). We consider the most common such set to contain pages in the range [0, 10]% of the sorted list of Page Rank scores. We varied the sum of scores within this set between 0.00005% and 10% of the total sum over all pages (for simplicity, we will call this value TOT hereafter). For very small sets, the biasing produced an output only somewhat different from the original: about 38% Kendall similarity (see Figure 3). The same happened for large sets, especially those above 1% of TOT. Finally, the graph also makes clear where we would get the most different rankings from the non-biased ones.
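To make this setup concrete, the sketch below shows the two ingredients involved: computing TOT for a candidate biasing set (read here as the set's share of the non-biased Page Rank mass, as described above) and biasing Page Rank by concentrating the teleportation vector on that set. This is a generic power-iteration sketch under simplifying assumptions (uniform weights within the biasing set, no special treatment of dangling pages), not the exact implementation used in these experiments.

```python
def tot(biasing_set, pagerank):
    """TOT: percentage of the total non-biased Page Rank mass held by the biasing set."""
    total = sum(pagerank.values())
    return 100.0 * sum(pagerank.get(p, 0.0) for p in biasing_set) / total

def biased_pagerank(out_links, biasing_set, alpha=0.85, iterations=50):
    """Power iteration with the teleportation vector restricted to the biasing set."""
    pages = list(out_links)
    teleport = {p: (1.0 / len(biasing_set) if p in biasing_set else 0.0) for p in pages}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - alpha) * teleport[p] for p in pages}
        for p, targets in out_links.items():
            if targets:
                share = alpha * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] = new_rank.get(q, 0.0) + share
        rank = new_rank
    return rank

# Tiny illustrative graph, given as adjacency lists of out-links
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
biased = biased_pagerank(graph, biasing_set={"a"})
print(sorted(biased.items()))
```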


One might wish to bias only on the best pages (the top [0, 2]%, as in Figure 4). In this case, the above results would only be shifted slightly to the right on the x-axis of the graph,


i.e., the highest differences would be achieved for a set size from 0.02% to 0.75%. This was to be expected, as all the pages in the biasing set were already top ranked, and it would therefore take somewhat more effort to produce a different output with such a set. Another possible input set consists of randomly selected pages (Figure 5). Such a set most probably contains many low Page Rank pages. This is why, although the biased ranks are very different for low TOT values, they become extremely similar (up to almost identical) once TOT exceeds 0.01% (it would take a lot of low Page Rank pages to accumulate a TOT value of, for example, 1% of the overall sum of scores). The extreme case is to bias only on low Page Rank pages (Figure 6). In this case, the biasing set contains too many pages even sooner, around TOT = 0.001%. The last experiment is mostly theoretical: one would expect to obtain the smallest similarities to the non-biased rankings when using a biasing set from the [2, 5]% range, because these pages are already close to the top, and biasing on them would have the best chances to overturn the list. Experimental results support this intuition (Fig


The graphs above were initially generated based on a crawl of 3 million pages. Once all of them had been finalized, we selectively ran similar experiments on the Stanford WebBase crawl, obtaining similar results. For example, a biasing set of size TOT = 1% containing randomly selected pages produced rankings with a Kendall similarity of 0.622 to the non-biased ones, whereas a set of TOT = 0.0005% produced a similarity of only 0.137. This was necessary in order to show that the graphs discussed above are not influenced by the crawl size. Even so, the limits they establish are not entirely precise, because of the random or targeted-random selection (e.g., towards the top [0, 2]% pages) of our experimental biasing sets.

    Is biasing possible in the ODP context?

    The URLs collected in the Open Directory are manually added Web pages supposed to

    (1) cover the specific topic of the ODP tree leaf they belong to and


(2) be of high quality. Neither requirement is fully satisfied. Sometimes (though rarely) the pages do not really represent the topic under which they were added. More importantly for Page Rank biasing, they usually cover a large interval of page ranks, which made us opt for the random biasing model. However, we are aware that in this case the human editors chose far more high quality pages than low quality ones, and thus the conclusions of the analysis are somewhat susceptible to error. Generally, according to the random model of biasing, every set with TOT below 0.015% is suitable for biasing. According to this, all possible biasing sets analyzed in Tables 3, 4 and 5 would generate a sufficiently different Page Rank vector. We can therefore conclude that biasing is (most probably) possible on all subsets of the Stanford Open Directory crawl.
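As a small illustration of this criterion, a candidate biasing set (e.g., the URLs of one ODP category) could be checked against the 0.015% threshold roughly as follows; the scores and names are made up for the example.

```python
def good_for_biasing(candidate_urls, pagerank, threshold_percent=0.015):
    """True if the set's share of the total non-biased Page Rank mass
    (TOT, in percent) stays below the threshold from the random biasing model."""
    total = sum(pagerank.values())
    tot_percent = 100.0 * sum(pagerank.get(url, 0.0) for url in candidate_urls) / total
    return tot_percent < threshold_percent

scores = {"u1": 1e-7, "u2": 2e-7, "u3": 0.9, "u4": 0.1}  # fabricated example scores
print(good_for_biasing({"u1", "u2"}, scores))  # True
```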


    Web usage mining has been extensively used in order to analyze web log data. There exist

    various methods based on data mining algorithms and probabilistic models. The related literature

is very extensive and many of these approaches fall outside the scope of this paper; for more information, the reader may refer to the relevant surveys. There exist many approaches for discovering sequences of

    visits in a web site. Some of them are based on data mining techniques, whereas others use

probabilistic models, such as Markov models, in order to model the users' visits. Such approaches

    aim at identifying representative trends and browsing patterns describing the activity in a web

    site and can assist the web site administrators to redesign or customize the web site, or improve

    the performance of their systems. They do not, however, propose any methods for personalizing

    the web sites. There exist some approaches that use the aforementioned techniques in order to

    personalize a web site. Contrary to our approach, these approaches do not distinguish between

    different users or user groups in order to perform the personalization. Thus, the methods that

seem to be most relevant to ours, in terms of identifying different interest groups and personalizing

    the web site based on these profiles, are those that are based on collaborative filtering.

    Collaborative filtering systems are used for generating recommendations and have been broadly

used in e-commerce. Such systems are based on the assumption that users with common interests exhibit similar searching/browsing behavior. Thus, the identification of similar

    user profiles enables the filtering of relevant information and the generation of

    recommendations.

    Similar to such approaches, we also identify users with common interests and use this

    information to personalize the topic directory. In our work, however, we do not model the user

    profiles as vectors in order to find similar users. Instead, we use clustering to group users into

    interest groups. Moreover, we propose the use of sequential pattern mining in order to generate

recommendations. Thus, we also capture the sequential dependencies within users' visits,

    whereas this is not the case with collaborative filtering systems. All of the aforementioned

    approaches aim at personalizing generic web sites. Our approach focuses on the personalization

of a specific type of web site, namely topic directories. Since topic directories organize web

    content into meaningful categories, we can regard them as a form of digital library or portal. In

    this context, we also overview here some approaches for personalizing digital libraries and web


portals. Some early approaches were based on explicit user input, and the personalization services they provided were limited to simplified search functionalities or alerting services. Later approaches propose the semi-automatic generation of user recommendations based on implicit user input. In those approaches, information is extracted from user accesses to the digital library (DL) resources, and is then used for further

retrieval or filtering. As already mentioned, our approach does not limit its personalization services to identifying the preferences of each individual user alone. Rather, we identify user

    groups with common interests and behavior expressed by visits to certain categories and

    information resources. This is enabled by approaches that are based on collaborative filtering.

Those approaches, however, fail to capture the sequential dependencies between users' visits,

    as discussed previously.

    MODELLING TOPIC DIRECTORIES

    A topic directory is a hierarchical organization of thematic categories. Each category contains

resources (i.e., links to web pages). A category may have subcategories and/or related

    categories. Subcategories narrow the content of broad categories. Related categories contain

    similar resources, but they may exist in different places of the directory. Note that the related

    relationship is bidirectional, that is, if category N is related to M, then M is also related to N. A

resource cannot belong to more than one category. We consider a graph representation of topic directories.

Definition 3.1. A topic directory D is a labelled graph G(V,E), where V is the set of nodes and E the set of edges, such that: (a) each node in V corresponds to a category of D and is labelled by the category name; (b) for each pair of nodes (n,m) corresponding to categories (N,M), where N is a subcategory of M in D, there is a directed edge from m to n; and (c) for each pair of nodes (n,m) corresponding to categories (N,M), where N and M are related categories in D, there is a bidirected edge between n and m.

The graph G(V,E) may also have shortcuts, which are directed edges connecting nodes in V. Examples of such graphs are illustrated in Figure 4. The role of shortcuts as a means for personalizing the directory will be further discussed in Section 5.
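To make Definition 3.1 concrete, a topic directory graph with subcategory edges, bidirectional related-category edges and shortcut edges could be represented along the following lines; the class and method names are illustrative, not taken from the paper.

```python
from collections import defaultdict

class TopicDirectory:
    """Graph model of a topic directory in the spirit of Definition 3.1 (sketch)."""

    def __init__(self):
        self.subcategories = defaultdict(set)  # directed edges: parent -> children
        self.related = defaultdict(set)        # bidirected edges between categories
        self.shortcuts = defaultdict(set)      # directed personalization shortcuts
        self.resources = defaultdict(set)      # category -> resource links
                                               # (the one-category-per-resource rule is not enforced here)

    def add_subcategory(self, parent, child):
        self.subcategories[parent].add(child)

    def add_related(self, a, b):
        # the "related" relationship is bidirectional
        self.related[a].add(b)
        self.related[b].add(a)

    def add_shortcut(self, source, target):
        self.shortcuts[source].add(target)

    def add_resource(self, category, url):
        self.resources[category].add(url)

# Example fragment
d = TopicDirectory()
d.add_subcategory("Arts", "Arts/Music")
d.add_related("Arts/Music", "Shopping/Music")
d.add_shortcut("Arts", "Arts/Music/Styles")
d.add_resource("Arts/Music", "http://example.org/")
```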

The case study of the Open Directory Project. In our work, we use the Open Directory Project (ODP) as a case study. Figure 1 illustrates a part of the ODP directory.


In ODP, there are three types of categories: (a) subcategories (to narrow the content of broad categories), (b) relevant categories (i.e., the ones appearing inside the "see also" section), and (c) symbolic categories (denoted by the @ character after the category's name). Symbolic categories are subcategories that exist in different places of the directory. We consider relevant categories as related categories, according to Definition 3.1.

    Navigation patterns.

To represent the navigation behaviour of users when browsing the directory, we use the notion of navigation patterns. A navigation pattern is the sequence of categories visited by a user during a session. We note that such patterns may include multiple occurrences of the same categories; this can be the result of users going back and forth within a path in the directory. Finally, we also underline that during a session, a user may pursue more than one topic interest.
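For illustration, navigation patterns of this kind could be extracted from directory access logs roughly as follows; the log record layout and the session field are assumptions, since the actual log format is not shown here.

```python
from collections import defaultdict

# Each log record: (session id, timestamp, visited category) -- an assumed layout.
log = [
    ("s1", 1, "Arts"),
    ("s1", 2, "Arts/Music"),
    ("s1", 3, "Arts"),             # going back keeps the repeated category
    ("s2", 1, "Computers"),
    ("s2", 2, "Computers/Internet"),
]

def navigation_patterns(log):
    """Per-session sequences of visited categories, ordered by timestamp."""
    sessions = defaultdict(list)
    for session_id, timestamp, category in log:
        sessions[session_id].append((timestamp, category))
    return {
        session_id: [category for _, category in sorted(visits)]
        for session_id, visits in sessions.items()
    }

print(navigation_patterns(log))
# {'s1': ['Arts', 'Arts/Music', 'Arts'], 's2': ['Computers', 'Computers/Internet']}
```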