Personalizing the Web Directories



    Personalizing Web Directories with the Aid of Web Usage Data

    Literature Survey:

    Computational intelligence models for Personalization

CI has been defined as the study of adaptive mechanisms to enable or facilitate intelligent behavior in complex and changing environments. It is an ongoing and evolving area of research whose roots go back to the founding of artificial intelligence, the term coined by John McCarthy in 1956. Different CI models related to personalization are given in Figure 1.

Fuzzy Systems (FS) and Fuzzy Logic (FL) mimic the way people think, that is, with approximate reasoning rather than precise values. Fuzzy methods were found to be instrumental in web-based personalization when used with Web Usage Mining (WUM) data. User profiles are processed using fuzzy approximate reasoning to recommend personalized URLs. Handling of user profiles with fuzzy concepts has been used by IR systems to provide users with personalized search engine results. Based on users' web usage history data, fuzzy methods have been used to categorize or cluster web objects for web personalization. Fuzzy logic was used with collective or collaborative data mining techniques to improve the quality of intelligent agents that provide personalized services to users.

    Evolutionary Algorithms (EA) use mechanisms inspired by biological evolution such as

reproduction, mutation, recombination and selection. One of the most popular EAs is the Genetic Algorithm (GA), which mimics the gene structure in humans based on evolutionary theory. GA

    has been used to address some of the flaws of WUM and to tackle different problems such as


    personalized search, IR, query optimization and document representation. GA was applied with

    user log mining techniques to get a better understanding of user preferences and discover

associations between different URL addresses. GA was also used to include randomness in content filtering rather than strict adherence to predefined user profiles; this is known as the element of serendipity in IR. A modified GA was introduced for the optimal design of a website based on multiple optimization criteria, taking download time, visualization and product association level

    into consideration. Artificial Neural Networks (ANN) or simply Neural Networks (NN) mimic

    the biological process of the human brain. A NN can be trained to group users into specified

    categories or into clusters.

    This is useful in personalization as each user group may possess similar preferences and

    hence the content of a web interface can be adapted to each group. NNs can also be trained to

learn the behavior of website users. Inputs for this learning can be derived from WUM data and Collaborative Filtering (CF) techniques. The learning ability of neural networks can also be used for real-time adaptive interaction instead of only common, static content-based personalization. A NN was used to construct user profiles, and a NN was implemented to categorize e-mail into folders. Swarm Intelligence

    (SI) is based on the collective behavior of animals in nature such as birds, ants, bees and wasps.

    Particle Swarm Optimization (PSO) models the convergence behavior of a flock of birds. PSO

was used to analyze the unique behavior of web users through manipulation of web access log data and

    user profile data. Personalized recommendation based on individual user preferences or CF data

    has also been explored using PSO. This was done by building up profiles of users and then using

    an algorithm to find profiles similar to the current user by supervised learning. Personalized and

    automatic content sequencing of learning objects was implemented using PSO. Research has also

    been done using PSO as a clustering algorithm but no use of this approach to clustering was

    found in relation to website personalization.

    Another SI technique is Ant Colony Optimization (ACO) which models the behavior of

ants that leave the nest to wander randomly in search of food; when food is found, they leave a trail of pheromone when returning to the colony. ACO led to the development of shortest-path optimization algorithms and has applications in routing optimization. ACO has

    been used to classify web users in WUM (cAnt-WUM algorithm) allowing personalization of the

    web system to each user class. Bees Colony Optimization (BCO) is built on basic principles of


    collective bee intelligence. It has been applied to web-based systems to improve the IR systems

of search engines incorporating WUM data; however, the issue of personalization is not yet known to have been directly addressed. Wasp Colony Optimization (WCO) or Wasp Swarm Optimization (WSO) has not been exploited as much as the other SI methods. It models

    the behavior of insect wasps in nature. WCO has also been applied to the NP-hard optimization

    problem known as the Multiple Recommendations Problem (MRP). It occurs when several

personalized recommenders are running simultaneously, which results in churning, where a user

    is presented with uninteresting recommendations. Further research has to be done however, using

    WCO on real, scalable and dynamic data sets. Artificial Immune Systems (AIS) mimic the

    functioning of the human immune system as the body learns to handle antigens by producing

antibodies based on previous experience. Applications of AIS include pattern recognition, classification tasks, data clustering and anomaly detection. AIS has already been applied to personalization of web-based systems: the human body is represented by a website, incoming web requests are antigens, and learning is paralleled to the immune system's learning to produce the right antibodies to combat each antigen. Using this analogy, an AIS based on WUM was used as a learning system for a website. It is common practice to

    combine CI techniques to create a hybrid which seeks to overcome the weakness of one

    technique with the strength of another. Several hybrids were applied to personalization of web

based systems. NN was combined with FL to give a hybrid Neuro-Fuzzy strategy for Web

    personalization. The topology and parameters of NN were used to obtain the structure and

    parameters of fuzzy rules.

    The learning ability of NN was then applied to this set of rules. The ability of

    evolutionary techniques such as GA, to extract implicit information from user logs was

    combined with fuzzy techniques to include vagueness in decision making. This FL-GA hybrid

allows more accurate and flexible modeling of user preferences. User data obtained from web usage logs is the input for a NN. The weights and fitness functions derived from NN training are optimized using GA to derive classification rules that govern personalized decision making in e-

    Business. A fuzzy-PSO approach was introduced to personalize Content Based Image Retrieval

    (CBIR). User logs were analyzed and used as the PSO input. Fuzzy principles were applied to

    the PSO velocity, position and weight parameters.


    Personalization of web-based systems using CI models

    Based on the eight major CI methods described above, it is noticed that WUM is the common

    input for all models. Data mining in a sense provides the fuel for personalization using CI

methods. CI methods are comparable to a taxonomy of intelligent agents for personalization. Building on ideas from this approach, a taxonomy for personalization of web-based systems was proposed (cf. Fig. 2). Two main uses are identified for CI methods when applied to

    personalization: profile generation and profile exploitation. User profiles can further be used to

    personalize either the navigation or content of web based systems.

    Profile generation

    Profile generation is the creation of user profiles based on both implicit WUM data and explicit

user preferences. User profiles can be generated either per individual user or for groups of users that appear to have similar previous web usage habits, using CF techniques. Five CI methods found in

    previous work which were applied to user profile generation of web based systems are: FL, NN,

PSO, ACO and AIS. FL models are constructed to identify ambiguity in user preferences; however, there are many ways of interpreting fuzzy rules, and translating human knowledge into formal controls can be challenging. NN was trained to identify similarities in user behavior; however, for proper training the sample size must be large, and the NN can become complex due to overfitting. Both PSO and GA were used to link users' behavior by profile matching, but PSO

    was found to outperform GA in terms of speed, execution and accuracy. ACO was used to model

users with relative accuracy and simplicity; however, its computational complexity leads to long computing times. The PSO approach was found to be faster when compared to ACO. AIS was used to dynamically adapt profiles to changing and new behaviors. The theoretical concept of AIS is not fully sound, however, since in reality other human systems support the functioning of the immune system and these are not modeled. The artificial cells in AIS do not work autonomously; therefore the success or failure of one part of the system may determine the performance of the following

    step.


    A hybrid method uses GA to optimize the input values of a NN, to maximize the output. In this

    way the slow learning process of NN is helped with the optimization ability of GA.

    Profile exploitation

Profile exploitation personalizes various aspects of a web-based system based on predefined user

    profiles. Two main approaches to personalize web based systems were identified as

    personalization of navigation and personalization of content (cf. fig.2).

    Personalized navigation

    Personalized navigation includes WUM for personalized IR, such as search engine results, and

    URL recommendations. FL, BCO and GA were three main CI methods found for navigation

    personalization (cf. fig.2). FL was used for offline processing to recommend URLs to users. It is

relatively fast, deals with the natural overlap in user interests and is suitable for real-time recommendations. Testing of various FL approaches, however, showed slightly lower precision, and the fuzzy component is harder to program. GA was applied for search and retrieval, but it is known to be more

    general and abstract than other optimization methods and does not always provide the optimal

    solution. BCO was used for IR but it is not a widely covered area of research and currently there

    is a better theoretical than experimental understanding. ACO is similar to BCO and has seen

    more successful applications. A hybrid between GA and FL was applied to this area. Fuzzy set

    techniques were used for better document modeling and genetic algorithms for query

    optimization to give personalized search engine results. A Neuro-Fuzzy method combined the


    learning ability of NN with the representation of vagueness in Fuzzy Systems to overcome the

    NN black-box behavior and present more meaningful results than FL alone.

    Personalized content

    Personalized content refers to WUM for personalized web objects on each web page and

    sequence of content. FL, NN, GA, PSO and WCO were the main CI techniques found with

    applications in this area (cf. fig.2). FL was used for a web search algorithm and to automate

recommendations to e-commerce customers. It was found to be flexible and able to support e-commerce applications. NN was used to group users into clusters for content recommendations; however, the overfitting problem still exists. GA was applied to devise the best arrangement of web objects. It was found to be scalable; however, it is suggested to be used in collaboration with

    other data mining tools. PSO was used to sequence Learning Objects and was chosen because of

its relatively small number of parameters compared with other techniques such as GA. PSO parameter selection is also a well-researched area. Using a modified PSO for data clustering was found to give accurate results. WCO was applied to the churning problem of uninteresting content recommendations to users. This is mostly a theoretical concept, not well tested on real data, and other biologically inspired algorithms such as ACO have found more success. Fuzzy-PSO was

    created to help improve the effectiveness of standard PSO particle movement in a content based

    system.

    PROBABILISTIC LATENT SEMANTIC MODELS OF WEB USER NAVIGATIONS

The overall process of Web usage mining consists of three phases: data preparation and transformation, pattern discovery, and pattern analysis. The data preparation phase transforms raw Web log data into transaction data that can be processed by various data mining tasks. In the pattern discovery phase, a variety of data mining techniques, such as clustering, association rule mining, and sequential pattern discovery, can be applied to the transaction data. The discovered patterns may then be analyzed and interpreted for use in such applications as Web personalization. The usage data preprocessing phase [8, 32] results in a set of n page views, P = {p1, p2, . . . , pn}, and a set of m user sessions, U = {u1, u2, . . . , um}. A page view is an aggregate representation of a collection of Web objects (e.g. pages) contributing to the display on a user's browser resulting from a single user action (such as a click-through, product purchase, or database query). The Web session data can be conceptually viewed as an m × n session-pageview


matrix UP = [w(ui, pj)]m×n, where w(ui, pj) represents the weight of page view pj

    in a user session ui. The weights can be binary, representing the existence or non-existence of the

    page view in the session, or they may be a function of the occurrence or duration of the page

view in that session. PLSA is a latent variable model which associates a hidden (unobserved) factor variable Z = {z1, z2, ..., zl} with the observations in the co-occurrence data. In our context, each observation corresponds to an access by a user to a Web resource in a particular session, which is represented as an entry of the m × n co-occurrence matrix UP. The probabilistic latent

    factor model can be described as the following generative model:

1. Select a user session ui from U with probability Pr(ui);

2. Pick a latent factor zk with probability Pr(zk|ui);

3. Generate a page view pj from P with probability Pr(pj|zk).

    As a result we obtain an observed pair (ui, pj ), while the latent factor variable zk is

    discarded. Translating this process into a joint probability model results in the following:
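The displayed equation is not reproduced here; in the notation just introduced, the standard PLSA joint probability is presumably:

\[
\Pr(u_i, p_j) \;=\; \Pr(u_i)\sum_{k=1}^{l}\Pr(z_k \mid u_i)\,\Pr(p_j \mid z_k),
\]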

where the sum runs over all possible choices of zk from which the observation could have been generated.

    Using Bayes rule, it is straightforward to transform the joint probability into:
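Again in the standard symmetric PLSA form (reconstructed, not quoted verbatim):

\[
\Pr(u_i, p_j) \;=\; \sum_{k=1}^{l}\Pr(z_k)\,\Pr(u_i \mid z_k)\,\Pr(p_j \mid z_k).
\]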

    Now, in order to explain a set of observations (U, P), we need to estimate the parameters Pr(zk),

    Pr(ui|zk), Pr(pj |zk), while maximizing the following likelihood L(U, P) of the observations,
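In its usual form, reconstructed from the definitions above with w(ui, pj) the entries of UP, this likelihood is:

\[
\mathcal{L}(U, P) \;=\; \sum_{i=1}^{m}\sum_{j=1}^{n} w(u_i, p_j)\,\log \Pr(u_i, p_j).
\]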


The Expectation-Maximization (EM) algorithm is a well-known approach to performing maximum

    likelihood parameter estimation in latent variable models. It alternates two steps:

    (1) an expectation (E) step where posterior probabilities are computed for latent variables, based

    on the current estimates of the parameters,

(2) a maximization (M) step, where the parameters are re-estimated in order to maximize the expectation of

    the complete data likelihood. The EM algorithm begins with some initial values of Pr(zk),

    Pr(ui|zk), and Pr(pj |zk). In the expectation step we compute:
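That is, the posterior probability of each latent factor given an observed session-pageview pair; in the standard formulation this is:

\[
\Pr(z_k \mid u_i, p_j) \;=\; \frac{\Pr(z_k)\,\Pr(u_i \mid z_k)\,\Pr(p_j \mid z_k)}
{\sum_{k'=1}^{l}\Pr(z_{k'})\,\Pr(u_i \mid z_{k'})\,\Pr(p_j \mid z_{k'})}.
\]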

In the maximization step, we aim at maximizing the expectation of the complete data likelihood E(LC), while taking into account the constraint that the factor probabilities sum to one, Σ_{k=1..l} Pr(zk) = 1, as well as the analogous normalization constraints Σ_i Pr(ui|zk) = 1 and Σ_j Pr(pj|zk) = 1 on the two conditional probabilities.


Through the use of Lagrange multipliers (see for details), we can solve the constrained maximization problem to get the following equations for the re-estimated parameters:
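The re-estimation equations, written here in the usual weighted-PLSA form rather than quoted from the original, are:

\[
\Pr(u_i \mid z_k) = \frac{\sum_{j} w(u_i, p_j)\,\Pr(z_k \mid u_i, p_j)}
{\sum_{i'}\sum_{j} w(u_{i'}, p_j)\,\Pr(z_k \mid u_{i'}, p_j)},\qquad
\Pr(p_j \mid z_k) = \frac{\sum_{i} w(u_i, p_j)\,\Pr(z_k \mid u_i, p_j)}
{\sum_{i}\sum_{j'} w(u_i, p_{j'})\,\Pr(z_k \mid u_i, p_{j'})},
\]
\[
\Pr(z_k) = \frac{\sum_{i}\sum_{j} w(u_i, p_j)\,\Pr(z_k \mid u_i, p_j)}
{\sum_{i}\sum_{j} w(u_i, p_j)}.
\]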

    Iterating the above computation of expectation and maximization steps monotonically increases

    the total likelihood of the observed data L(U, P) until a local optimal solution is reached. The

    computational complexity of this algorithm is O(mnl), where m is the number of user sessions, n

    is the number of page views, and l is the number of factors. Since the usage observation matrix

    is, in general, very sparse, the memory requirements can be dramatically reduced using efficient

    sparse matrix representation of the data.
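As a concrete illustration of the procedure described above, the sketch below runs the weighted-PLSA EM iteration on a small dense session-pageview matrix; a production implementation would use the sparse representation just mentioned, and all function and variable names here are ours, not from the paper:

```python
import numpy as np

def plsa_em(UP, n_factors, n_iter=50, seed=0):
    """Fit PLSA to an m x n session-pageview weight matrix UP via EM (dense sketch)."""
    rng = np.random.default_rng(seed)
    m, n = UP.shape
    # Random, normalized initialization of Pr(z), Pr(u|z) and Pr(p|z).
    pz = rng.random(n_factors); pz /= pz.sum()
    pu_z = rng.random((m, n_factors)); pu_z /= pu_z.sum(axis=0)
    pp_z = rng.random((n, n_factors)); pp_z /= pp_z.sum(axis=0)
    for _ in range(n_iter):
        # E-step: posterior Pr(z | u, p) for every (session, pageview) pair.
        joint = np.einsum('k,ik,jk->ijk', pz, pu_z, pp_z)           # m x n x l
        post = joint / np.clip(joint.sum(axis=2, keepdims=True), 1e-12, None)
        # M-step: re-estimate the parameters from the weighted posteriors.
        resp = UP[:, :, None] * post                                # w(u,p) * Pr(z|u,p)
        pu_z = resp.sum(axis=1); pu_z /= np.clip(pu_z.sum(axis=0), 1e-12, None)
        pp_z = resp.sum(axis=0); pp_z /= np.clip(pp_z.sum(axis=0), 1e-12, None)
        pz = resp.sum(axis=(0, 1)); pz /= pz.sum()
    return pz, pu_z, pp_z
```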

    DISCOVERY AND ANALYSIS OF USAGE PATTERN WITH PLSA

    One of the main advantages of PLSA model in Web usage mining is that it generates

    probabilities which quantify relationships between Web users and tasks, as well as Web pages

    and tasks. From these basic probabilities, using probabilistic inference, we can derive

    relationships among users, among pages, and between users and pages. Thus this framework

    provides a flexible approach to model a variety of types of usage patterns. In this section, we will

    describe various usage patterns that can be derived using the PLSA model. As noted before, the

PLSA model generates probabilities Pr(zk), which measure the probability that a certain task is

    chosen; Pr(ui|zk), the probability of observing a user session given a certain task; and Pr(pj |zk),

    the probability of a page being visited given a certain task. Applying Bayes rule to these

    probabilities, we can generate the probability that a certain task is chosen given an observed user

    session:
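Presumably, via Bayes' rule:

\[
\Pr(z_k \mid u_i) \;=\; \frac{\Pr(z_k)\,\Pr(u_i \mid z_k)}{\sum_{k'=1}^{l}\Pr(z_{k'})\,\Pr(u_i \mid z_{k'})},
\]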


    and the probability that a certain task is chosen given an observed page view:
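which, analogously (again reconstructed from the notation above), would be:

\[
\Pr(z_k \mid p_j) \;=\; \frac{\Pr(z_k)\,\Pr(p_j \mid z_k)}{\sum_{k'=1}^{l}\Pr(z_{k'})\,\Pr(p_j \mid z_{k'})}.
\]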

    In the following, we discuss how these models can be used to derive different kinds of usage

    patterns. We will provide several illustrative examples of such patterns, from real Web usage

    data, in Section 4.

    Characterizing Tasks by Page views or by User Sessions

    Capturing the tasks or objectives of Web users can help the analyst to better understand these

users' preferences and interests. Our goal is to characterize each task, represented by a latent

    factor, in a way that is easy to interpret. One possible approach is to find the prototypical pages

    that are strongly associated with a given task, but that are not commonly identified as part of

    other tasks. We call each such page a characteristic page for the task, denoted by pch. This

    definition of prototypical has two consequences; first, given a task, a page which is seldom

    visited cannot be a good characteristic page for that task. Secondly, if a page is frequently visited

    as part of a certain task, but is also commonly visited in other tasks, the page is not a good

    characteristic page. So we define characteristic pages for a task zk as the set of all pages, pch,

    which satisfy:
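Judging from the description above and from the use of Pr(page|task) · Pr(task|page) in the experiments later on, the criterion is presumably of the form:

\[
\Pr(p^{ch} \mid z_k)\cdot\Pr(z_k \mid p^{ch}) \;\ge\; \mu,
\]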

where μ is a predefined threshold. By examining the characteristic pages of each task, we can

    obtain a better understanding of the nature of these tasks. Characterizing tasks in this way can

    lead to several applications. For example, most Web sites allow users to search for relevant

    pages using keywords. If we also allow users to explicitly express their intended task(s) (via

    inputting task descriptions or choosing from a task list), we can return the characteristic pages for

    the specified task(s), which are likely to lead users directly to their objectives. A similar


    approach can be used to identify prototypical user sessions for each task. We believe that a

    user session involving only one task can be considered as the characteristic session for the task.

    So, we define the characteristic user sessions, uch, for a task, zk, as sessions which satisfy
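By analogy with the characteristic pages, one plausible form of this criterion is:

\[
\Pr(u^{ch} \mid z_k)\cdot\Pr(z_k \mid u^{ch}) \;\ge\; \mu',
\]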

where μ′ is a predefined threshold. When a user selects a task, returning such exemplar sessions

    can provide a guide to the user for accomplishing the task more efficiently. This approach can

    also be used in the context of collaborative filtering to identify the closest neighbors to a user

    based on the tasks performed by that user during an active session.

    User Segments Identification

    Identifying Web user groups or segments is an important problem in Web usage mining. It helps

Web site owners to understand and capture users' common interests and preferences. We can identify user segments in which users perform common or similar tasks, by making inferences

    based on the estimated conditional probabilities obtained in the learning phase. For each task zk,

    we choose all user sessions with probability Pr(ui|zk) exceeding a certain threshold to get a

session set C. Since each user session can also be represented as a page view vector, we can further aggregate these user sessions into a single page view vector to facilitate interpretation. The algorithm for generating user segments is as follows:

1. Input: Pr(ui|zk), the user session-page matrix UP, and a threshold μ.

2. For each zk, choose all the sessions with Pr(ui|zk) ≥ μ to get a candidate session set C.

3. For each zk, compute the weighted average of all the chosen sessions in set C to get a page

    vector defined as:


4. For each factor zk, output the resulting page vector. This page vector consists of a set of weights, one for each page view in P, which represent the relative visit frequency of each page view for this user

    segment. We can sort the weights so that the top items in the list correspond to the most

    frequently visited pages for the user segment. These user segments provide an aggregate

representation of all individual users' navigational activities in a particular group. In addition

    to their usefulness in Web analytics, user segments also provide the basis for automatically

    generating item recommendations. Given an active user, we compare her activity to all user

    segments and find the most similar one. Then, we can recommend items (e.g., pages) with

    relatively high weights in the aggregate representation of the segment. In Section 4, we conduct

    experimental evaluation of the user segments generated from two real Web sites.
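A short sketch of how the segment page vectors and the recommendation step described above could be realized is given below; weighting each chosen session by Pr(u|z) is one plausible reading of step 3, and all identifiers are ours:

```python
import numpy as np

def user_segments(UP, pu_z, threshold=0.01):
    """One aggregate page-weight vector per latent factor, built from the sessions
    whose Pr(u|z) exceeds the threshold (the candidate set C above)."""
    segments = []
    for k in range(pu_z.shape[1]):
        chosen = np.where(pu_z[:, k] >= threshold)[0]
        w = pu_z[chosen, k]
        segments.append((w[:, None] * UP[chosen]).sum(axis=0) / max(w.sum(), 1e-12))
    return np.array(segments)                       # shape: n_factors x n_pageviews

def recommend(active_session, segments, top_n=5):
    """Match the active session to the most similar segment (cosine similarity)
    and return the indices of its highest-weight pages not yet visited."""
    sims = segments @ active_session / (
        np.linalg.norm(segments, axis=1) * np.linalg.norm(active_session) + 1e-12)
    best = segments[int(np.argmax(sims))]
    return [j for j in np.argsort(-best) if active_session[j] == 0][:top_n]
```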

    Identifying the Underlying Tasks of a User Session

    To better understand the preferences and interests of a single user, it is necessary to identify the

    underlying tasks performed by the user. The PLSA model provides a straightforward way to

    identify the underlying tasks in a given user session. This is done by examining Pr(task|session),

    which is the probability of a task being performed, given the observation of a certain user

    session. For a user session u, we select the top tasks zk with the highest Pr(zk|u) values, as the

    primary task(s) performed by this user. For a new user session, unew, not appearing in the

historical navigational data, we can adopt a folding-in method, as introduced in the original PLSA work, to generate Pr(task|session) via the EM algorithm. In the E-step, we compute
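In the usual folding-in scheme, which keeps Pr(zk) and Pr(pj|zk) fixed and only fits the mixing weights of the new session, the updates are presumably:

\[
\Pr(z_k \mid u_{new}, p_j) = \frac{\Pr(z_k \mid u_{new})\,\Pr(p_j \mid z_k)}{\sum_{k'}\Pr(z_{k'} \mid u_{new})\,\Pr(p_j \mid z_{k'})},
\qquad
\Pr(z_k \mid u_{new}) \;\propto\; \sum_{j} w(u_{new}, p_j)\,\Pr(z_k \mid u_{new}, p_j).
\]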

Here, w(unew, p) represents the new user's visit frequency on the specified page p. After we

    generate these probabilities, we can use the same method to identify the primary tasks for the

    new user session. The identification of the primary tasks contained in user sessions can lead to

    further analysis. For example, after identifying the tasks in all user sessions, each session u can

    be transformed into a higher-level representation,


    where zi denotes task i and wi denotes Pr(zi|u). This, in turn, would allow the discovery and

    analysis of task-level usage patterns, such as determining which tasks are likely to be visited

    together, or which tasks are most (least) popular, etc. Such higher-level patterns can help site

    owners better evaluate the Web site organization.

    Integration of Usage Patterns with Web Content Information

    Recent studies have emphasized the benefits of integrating semantic knowledge about the

    domain (e.g., from page content features, relational structure, or domain ontologies) in the Web

    usage mining process. The integration of content information about Web objects with usage

    patterns involving those objects provides two primary advantages. First, the semantic

    information provides additional clues about the underlying reasons for which a user may or may

    not be interested in particular items. Secondly, in cases where little or no rating or usage

    information is available (such as in the case of newly added items, or in very sparse data sets),

    the system can still use the semantic information to draw reasonable conclusions about user

    interests. The PLSA model described here also provides an ideal and uniform framework for

    integrating content and usage information. Each page view contains certain semantic knowledge

    represented by the content information associated with that page view.

    By applying text mining and information retrieval techniques, we can represent each page

    view as an attribute vector. Attributes may be the keywords extracted from the page views, or

    structured semantic attributes of the Web objects contained in the page views. As before, we

assume there exists a set of hidden factors z ∈ Z = {z1, z2, ..., zl}, each of which represents a

    semantic group of pages. They can be a group of pages which have similar functionalities for

    users performing a certain task, or a group of pages which contain similar content information or

    semantic attributes. However, now, in addition to the set of page views, P, and the set of usersessions, U, we also specify a set of t semantic attributes, A = {a1, a2, . . . , at}. To model the

    user-page observations, we use


    These models can then be combined based on the common component Pr(pj |zk). This can be

achieved by maximizing the following log-likelihood function with a predefined weight α.
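One common way to write such a combined objective, with w′(aq, pj) denoting attribute-pageview weights analogous to w(ui, pj) (a reconstruction, not a verbatim quote), is:

\[
\mathcal{L} \;=\; \alpha\sum_{i}\sum_{j} w(u_i, p_j)\log\Pr(u_i, p_j)
\;+\;(1-\alpha)\sum_{q}\sum_{j} w'(a_q, p_j)\log\Pr(a_q, p_j),
\]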

where α is used to adjust the relative weights of the two sets of observations. The EM algorithm can again

    be used to generate estimates for Pr(zk), Pr(ui|zk), Pr(pj |zk), and Pr(aq|zk). By applying

    probabilistic inferences, we can measure the relationships among users, pages, and attributes,

thus we are able to answer questions such as "What are the most important attributes for a group of users?" or "Given a Web page with a specified set of attributes, will it be of interest to a given user?", and so on.

    EXPERIMENTS WITH PLSA MODEL

    In this section, we use two real data sets to perform experiments with our PLSA-based Web

usage mining framework. We first provide several illustrative examples of characterizing users'

    tasks, as introduced in the previous section, and of identifying the primary tasks in an individual

    user session. We then perform two types of evaluations based on the generated user segments.

    First we evaluate individual user segments to determine the degree to which they represent

activities of similar users. Secondly, we evaluate the effectiveness of these user segments in the

    context of generating automatic recommendations. In each case, we compare our approach with

    the standard clustering approach for the discovery of Web user segments. In order to compare the

clustering approach to the PLSA-based model, we adopt the algorithm presented in prior work for creating

    aggregate profiles based on session clusters. In the latter approach, first, we apply a


    multivariate clustering technique such as k-means to user-session data in order to obtain a set of

    user clusters TC = {c1, c2, ..., ck}; then, an aggregate representation, prc, is generated for each

    cluster c as a set of page view-weight pairs:

where the significance weight, weight(p, prc), is given by weight(p, prc) = (1/|c|) Σ_{u∈c} w(p, u), and w(p, u) is the weight of page view p in the user session u ∈ c. Thus, each segment is represented

    as a vector in the page view space. In the following discussion, by a user segment, we mean its

    aggregate representation as a page view vector.

    Data Sets

    In our experiments, we use Web server log data from two Web sites. The first data set is based

on the server log data from the host Computer Science department. This Web site provides

    various functionalities to different types of Web users. For example, prospective students can

    obtain program and admissions information or submit online applications. Current students can

    browse course information, register for courses, make appointments with faculty advisors, and

    log into the Intranet to do degree audits. Faculty can perform student advising functions online or

    interact with the faculty Intranet. After data preprocessing, we identified 21,299 user sessions

    (U) and 692 pageviews (P), with each user session consisting of at least 6 pageviews. This data

    set is referred to as the CTI data. The second data set is from the server logs of a local affiliate

    of a national real estate company. The primary function of the Web site is to allow prospective

buyers to visit various pages containing information related to some 300 residential properties. The

    portion of the Web usage data during the period of analysis contained approximately 24,000 user

    sessions from 3,800 unique users. During preprocessing, we recorded each user-property pair

    and the corresponding visit frequency. Finally, the data was filtered to limit the final data set to

    those users that had visited at least 3 properties. In our final data matrix, each row represented a

    user vector with properties as dimensions and visit frequencies as the corresponding dimension

    values. We refer to this data set as the Realty data. Each data set was randomly divided into

    multiple training and test sets to use with 10-fold cross-validation. By conducting sensitivity

    analysis, we chose 30 factors in the case of CTI data and 15 factors for the Realty data. To avoid

    overtraining, we implemented the Tempered EM algorithm to train the PLSA model.


Example Usage Patterns Based on the PLSA Model

    Figure 1 depicts an example of the characteristic pages for a specific discovered task in the CTI

data. The first 6 pages have the highest Pr(page|task) · Pr(task|page) values, and thus are considered as

    the characteristic pages of this task. Observing these characteristic pages, we may infer that this

    task corresponds to prospective students who are completing an online admissions application.

    Here characteristic has two implications. First, if a user wants to perform this task, he/she must

    visit these pages to accomplish his/her goal. Secondly, if we find a user session contains these

pages, we can claim the user must have performed the online application task. Some pages may not be

    characteristic pages for the task, but may still be useful for the purpose of analysis. An example

    of such a page is the /news/ page which has a relatively high Pr(page|task) value, and a low


Pr(task|page) value. Indeed, by examining the site structure, we found that this page serves

    as a navigational page, and it can lead users to different sections of the site to perform different

tasks (including the online application). This kind of discovery can help Web site designers to identify the functionalities of pages and reorganize Web pages to facilitate users' navigation.

    Figure 2 identifies three tasks in the Realty data. In contrast to the CTI data, in this data set the

    tasks represent common real estate properties visited by users, thus reflecting user interests in

    similar properties. The similarities are clearly observed when property attributes are shown for

    each characteristic page. From the characteristic pages of each task, we infer that Task 4

represents users' interest in newer and more expensive properties, while Task 0 reflects interest in older and very low priced properties. Task 5 represents interest in properties with midrange prices.

    We can also identify prototypical users corresponding to specific tasks. An example of such a

    user session is depicted in Figure 3 corresponding to yet another task in the realty data which

    reflects interest in very high priced and large properties (task not shown here).


Our final example in this section shows how the prominent tasks contained in a given user session can be identified. Figure 4 depicts a random user session from the CTI data. Here we only show the task IDs which have the highest probabilities Pr(task|session). As indicated, the

    dominant tasks for this user session are Tasks 3 and 25. The former is, in fact, the online

    application task discussed earlier, and the latter is a task that represents international students

    who are considering applying for admissions. It can be easily observed that, indeed, this session

    seems to identify an international student who, after checking admission and visa requirements,

    has applied for admissions online.


    Evaluation of User Segments and Recommendations

    We used two metrics to evaluate the discovered user segments. The first is called the Weighted

    Average Visit Percentage (WAVP). WAVP allows us to evaluate each segment individually

    according to the likelihood that a user who visits any page in the segment will visit the rest of the

    pages in that segment during the same session. Specifically, let T be the set of transactions in the

    evaluation set, and for a segment s, let Ts denote a subset of T whose elements contain at least

one page from s. The weighted average similarity to the segment s over these transactions is then computed, taking both the transactions and the segment as pageview vectors:
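One standard way of writing this metric, with t · s the dot product of a transaction vector and the segment vector (reconstructed here, since the displayed formula is not reproduced), is:

\[
\mathrm{WAVP}(s) \;=\; \left(\frac{\sum_{t \in T_s} t \cdot s}{|T_s|}\right) \Big/ \sum_{p \in s}\mathrm{weight}(p, s).
\]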


    Note that a higher WAVP value implies better quality of a segment in the sense that the segment

    represents the actual behavior of users based on their similar activities. For evaluating the

    recommendation effectiveness, we use a metric called Hit Ratio in the context of top-N

    recommendation.

    For each user session in the test set, we took the first K pages as a representation of an

    active session to generate a top-N recommendation set. We then compared the recommendations

    with the pageview (K +1) in the test session, with a match being considered a hit. We define the

    Hit Ratio as the total number of hits divided by the total number of user sessions in the test set.

    Note that the Hit Ratio increases as the value of N (number of recommendations) increases.

Thus, in our experiments, we pay special attention to smaller numbers of recommendations (between

1 and 20) that result in good hit ratios.
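The evaluation protocol just described can be sketched as follows; the recommender argument stands for any top-N recommendation function, for example the segment-based one sketched earlier, and the names are ours:

```python
def hit_ratio(test_sessions, recommender, k=3, top_n=10):
    """Fraction of test sessions whose pageview K+1 appears among the top-N
    recommendations generated from the session's first K pageviews."""
    hits, evaluated = 0, 0
    for session in test_sessions:        # each session: an ordered list of page ids
        if len(session) <= k:
            continue
        recs = recommender(session[:k], top_n)
        hits += int(session[k] in recs)
        evaluated += 1
    return hits / max(evaluated, 1)
```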


    In the first set of experiments we compare the WAVP values for the generated segments using

the PLSA model and those generated by the clustering approach. Figures 5 and 6 depict these results for the CTI and Realty data sets, respectively. In each case, the segments are ranked in decreasing order of WAVP. The results show clearly that the probabilistic segments based on the latent factors provide a significant advantage over the clustering approach. In the second

    set of experiments we compared the recommendation accuracy of the PLSA model with that of

k-means clustering segments. In each case, the recommendations are generated according to the


    recommendation algorithm presented in Section 3.2. The recommendation accuracy is measured

based on the hit ratio for different numbers of generated recommendations. These results are depicted

    in Figures 7 and 8 for the CTI and Realty data sets, respectively. Again, the results show a clear

    advantage for the PLSA model. In most realistic situations, we are interested in a small, but

    accurate, set of recommendations. Generally, a reasonable recommendation set might contain 5

to 10 recommendations. Indeed, this range of values seems to represent the largest improvements

    of the PLSA model over the clustering approach.

ODP: The Open Directory Project

    Description. The DMOZ Open Directory Project (ODP) [20] is the largest, most

    comprehensive human-edited web page catalog currently available. It covers 4 million sites filed

    into more than 590,000 categories (16 wide-spread top-categories, such as Arts, Computers,

News, Sports, etc.). Currently, there are more than 65,000 volunteer editors maintaining it. ODP's data structure is organized as a tree, where the categories are internal nodes and pages are

    leaf nodes. By using symbolic links, nodes can appear to have several parent nodes. Since ODP

    truly is free and open, everybody can contribute or re-use the dataset, which is available in RDF

(structure and content are available separately). Google, for example, uses ODP as the basis for its

    Google Directory service.

    Applications

    Besides its re-use in other directory services, the ODP taxonomy is used as a basis for various

    other research projects. In Persona, ODP is applied to enhance HITS with dynamic user profiles

    using a tree coloring technique (by keeping track of the number of times a user has visited

    pages of a specific category). Users can rate a page as being good or unrelated regarding

their interest. This data is then used to rank interesting results and omit unwanted ones. While Persona asks users for feedback, we only rely on user profiles, i.e., a one-time user interaction. Moreover, we do not develop our search algorithm on top of HITS, but on top of any search algorithm, as a refinement. A similar approach using the ODP taxonomy has also been applied to a recommender system for research papers. The Open Directory can also be used as a reference source containing

    good pages, to fight web spam containing uninteresting URLs through white listing, as a web

    corpus for comparisons of rank algorithms, as well as for focused crawling towards special-


    interest pages. Unfortunately, the free availability of ODP also has its downside. A clone of the

directory modified to contain some spam pages could trap people into linking to this fake directory,

    which results in an increased ranking not only for this directory clone, but also for the injected

    spam pages.

    Page Rank and Personalized Page Rank

    Page Rank computes Web page scores based on the graph inferred from the link structure of the

    Web. It is based on the idea that a page has high rank if the sum of the ranks of its back links is

    high. Given a page p, its input I(p) and output O(p) sets of links, the Page Rank formula is:
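Consistent with the random-surfer description that follows, the standard form of this formula is:

\[
\mathrm{PR}(p) \;=\; (1-c)\sum_{q \in I(p)}\frac{\mathrm{PR}(q)}{|O(q)|} \;+\; c\,E(p),
\]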

    The dampening factor c < 1 (usually 0.15) is necessary to guarantee convergence and to limit the

    effect of rank sinks [2]. Intuitively, a random surfer will follow an outgoing link from the current

page with probability (1 − c) and will get bored and select a random page with probability c (i.e.,

    the E vector has all entries equal to 1/N, where N is the number of pages in the Web graph).

Initial steps towards personalized page ranking were already described in early work, which proposed a slight modification of the above algorithm to redirect the random surfer towards preferred

    pages using the E vector. Several distributions for this vector have been proposed since.

    Topic-sensitive Page Rank

    Haveliwala builds a topic oriented Page Rank, starting by computing off-line a set of 16 Page-

    Rank vectors biased on each of the 16 main topics of the Open Directory Project. Then, the

    similarity between a user query and each of these topics is computed, and the 16 vectors are

combined using appropriate weights. Personalized Page Rank. A more recent investigation uses

    a different approach: it focuses on user profiles. One Personalized Page Rank Vector (PPV) is

    computed for each user. The personalization aspect of this algorithm stems from a set of hubs

(H), each user having to select her preferred pages from it. PPVs can be expressed as a linear

    combination of PPVs for preference vectors with a single non-zero entry corresponding to each


    of the pages from the preference set (called basis vectors). The advantage of this approach is that

for a hub set of N pages, one can compute 2^N Personalized Page Rank vectors without having to run the algorithm again, unlike the earlier approaches, where the whole computation must be performed for each

    biasing set. The disadvantages are forcing the users to select their preference set only from

    within a given group of pages (common to all users), as well as the relatively high computation

    time for large scale graphs.

    USING ODP METADATA FOR PERSONALIZED SEARCH

    Motivation. We presented in Section 2.2 the most popular approaches to personalizing Web

search. Even though they are the best so far, they all have some important drawbacks. In the basic personalized Page Rank approach, we need to run the entire algorithm for each preference set (or biasing set), which is practically impossible in a large-scale system. At the other end, topic-sensitive Page Rank computes biased PageRank vectors limited only to the broad 16 top-level categories of the ODP, because of the same problem. The hub-based Personalized Page Rank improves this somewhat, allowing the algorithm to bias on any subset of a given set of pages (H).

    Although work has been done in the direction of improving the quality of this latter set [4], one

    limitation is still that the preference set is restricted to a subset of this given set H (if H = {CNN,

    FOX News} we cannot bias on MSNBC for example). More importantly, the bigger H is, the

more time is needed to run the algorithm. (Note that hubs were defined here as pages with high Page Rank, differently from the more popular definition.) Thus, finding a simpler and faster algorithm with at least similar personalization granularity is still a worthy

    goal to pursue. In the following we make another step towards this goal. Introduction. Our first

    step was to evaluate how ODP search compares with Google search, specifically exploiting the

    fact that all ODP entries are categorized into the ODP topic hierarchy. We started with the

following two questions: 1. Given the fact that ODP includes just 4 million entries, and the

    Google database includes 8 billion, does ODP-based search stand a chance of being comparable

    to Google? 2. ODP advanced search offers a rudimentary personalized search feature by

    restricting the search to the entries of just one of the 16 main categories. Google directory offers

    a related feature, by offering to restrict search to a specific category or subcategory. Can we

    improve this personalized search feature, taking the user profile into account in a more


    sophisticated way, and how does such an enhanced personalized search on the ODP or Google

    entries compare to ordinary Google results? Most people would probably answer (1) No, not

    yet, and (2) Yes. In the following Section we will prove the correctness of the second answer

    by introducing a new personalized search algorithm, and then we will concentrate on the first

    answer in the experiments Section.

    Algorithm

Our algorithm exploits the annotations accumulated in generic large-scale taxonomies such

    as the Open Directory. Even though we concentrate our forthcoming discussion on ODP,

    practically any similar taxonomy can be used. These annotations can be easily used to achieve

    personalization, and can also be combined with the initial Page Rank algorithm. We define user

    profiles using a simple approach: each user has to select several topics from the ODP, which best

    fit her interests. For example, a user profile could look like this:

    Then, at run-time, the output given by a search service (from Google, ODP Search, etc.) is re-

    sorted using a calculated distance from the user profile to each output URL. The execution is

    also depicted in Algorithm 3.1.
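A minimal sketch of this re-sorting step is given below, assuming each result URL carries zero or more directory topics (as ODP and the Google Directory provide) and using the naive minimum tree distance discussed in the next subsection; the helper names and the tie-handling policy are ours:

```python
def tree_distance(topic_a, topic_b):
    """Naive ODP tree distance: edges from each topic up to the deepest common
    ancestor (the subsumer), e.g. '/Arts/Architecture' vs
    '/Arts/Design/Interior_Design/Events/Competitions' -> 1 + 4 = 5."""
    a, b = topic_a.strip('/').split('/'), topic_b.strip('/').split('/')
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)

def profile_distance(profile_topics, url_topics):
    """Minimum distance over the Cartesian product of the two topic sets."""
    return min(tree_distance(a, b) for a in profile_topics for b in url_topics)

def personalize(results, profile_topics):
    """Re-sort search-engine output by distance to the user profile.
    `results` is a list of (url, [topics]); un-annotated URLs keep their original
    order behind the annotated ones (one simple policy, not from the paper)."""
    annotated = [(u, t) for u, t in results if t]
    rest = [u for u, t in results if not t]
    annotated.sort(key=lambda r: profile_distance(profile_topics, r[1]))
    return [u for u, _ in annotated] + rest
```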


Distance Metrics. When performing search on Open Directory, each resulting URL comes with

    an associated ODP topic. Similarly, a good amount of the URLs output by Google is connected

    to one or more topics within the Google Directory (almost 50%, as discussed in Section 3.2).

    Therefore, in both cases, for each output URL we are dealing with two sets of nodes from the

    topic tree: (1) Those representing the user profile (set A), and (2) those associated with the URL

    (set B). The distance between these sets can then be defined as the minimum distance between

all pairs of nodes given by the Cartesian product A × B. Finally, there are quite a few

    possibilities to define the distance between two nodes. Even though, as we will see from the

    experiments, the simplest approaches already provide very good results, we are now performing

an optimality study to determine which metric best fits this kind of search. In the following, we will present our best solutions so far. Naive Distances. The simplest solution is the minimum

    tree distance, which, given two nodes a and b, returns the sum of the minimum number of tree

edges between a and the subsumer (the deepest node common to both a and b) plus the

    minimum number of tree edges between b and the subsumer (i.e., the shortest path between a and

    b). On the example from Figure 1, the distance between /Arts/Architecture and

    /Arts/Design/Interior Design/Events/Competitions is 5, and the subsumer is /Arts. If we also

    consider the inter-topic links from the Open Directory, the simplest distance becomes the graph


    shortest path between a and b. For example, if there is a link between Interior Design and

    Architecture in Figure 1, then the distance between Competitions and Architecture is 3. This

solution requires loading either the entire topic graph or all the inter-topic links into memory.

    Furthermore, its utility is subjective from user to user: the existence of a link between

    Architecture and Interior Design does not always imply that a famous architect (one level below

    in the tree) is very close to the area of interior design. We can consider these links in our metric

in three ways: 1. Consider the graph containing all inter-topic links and output the shortest path between a and b. 2. Consider the graph containing only the inter-topic links directly connected to a and b and output the shortest path. 3. If there is an inter-topic link between a and b, output 1; otherwise, ignore all inter-topic links and output the tree distance between a and b. (We refer the reader to the above-mentioned optimality study for an in-depth view of the approach we took.) Complex Distances. The

    main drawback of the above metrics comes from the fact that they ignore the depth of the

    subsumer. The bigger this depth is, the more related are the nodes (i.e., the concepts represented

by them). This problem is addressed by work which investigates ten intuitive strategies for measuring semantic similarity between words using hierarchical semantic knowledge bases such as WordNet [18]. Each of them was evaluated experimentally on a group of testers, the best one having a

    0.9015 correlation between the human judgment and the following formula:
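This appears to be the well-known WordNet-based word-similarity measure of Li, Bandar and McLean; reconstructed here, with α and β the two tuning parameters mentioned next, it reads:

\[
\mathrm{sim}(w_1, w_2) \;=\; e^{-\alpha l}\cdot\frac{e^{\beta h}-e^{-\beta h}}{e^{\beta h}+e^{-\beta h}},
\]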

    The parameters are as follows:

α and β were defined as 0.2 and 0.6 respectively, h is the tree depth of the subsumer, and l is the

    semantic path length between the two words. Considering we have several words attached to

    each concept and sub-concept, then l is 0 if the two words are in the same concept, 1 if they are

    in different concepts, but the two concepts have at least one common word, or the tree shortest

path if the words are in different concepts which do not contain common words. Although this measure is very good for words, it is not perfect when we apply it to the Open Directory topical tree, because it does not distinguish between the distance from a (the profile node) to the

    subsumer, and the distance from b (the output URL) to the subsumer. Consider node a to be

    /Top/Games and b to be /Top/Computers/Hardware/Components/Processors/x86. A teenager

    interested in computer games (level 2 in the ODP tree) could be very satisfied receiving a page


    about new processors (level 6 in the tree) which might increase his gaming quality. On the other

    hand, the opposite scenario (profile on level 6 and output URL on level 2) does not hold any

    more, at least not to the same extent: a processor manufacturer will generally be less interested in

    the games existing on the market. This leads to our following extension of the above formula:

with l1 being the shortest path from the profile to the subsumer, l2 the shortest path from the URL to the subsumer, and a weighting parameter in [0, 1]. Combining the Distance Function with Google

    Page Rank. And yet something is still missing. If we use Google to do the search and then sort

    the URLs according to the Google Directory taxonomy, some high quality pages might be

    missed (i.e., those which are top ranked, but which are not in the directory). In order to integrate

    that, the above formula could be combined with the Google Page Rank. We propose the

    following approach:

    Conclusion. Human judgment is a non-linear process over information sources, and therefore it

    is very difficult (if not impossible) to propose a metric which is in perfect correlation to it. A

    thorough experimental analysis of all these metrics (which we are currently performing, but

    which is outside the scope of this paper) could give us a good enough approximation. In the next

    Section we will present some experiments using the simple metric presented first, and show that

    it already yields quite reasonable improvements.


    Experimental Results

    To evaluate the benefits of our personalization algorithm, we interviewed 17 of our colleagues

    (researchers in different computer science areas, psychologists, pedagogues and designers),

    asking each of them to define a user profile according to the Open Directory topics (see Section

3.1 for an example profile), as well as to choose three queries of the following types:

One clear query, which they knew to have one or at most two meanings;

One relatively ambiguous query, which they knew to have two or three meanings;

One ambiguous query, which they knew to have at least three meanings, preferably more.

We then compared test results using the following four types of Web search:

1. Plain Open Directory Search;

2. Personalized Open Directory Search, using our algorithm from Section 3.1 to reorder the top 1000 results returned by the ODP Search;

3. Google Search, as returned by the Google API [8];

4. Personalized Google Search, using our algorithm from Section 3.1 to reorder the top 100 URLs returned by the Google API, and having as input the Google Directory topics returned by the API for each resulting URL.

For each algorithm, each tester received the top 5 URLs with

    respect to each type of query, 15 URLs in total. All test data was shuffled, such that testers were

    neither aware of the algorithm, nor of the ranking of each assessed URL. We then asked the

    subjects to rate each URL from 1 to 5, 1 defining a very poor result with respect to their profile

and expectations (e.g., topic of the result, content, etc.) and 5 a very good one. Finally, for each sub-set of 5 URLs we took the average grade as a measure of importance attributed to that algorithm/query-type pair.

The average values for all users and for each of these pairs can be found in Table 1,

    together with the averages over all types of queries for each algorithm. We of course expected

    the plain ODP search to be significantly worse than the Google search, and that was the case:

    an average of 2.41 points for ODP versus the 2.76 average received by Google. Also predictable

was the dependence of the grading on the query type. If we average the values on the three columns representing each query type, we get 2.54 points for ambiguous queries, 2.91 for semi-ambiguous ones and 3.25 for clear ones; thus, the clearer the query, the better rated the returned URLs. Personalized Search using ODP. But the same Table 1 also provides us with a more surprising result: the personalized search algorithm is clearly better than Google search, regardless of whether we use the Open Directory or the Google Directory as taxonomy. Therefore, a


    personalized search on a well-selected set of 4 million pages often provides better results than a

non-personalized one over an 8-billion-page set. This is a clear indicator that taxonomy-based result

    sorting is indeed very useful. For the ODP experiments, only our clear queries did not receive a

    big improvement, mainly because for some of

    these queries ODP contains less than 5 URLs matching both the query and the topics expressed

    in the user profile. Personalized Search using Google. Similarly, personalized search using

    Google Directory was far better than the usual Google search. We would have expected it to be

    even better than the ODP based personalized search, but results were probably negatively

    influenced by the fact that the ODP experiments were run on 1000 results, whereas the Google

    Directory ones only on 100, due to the limited number of Google API licenses we had. The

    grading results are summarized in Figure 2. Generally, we can conclude that personalization

    significantly increases output quality for ambiguous and semi-ambiguous queries. For clear

    queries, one should prefer Google to Open Directory search, but also Google Directory search to


the plain Google search. Also, the answers we sketched in the beginning of this Section proved to be true: Google search is still better than Open Directory search, but we have provided a personalized search algorithm which outperforms the existing Google and Open Directory search capabilities. Another interesting result is that 40.98% of the top 100 Google pages were also contained in the Google Directory. More specifically, for the ambiguous queries 48.35% of the top pages were in the directory, for the semi-ambiguous ones 41.35%, and for the clear ones 33.23%.

Finally, let us add that we performed statistical significance tests on our experiments, obtaining the following results: statistical significance with an error rate below 1% for the algorithm criterion, i.e., there is a significant difference between the gradings of the different algorithms; an error rate below 25% for the query type criterion, i.e., the difference between the average grades with respect to query types is less statistically significant; and statistical significance with an error rate below 5% for the interrelation between query type and algorithm.

    EXTENDING ODP ANNOTATIONS TO THE WEB

In the last Section we have shown that using ODP entries and their categorization directly for personalized search works surprisingly well. Can this huge annotation effort invested in the ODP project (with 65,000 volunteers participating in building and maintaining the ODP database) be extended to the rest of the Web? This would be useful if we want to find less highly rated pages that are not contained in the directory. Simply extending the ODP effort does not scale: first, significantly increasing the number of volunteers seems improbable, and second, extending the selection of ODP entries to a larger percentage of the Web obviously becomes harder and less rewarding once we try to include more than just the most important pages for a specific topic. We start


with the following questions: Given that Page Rank for a large collection of Web pages can be biased towards a smaller subset, can this be done with sets of ODP entries corresponding to given categories or subcategories as well? More specifically, the ODP entries for a category include many of the most important pages on that topic; do we have enough entries for each topic such that biasing on these entries makes a difference?

    When does biasing make a difference?

One of the most important works investigating Page Rank biasing first uses the 16 top levels of the ODP to bias Page Rank on, and then provides a method to combine these 16 resulting vectors into a more query-dependent ranking. But what if we would like to use one or several ODP (sub-)topics to compute a personalized Page Rank vector? More generally, what if we would like to achieve such a personalization by biasing Page Rank towards some generic subset of pages from the current Web crawl we have? Many authors have used such biasing in their algorithms, yet none have studied the boundaries of this personalization, i.e., the characteristics the biasing set has to exhibit in order to obtain relevant results (rankings which are different enough from the non-biased Page Rank). We will investigate this in the current Section. Once these boundaries are defined, we will use them to evaluate (some of) the biasing sets available from ODP in Section 4.2.

First, let us establish a characteristic function for biasing sets, which we will use as the parameter determining the effectiveness of biasing. Pages in the World Wide Web can be characterized in quite a few ways.

The simplest of them is the out-degree (i.e., the total number of out-going links), based on the observation that if biasing is targeted at such a page, the newly achieved increase in Page Rank score will be passed forward to all its out-neighbors (the pages to which it points). A more sophisticated version of this measure is the hub value of pages. Hubs were initially defined in the context of the HITS algorithm and are pages pointing to many other high quality pages. Reciprocally, high quality pages pointed to by many hubs are called authorities. There are several algorithms for calculating this measure, the most common ones being HITS and its more stable improvements, SALSA and Randomized HITS. Yet biasing on better hub pages will have less influence on the rankings, because the vote a page gives is propagated to its out-neighbors divided by its out-degree. Moreover, there is also an intuitive reason against this measure: Page Rank biasing is usually performed to achieve some degree of personalization, and people tend to prefer highly valued


    authorities to highly valued hubs. Therefore, a more natural measure is an authority-based one,

    such as the non-biased Page Rank score of a page. Even though most of the biasing sets consist

    of high Page Rank pages, in order to make this analysis complete we have run our experiments

    on different choices for these sets, each of which must be tested with different sizes. For

comparison to Page Rank, we used two measures of similarity between the non-biased Page Rank and each resulting biased vector of ranks. They are defined as follows:

1. OSim indicates the degree of overlap between the top n elements of two ranked lists $\tau_1$ and $\tau_2$. It is defined as

$$OSim(\tau_1, \tau_2) = \frac{|Top_n(\tau_1) \cap Top_n(\tau_2)|}{n}$$

2. KSim is a variant of Kendall's $\tau$ distance measure. Unlike OSim, it measures the degree of agreement between the two ranked lists. If $U$ is the union of the items in $Top_n(\tau_1)$ and $Top_n(\tau_2)$, and $\delta_1$ is $U \setminus Top_n(\tau_1)$, then let $\tau_1'$ be the extension of $\tau_1$ containing the items of $\delta_1$ appearing after all items in $\tau_1$. Similarly, $\tau_2'$ is defined as an extension of $\tau_2$. Using these notations, KSim is defined as

$$KSim(\tau_1, \tau_2) = \frac{|\{(u,v) : \tau_1' \text{ and } \tau_2' \text{ agree on the order of } (u,v),\ u \neq v\}|}{|U| \cdot (|U| - 1)}$$

Even though prior work used n = 20, we chose n = 100, after experimenting with both values and obtaining more stable results with the latter. A general study of different similarity measures for ranked lists can be found in the literature.
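The two measures can be computed directly from these definitions. Below is a minimal sketch; the handling of pairs for which one list expresses no preference (both items fell outside its top n) is our own convention, since the definition leaves their mutual order unspecified.

```python
from itertools import combinations

def osim(tau1, tau2, n=100):
    """OSim: overlap between the top-n entries of two rankings, as a fraction of n."""
    return len(set(tau1[:n]) & set(tau2[:n])) / n

def ksim(tau1, tau2, n=100):
    """KSim: Kendall-style agreement over the union of the two top-n lists.

    Items of the union missing from one list are treated as appended after all
    of that list's ranked items. Pairs on which one list has no preference
    cannot disagree and are counted as agreements (a convention of this sketch).
    """
    top1, top2 = tau1[:n], tau2[:n]
    union = set(top1) | set(top2)
    rank1 = {item: i for i, item in enumerate(top1)}
    rank2 = {item: i for i, item in enumerate(top2)}
    agree = total = 0
    for u, v in combinations(union, 2):
        d1 = rank1.get(u, n) - rank1.get(v, n)
        d2 = rank2.get(u, n) - rank2.get(v, n)
        total += 1
        if d1 * d2 >= 0:  # same relative order, or no preference in one list
            agree += 1
    return agree / total if total else 1.0

# Example: two rankings of page ids, compared on their top 4 entries
print(osim([1, 2, 3, 4], [2, 1, 5, 6], n=4))  # 0.5
print(ksim([1, 2, 3, 4], [2, 1, 5, 6], n=4))
```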

Let us start by analyzing biasing on high quality pages (i.e., pages with a high Page Rank). We consider the most common such set to contain pages in the range [0, 10]% of the sorted list of Page Rank scores. We varied the sum of scores within this set between 0.00005% and 10% of the total sum over all pages (for simplicity, we will call this value TOT hereafter). For very small sets, the biasing produced an output only somewhat different from the original: about 38% Kendall similarity (see Figure 3). The same happened for large sets, especially those above 1% of TOT. Finally, the graph also makes clear where we would get the most different rankings from the non-biased ones.
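To make this setup concrete, the sketch below shows the two ingredients involved: computing TOT for a candidate biasing set (read here as the set's share of the non-biased Page Rank mass, as described above) and biasing Page Rank by concentrating the teleportation vector on that set. This is a generic power-iteration sketch under simplifying assumptions (uniform weights within the biasing set, no special treatment of dangling pages), not the exact implementation used in these experiments.

```python
def tot(biasing_set, pagerank):
    """TOT: percentage of the total non-biased Page Rank mass held by the biasing set."""
    total = sum(pagerank.values())
    return 100.0 * sum(pagerank.get(p, 0.0) for p in biasing_set) / total

def biased_pagerank(out_links, biasing_set, alpha=0.85, iterations=50):
    """Power iteration with the teleportation vector restricted to the biasing set."""
    pages = list(out_links)
    teleport = {p: (1.0 / len(biasing_set) if p in biasing_set else 0.0) for p in pages}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - alpha) * teleport[p] for p in pages}
        for p, targets in out_links.items():
            if targets:
                share = alpha * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] = new_rank.get(q, 0.0) + share
        rank = new_rank
    return rank

# Tiny illustrative graph, given as adjacency lists of out-links
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
biased = biased_pagerank(graph, biasing_set={"a"})
print(sorted(biased.items()))
```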


One might wish to bias only on the best pages (the top [0, 2]%, as in Figure 4). In this case, the above results would only be shifted slightly to the right on the x-axis of the graph,


i.e., the highest differences would be achieved for a set size from 0.02% to 0.75%. This was to be expected, as all the pages in the biasing set were already top ranked, and it would therefore take somewhat more effort to produce a different output with such a set. Another possible input set consists of randomly selected pages (Figure 5). Such a set most probably contains many low Page Rank pages. This is why, although the biased ranks are very different for low TOT values, they become extremely similar (up to almost identical) once TOT exceeds 0.01% (it would take a lot of low Page Rank pages to accumulate a TOT value of, for example, 1% of the overall sum of scores). The extreme case is to bias only on low Page Rank pages (Figure 6). In this case, the biasing set contains too many pages even sooner, around TOT = 0.001%. The last experiment is mostly theoretical: one would expect to obtain the smallest similarities to the non-biased rankings when using a biasing set from the [2, 5]% range, because these pages are already close to the top, and biasing on them would have the best chances to overturn the list. Experimental results support this intuition (Fig


The graphs above were initially generated based on a crawl of 3 million pages. Once all of them had been finalized, we selectively ran similar experiments on the Stanford WebBase crawl, obtaining similar results. For example, a biasing set of size TOT = 1% containing randomly selected pages produced rankings with a Kendall similarity of 0.622 to the non-biased ones, whereas a set of TOT = 0.0005% produced a similarity of only 0.137. This was necessary in order to show that the graphs discussed above are not influenced by the crawl size. Even so, the limits they establish are not entirely precise, because of the random or targeted-random selection (e.g., towards the top [0, 2]% pages) of our experimental biasing sets.

    Is biasing possible in the ODP context?

    The URLs collected in the Open Directory are manually added Web pages supposed to

    (1) cover the specific topic of the ODP tree leaf they belong to and


(2) be of high quality. Neither requirement is fully satisfied. Sometimes (though rarely) the pages do not really represent the topic under which they were added. More importantly for Page Rank biasing, they usually cover a large interval of page ranks, which made us opt for the random biasing model. However, we are aware that in this case the human editors chose far more high quality pages than low quality ones, and thus the conclusions of the analysis are somewhat susceptible to error. Generally, according to the random model of biasing, every set with TOT below 0.015% is suitable for biasing. According to this, all possible biasing sets analyzed in Tables 3, 4 and 5 would generate a sufficiently different Page Rank vector. We can therefore conclude that biasing is (most probably) possible on all subsets of the Stanford Open Directory crawl.
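As a small illustration of this criterion, a candidate biasing set (e.g., the URLs of one ODP category) could be checked against the 0.015% threshold roughly as follows; the scores and names are made up for the example.

```python
def good_for_biasing(candidate_urls, pagerank, threshold_percent=0.015):
    """True if the set's share of the total non-biased Page Rank mass
    (TOT, in percent) stays below the threshold from the random biasing model."""
    total = sum(pagerank.values())
    tot_percent = 100.0 * sum(pagerank.get(url, 0.0) for url in candidate_urls) / total
    return tot_percent < threshold_percent

scores = {"u1": 1e-7, "u2": 2e-7, "u3": 0.9, "u4": 0.1}  # fabricated example scores
print(good_for_biasing({"u1", "u2"}, scores))  # True
```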


    Web usage mining has been extensively used in order to analyze web log data. There exist

    various methods based on data mining algorithms and probabilistic models. The related literature

is very extensive and many of these approaches fall outside the scope of this paper; for more information, the reader may refer to the relevant surveys. There exist many approaches for discovering sequences of

    visits in a web site. Some of them are based on data mining techniques, whereas others use

probabilistic models, such as Markov models, in order to model the users' visits. Such approaches

    aim at identifying representative trends and browsing patterns describing the activity in a web

    site and can assist the web site administrators to redesign or customize the web site, or improve

    the performance of their systems. They do not, however, propose any methods for personalizing

    the web sites. There exist some approaches that use the aforementioned techniques in order to

    personalize a web site. Contrary to our approach, these approaches do not distinguish between

    different users or user groups in order to perform the personalization. Thus, the methods that

seem to be most relevant to ours, in terms of identifying different interest groups and personalizing

    the web site based on these profiles, are those that are based on collaborative filtering.

    Collaborative filtering systems are used for generating recommendations and have been broadly

used in e-commerce. Such systems are based on the assumption that users with common interests exhibit similar searching/browsing behavior. Thus, the identification of similar

    user profiles enables the filtering of relevant information and the generation of

    recommendations.

    Similar to such approaches, we also identify users with common interests and use this

    information to personalize the topic directory. In our work, however, we do not model the user

    profiles as vectors in order to find similar users. Instead, we use clustering to group users into

    interest groups. Moreover, we propose the use of sequential pattern mining in order to generate

recommendations. Thus, we also capture the sequential dependencies within users' visits,

    whereas this is not the case with collaborative filtering systems. All of the aforementioned

    approaches aim at personalizing generic web sites. Our approach focuses on the personalization

of a specific type of web site, namely topic directories. Since topic directories organize web

    content into meaningful categories, we can regard them as a form of digital library or portal. In

    this context, we also overview here some approaches for personalizing digital libraries and web


portals. Some early approaches were based on explicit user input, and the personalization services they provided were limited to simplified search functionalities or alerting services. Later approaches propose the semi-automatic generation of user recommendations based on implicit user input. In those approaches, information is extracted from user accesses to the digital library (DL) resources, and is then used for further

retrieval or filtering. As already mentioned, our approach does not limit its personalization services to identifying the preferences of each individual user alone. Rather, we identify user

    groups with common interests and behavior expressed by visits to certain categories and

    information resources. This is enabled by approaches that are based on collaborative filtering.

Those approaches, however, fail to capture the sequential dependencies between users' visits,

    as discussed previously.

    MODELLING TOPIC DIRECTORIES

    A topic directory is a hierarchical organization of thematic categories. Each category contains

resources (i.e., links to web pages). A category may have subcategories and/or related

    categories. Subcategories narrow the content of broad categories. Related categories contain

    similar resources, but they may exist in different places of the directory. Note that the related

    relationship is bidirectional, that is, if category N is related to M, then M is also related to N. A

resource cannot belong to more than one category. We consider a graph representation of topic directories.

Definition 3.1. A topic directory D is a labelled graph G(V,E), where V is the set of nodes and E the set of edges, such that: (a) each node in V corresponds to a category of D and is labelled by the category name; (b) for each pair of nodes (n,m) corresponding to categories (N,M), where N is a subcategory of M in D, there is a directed edge from m to n; and (c) for each pair of nodes (n,m) corresponding to categories (N,M), where N and M are related categories in D, there is a bidirected edge between n and m.

The graph G(V,E) may also have shortcuts, which are directed edges connecting nodes in V. Examples of such graphs are illustrated in Figure 4. The role of shortcuts as a means for personalizing the directory will be further discussed in Section 5.
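To make Definition 3.1 concrete, a topic directory graph with subcategory edges, bidirectional related-category edges and shortcut edges could be represented along the following lines; the class and method names are illustrative, not taken from the paper.

```python
from collections import defaultdict

class TopicDirectory:
    """Graph model of a topic directory in the spirit of Definition 3.1 (sketch)."""

    def __init__(self):
        self.subcategories = defaultdict(set)  # directed edges: parent -> children
        self.related = defaultdict(set)        # bidirected edges between categories
        self.shortcuts = defaultdict(set)      # directed personalization shortcuts
        self.resources = defaultdict(set)      # category -> resource links
                                               # (the one-category-per-resource rule is not enforced here)

    def add_subcategory(self, parent, child):
        self.subcategories[parent].add(child)

    def add_related(self, a, b):
        # the "related" relationship is bidirectional
        self.related[a].add(b)
        self.related[b].add(a)

    def add_shortcut(self, source, target):
        self.shortcuts[source].add(target)

    def add_resource(self, category, url):
        self.resources[category].add(url)

# Example fragment
d = TopicDirectory()
d.add_subcategory("Arts", "Arts/Music")
d.add_related("Arts/Music", "Shopping/Music")
d.add_shortcut("Arts", "Arts/Music/Styles")
d.add_resource("Arts/Music", "http://example.org/")
```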

The case study of the Open Directory Project. In our work, we use the Open Directory Project (ODP) as a case study. Figure 1 illustrates a part of the ODP directory.


In ODP, there are three types of categories: (a) subcategories (to narrow the content of broad categories), (b) relevant categories (i.e., the ones appearing inside the "see also" section), and (c) symbolic categories (denoted by the @ character after the category's name). Symbolic categories are subcategories that exist in different places of the directory. We consider relevant categories as related categories, according to Definition 3.1.

    Navigation patterns.

To represent the navigation behaviour of users when browsing the directory, we use the notion of navigation patterns. A navigation pattern is the sequence of categories visited by a user during a session. We note that such patterns may include multiple occurrences of the same categories; this can be the result of users going back and forth within a path in the directory. Finally, we also underline that during a session, a user may pursue more than one topic interest.
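For illustration, navigation patterns of this kind could be extracted from directory access logs roughly as follows; the log record layout and the session field are assumptions, since the actual log format is not shown here.

```python
from collections import defaultdict

# Each log record: (session id, timestamp, visited category) -- an assumed layout.
log = [
    ("s1", 1, "Arts"),
    ("s1", 2, "Arts/Music"),
    ("s1", 3, "Arts"),             # going back keeps the repeated category
    ("s2", 1, "Computers"),
    ("s2", 2, "Computers/Internet"),
]

def navigation_patterns(log):
    """Per-session sequences of visited categories, ordered by timestamp."""
    sessions = defaultdict(list)
    for session_id, timestamp, category in log:
        sessions[session_id].append((timestamp, category))
    return {
        session_id: [category for _, category in sorted(visits)]
        for session_id, visits in sessions.items()
    }

print(navigation_patterns(log))
# {'s1': ['Arts', 'Arts/Music', 'Arts'], 's2': ['Computers', 'Computers/Internet']}
```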