dynamic modeling byusage data for personalization systems

2009 13th International Conference Information Visualisation

Dynamic Modeling by Usage Data for Personalization Systems

Saeed R. Aghabozorgi', Teh Ying Wah2

Department of Information Science, Faculty of Computer Science and InformationTechnology, Unijersity of Malaya, Kuala Lumpur - Malaysia

[email protected]@um.edu.my

Abstract

With the extensive growth of data available on theInternet, personalization of this huge informationecomes essential. Although, there are various

techniques of personalization, in this paper weoncentrate on using data mining algorithms to

personalize web sites' usage data. This paperp_voposes an off-line model based web usagemining that is generated by clustering algorithm.T-hen, we will use users' transactions periodicallyto change the off-line model to a dynamic-model.This proposed approach will solve the problem ofthe decrease of accuracy in the off-line modelsover time resulted of new users joining or changeso] behaviour for existing users in model-basedapproaches. Finally, we discuss the on-line modelfor user behaviour prediction in the webpersonalization system.

1. IntroductionEvery day, millions of electronic pages, are added

on hundreds of millions pages that are already on-line.With this substantial growth of available data on theInternet and because of its rapid and chaotic growth,the World Wide Web has emerged into a network ofinformation without organizational structure.Furthermore, existence of abundant information in thenetwork and the dynamic and heterogeneous nature ofthe web, web exploration has become a difficultprocess for most of the users. Causing users feeldisoriented and sometimes lost in overloadedinformation that continues to expand. Besides, e-?usiness and web marketing are rapidly developing andImportance of anticipate the need of their customers isevident more than ever. Therefore, predicting theusers' interests in order to improve the usability of webor so called personalization ha become a nece sity.According to Mobasher [6], "Web per onalization canbe de cribed as any action that make the webexperienc of a u er personalized to the u er's ta te."

Usually, three type of data have to be managed in aweb site: content, tructure and log data. Content data

978-0-7695-3733-7/09 25.00 2009 IEEEDOl 10.1 I09/IV.2009. I 11

consist of whatever is in a web page; structure datarefer to the organization of the content; usage data arethe usage patterns of wei? sites. The application of thedata mining techniques to these different data sets is atthe basis of the three different research directions in thefield of web mining: web content mining, webstructure mining and web usage mining [1]. Srivastavaand et al. [4] define the web usage mining as "Webusage mining is the application of data miningtechniques to discover usage patterns from Web datain order to understand and better serve the needs of~eb-based applications." Note that web usage miningdiffers from web structure mining and web contentmining, in that web usage mining reflects thebehaviour of humans as they interact with the Internet.Analysis of user behaviour in interaction with web sitecan provide insights leading to customization andpersonalization of a user's web experience. Because ofthis, web usage mining is of intense interest for e-marketing and e-commerce professionals [7].

Most of the existing works in the context of webusage mining try to classify a user while she isbrowsing the web site or using user's registrationinformation [1].Our main goal in this paper is based onthis fact that in most web sites, it is not possible toperform an "on-line" modelling in term of quality(huge number of users or items). In addition, using astatic model constructed in "off-line" mode maygenerate inaccurate result because the interests of userschange over time and besides, new Hems and newusers will be added to web site frequently. In this paperwe present a model that is made in "off-line" modeand then we change it to a dynamic model to predictuser's interests in "online" mode. The novelty of ourapproach is that of using a 3 step recommendationsystem for personalization: The first stage is generationof an explicit model in "off-line" mode by data miningalgorithms. This model is created by all userbehaviours data collected during previous interactionswith web site. The second stage is done periodically toavoid remodelling of generated model. Because thebehaviour of users change over time and al 0, the totalnumber of u ers of web site i increa ed, we u e u er'transactions after generating "off-line" model to adju tthe model. Thi approach change th tatic model to a

450IEEE

computersociety

mailto:[email protected]

mailto:[email protected]

~ynamic model. The last phase is carried out in "real-time" as a new visitor begins an interaction with theWeb site. Data from the current user session is scoredusing the models generated offline, andrecommendations generated based on this scoring. A?eneral architecture for three step web mining isIllustrated in figure I.

Personalization SystemFig. 1 Summary of the architecture and steps in usage mining

So far we have provided an overview of web miningand web usage mining for personalization. Then weshowed the problem in standard techniques in~ersonalization and a general overview of our approachIn order to solve the problem.The. remaining of this paper is organized as follows: InsectIOn 2, we provide a detailed discussion of a host of?ata mining activities necessary for this process,Including the pre-processing and integration of datafrom multiple sources, transaction identification as datasource of web usage mining and a mathematical modelof p~ge-views in the process. In this Section wedescnbe the data available for mining in the Webdomain, specifically for the generation of user models.Then, we present the web usage mining concept an.d== logs processing architecture in section3. In thissectIOn the focus wi Il be on Web usage mining where~he goal is to use data collected from the userInteractions with the Web in order to learn user modelsand to use these models for personalization in t~efuture. We describe the clustering techniques used III

generating the model of user behaviour in duration of~ser access to web site. Then, we define an approachor achieve an aggregate usage profile to present theusage of website. In section 4, we improve the modelby changing the static model to a dynamic model. Inthe section 5 we show how a recommendation systemu .e the model for combining the discovered knowled~eWith the current status of a user's activity in a Web site

to ~rovide perso~alized content ~o a user. Finally,sectIOn 6 summanzes our conclusions and discussionon the current state and future direction of research inweb usage mining.

2. Data PreparationVariety types of data expressions could be used to

model the co-occurrence of Web user behaviours, suchas matrices, directed graphs and click sequences and soon. Different data expression models have differentmathematical and theoretical backgrounds. In thispaper, the user navigational behaviour is modelled by ausage matrix, in the form of transaction vectors.

2.1. Data SourceIn Web mining, data can be collected at the server-

side, client-side, proxy servers, or a consolidatedweb/business database. In [2], Srivastava et al. presenta more detailed description of these data sources. Tosummarize, (i) Web server logs explicitly recordsbrowsing behaviour of site visitors, (ii) Client-side datacollection can be implemented by using a remote agentor by modifying the source code of an existing browser(iii) and Web proxies act as an intermediate level ofcaching between client browsers and Web servers.

The primary data sources used in Web usage miningarea are user queries, registration data and the serverlog files which include Web server access logs andapplication server logs [3,8]. In this paper, Web serverlog file and domain knowledge of web sites is used asdata source. Web server access logs store datapertaining to access of the website. Each hit against theserver, corresponding to an HTTP request, generates asingle entry and stores it in web server access log.

2.2. Pre-processingThe data collected from different data sources are

stored as a data mart in a data warehouse (such as thosedescribed in [9]). The information provided by thesedata sources in data warehouse is used to constructseveral data abstractions, namely users, page-views,click-streams, and server sessions [3] .Analyses ofthese data will lead to extract model of user behaviourin interaction with the web site. Hence, beforeprocessing of the data we need to perform some pre-processing tasks. Pre-processing consists of convertingusage data contained in the various available datasources into the data abstractions necessary for patterndiscovery. In this section we describe pre-processingsteps performed on web log files including useridentification, page-view identification, transactionidentification, and data storage in a repository.

451

· Meanwhile, we use domain knowledge of web site orconcept of web site for abstraction and cleaning input· file. Each web-site, depending on its application,'provides information about one or more concepts. Forexample, amazon. com includes concepts such as

· Books, Music, Video, etc. The web pages within awebsite are categorized based on the concept(s) towhich they belong. A concept space, or simplyconcept, in a website is defined as the set of web pagesthat contain information about a certain concept [10].

~.3. User IdentificationIt is important to notice that the data collected by

server logs may not be entirely reliable because somepage views may be cached by the user's browser or bya proxy server [14].The data recorded in server logsreflects (possibly concurrent) the access of a Web siteb multiple users, and only the IP address, agent andserver side click-stream are available to identify usersa server sessions. In a Web server log, all requestsfrom a proxy server have the same identifier, eventhough the requests potentially represent more than oneuser. On the other hand, the Web server can also storeother kinds of usage information such as cookies,which are markers generated by the Web server forindividual client browsers to automatically track thesite visitors. But, a user may access the Web throughdifferent machines, or use more than one browser atone time. In practice, it is very difficult to uniquely andrepeatedly identify users. In this paper we identify eachuser through cookies, logins, or IP/agent analysis. Weassume the existence of a set of m users, U={uJ, U2,",urn} where each user is defined as a single individualthat is accessing files in Web servers (page-views)through a browser.

2.4. Page-view IdentificationPage-view is an aggregate representation of a

collection of Web objects contributing to the displayon a user's browser resulting from a single user action(such as a click-through) [11]. We consider each itemin data mining as a page-view in web usage miningthat in the simplest case, each HTML file ha a one-to-one correlation with an item. If each page-view isshown as Pi then we define universe of all pages in aweb site as P= {PI. P2 ... p.}.

2.5. Transaction IdentificationAfter definition of each u er, the click- tream for

each user must be divided into e ion. A erver-e ion (or visit) i the c1ick- tream ( equence of page-views) by a single user for a particular purpo e in a

single visit. The end of a server-session is defined asthe point when the user's browsing session at that sitehas ended [11]. But because it is not clear when theuser has left the Web site, a timeout is often used as thedefault method of breaking a user's click-stream intosessions. Various heuristics for defining sessionsespecially with anonymous users have been identifiedand studied in [14] and [15]. Also, the impact ofdifferent heuristics on various Web usage mining taskshas been analysed in [16]. Each session will be furtherabstracted by selecting. a subset of page-views in thesession that are significant or relevant for the analysistasks at hand. This process is called transactionidentification where we extract transactions correspondto each user from sessions when sessions in log file isdefined. A transaction is a set of entries in web accesslog accessed by a user from the same machine indefined time windows and in a single session [sal 0]. Inthis stage we omit intermediate pages in session that isaccessed by user for navigational reasons. It meansonly relevant page-views are included in thetransaction and if a page does not access at all wouldnot include in transaction. Every transaction is shownas an n-dimensional vector over the universe of itemsor especially in web mining as page-view vectors, i.e.Transaction ti is shown as

tj =< wept> t.), w(Pz, t.), ...,W(Pn' tj) > (I)

Which is a group of Web page accesses w(Pi,tD wherew represents weight of Pi in transaction t; The weightis determined by features of page-view such as pageduration, purchase a product, rate and etc. In this paperwe define transactions for each user to apply clusteringalgorithms and dynamically create meaningful clustersof users, based on underlying model of the user'sbrowsing behaviour.

2.6. Storing User Profiles in RepositoryIn this stage we will aggregate all transactions

pertaining to a user during prior interactions with websites to generate profile of each user. The profile ofeach user can be short-term profile based on only onevisit of user from site or aggregation of all transactionsthrough the all visit of user form website that is calledlong-term profile. Mobasher and et al [11] representthis profiles as an m*n matrix. They show the profilefor a user u E U as an n-dimensional vector of orderedpairs:

u(n)= «iJ, su(il», (i2, u(h»,"', (in, 'u(in»> (2)

where ij' E I and u i a function for u er u assigning(po ibly null) intere t core to item. In thi stage anagent i defined that for each valid it m (page-view)a ign corre ponding concept (category) ba ed on ite

452

structure information present on the page's URL orbased on the defined structure in content managementsy~tems or based on keywords included in page-views.I~ III our research we consider the significant pages fordIfferent page's type, the agent performs the definitionof more significant pages than others as well. In thisstage user profiles are stored in a data mart.Each page-view is associated to one category and auser accesses one or more items during a session.Profile data mart contains accesses of all usersaccurately included both navigation behaviour anddomain knowledge of web site as shown in conceptualmodel. Then, useful knowledge is extracted fromprofile of users from the data mart.

3. Modelling by data mining algorithmThe data in repository of user profiles is considered

~s a matrix that is shown as weighted collection ofItems. The weighted items in the matrix correspond tothe weighted pages-views in user transactionsconstructed in previous phase (preparation). In thismatrix each column is associated to a page-view andeach row represents a user profile. The weights in thematrix are determined in a number of ways, forex~mple, binary weights can be used to representeXIstence or non-existence of a product-purchase or adocument access in the user transactions. Additionally,the weights can be a function of the duration of the~ssociated page-view in order to capture the user'smteresi in a content page [12]. The weights may also,III .part, be based on domain-specific significancewe~ghts (for example navigational pages m~y beweIghted less heavily than content or product-onented~age-views). With this representation, the user profileIS considered as a page-view vector that each Item .int~e vector has an associated weight that represents ItSSIgnificance in the profile. .

At this stage, an explicit "off-line" model ~sgenerated by data mining algorithms. This model. IScreated by user behaviour data collected dUflngprevious interactions with web site. A number of datamining algorithms can be used for "off-line" model~Uildin? such as Clustering, Classification, Associationule DiScovery, Sequential pattern Discovery, Markov

mode.ls, hidden (Iatent)variable models. Each of th~sealgonthms generates a model that is used for analyslllgth .e access pattern of users. We use a clustenngalgorithm to partition users based on their profiles.Clustering is an undirected or unsupervised method,me~ning that the analyst need not define a targetvanable. Instead, the data mining algorithm searchesfor patterns and structure among the variables [7J.After applying clustering algorithm, each .c1u~terrepre ent a group of users with similar navigation

patterns. There are various algorithms which are usedfor clustering user profiles. For our approach we usetraditional k-mean algorithm for cluster our users'profiles. Standard clustering algorithms, such as k-means, generally partition this space into groups oftransactions that are close to each other based on ameasure of distance or similarity [12]. K-meanalgorithm is an algorithm to cluster N objects (datapoint) based on attributes into K partitions, that k<N. Itassumes that the object attributes form a vector space.The objective that it tries to achieve is to minimizetotal intra-cluster variance, or, the squared errorfunction [18]:

(3)

Where Un E C, is a vector representing the nth datapoint and in our context the user profile and there are kclusters Cj, i = 1, 2, ..., k, and Ilj is the centroid(geometric centroid of data points) or mean point of allthe points U, E C,

Implementation of this algorithm is included belowstages: 1) Selecting k point randomly as k centroid ofclusters 2)For each data point x, compute itsmembership in clusters by choosing the nearestcentroid 3)For each centroid, recomputed its locationbased on members

Clustering of user's profiles table is based on havingsomething in common in user transactions. Withapplying K-mean algorithm on user profiles a set ofclusters is achieved that each cluster C, includes asubset of the users with similar navigational patterns(previous behavior in interaction with site) or on theother word same transactions. In process of clustering,users' profile are clustered based on the similarity ofthe interest scores for page-views across all users, orbased on similarity of their content features. Eachcluster is included hundred of user profiles. On theother hand, each user profile is involving hundreds ofpage-view references and it should be reduced torepresent clusters into weighted collections of page-views. At first, we try to extract separate clusters withhigh confidence, because later we will use theaggregated of these clusters as centroid of clusters indynamic model. Then, we use profile aggregation as amethod for derivation of aggregated profiles fromclusters. As mentioned, because each cluster includestransactions involving many page-views, an aggregatedprofile is generated from each cluster. To aggregatetransactions we use concept of PACT approach [12](Profile Aggregations based on ClusteringTransactions). This approach is a technique analogousto concept indexing methods used to extract documentcluster summaries in information retrieval and filtering

453

[13]. But here we use aggregate of profiles instead ofall transactions in a cluster to generate aggregatedprofiles. The aggregate profiles are generated based oncentroid of each cluster. For each cluster C, ETc, thecentroid (m.) is computed by finding the ratio of thesum of the page-view weights across profiles in C, tothe total number of users in the cluster.

(4)

4. Dynamic data modelGenerating the user profile model from transactions

is the off-line part of modelling. Because new usersand new items increase and in addition, the currentusers' interest change over time and subsequently userswill do new transactions with the web site, we use anapproach to improve clusters by adjusting modelthrough new transactions. In this approach we considerthe behaviour of user since making offline model andcalculate the probability of assigning a user to otherclusters. To this aim, at first, interesting of each clusterabout each category in a predefmed interval iscalculated by counting the number of request sent toeach category from each cluster. Then we willcalculate the probability of interesting of each userabout each category by counting the number of requestfrom a user regarding the probability of cluster'sinteresting. Finally, we assign a user to a cluster thathas maximum probability.

The stage of attributing categories to clusters can bedone by considering the initial clusters achievedthrough the k-mean algorithm and by number oftransactions done by the users of each cluster with acategory domain after generating static model inoffline mode (time interval). Each category are relatedto the cluster with a probability value equals the ratioof the number of requests sent to the category fromusers of the cluster over all clusters. Although, thistype of making correlation is not completely accurate,but if the period of time be long and the number oftransactions be big enough, it can be reliable.Furthermore, there are some techniques for findingclusters that consider the content categories of pages.To shed light this issue, the frequent item set i anexample of finding relation among user cluster andcontent categories [17].

For calculating relation between clu ter andcategory Pj in ontology domain, U, is defined auniverse of regi tered u er and ,a initial clu terproduced by a clu tering algorithm. iven the totalnumber of u er belonging to each clu ter after theclu tering N" total number of reque t of U, in clu ter, for each category RUkJ .the reque t of clu ter ,for

each category RCij is total number of requests from thecluster divided by total number of users within thecluster: (it is normalized requests sent for eachcategory over total number of users within eachcluster)

(5)

A matrix of cluster-category represents correlationamong clusters and categories. Each cell representsprobability PCjj that 1S relation between category Pj andcluster C, is computed:

(6)

Then, we find the change of users' interests byconsidering transactions made after clustering. If therequests that user has sent through transactions forcontent categories is different with the categories thathis cluster belongs, the cluster of user will change. Inthe other word, the probability that a user belongs to acluster is computed by counting the number of timeseach user sends request for every category afterconstructing the offline model.

I RUkj ,PCijPUCk = _U--,k,_E_C,,_i _

I Ni(7)

The last action is assigning each user to the clusterthat user has maximum probability that is belong tothat. Therefore, if the majority of a user's request is fora category that has not a high correlation with hercluster, user will drop in another cluster. The modelthat is constructed, considers the changes in thebehavior of users after off-line modeling and do notneed to apply clustering algorithm on all data.Therefore, this model can be used as a dynamic modelwithout scalability shortcomings in traditional on-linemodels.

5. PredictionIn order to addre s the requirement of effective web

navigation, web sites provide personalizedrecommendations to the end u er . The main purposeof per onalization i prediction a t for the active u er.Thi et mu t be including the object (link,adverti ement , text, product, et ) that are arne aactive u er profile. Thi part of per onalization mu t bed ne in onlin m de. For thi me n, th ystem needth hi t ry of u er and tran action with web ite or atlea t the track of current i it.

4 4

For recommendation systems, different predictionalgorithms are used. For the registered users wecompare the centroid of the cluster that user belongs:vith the matrix of user-item generated to findinginterest items in repository to generaterecommendation. The items in the centroid vector thathave higher weight rather than the interest items in thematrix are interest items. For recommend toanonymous users, we will use nearest neighbour topredict items for users. To this aim, a matrix of user-item is generated to finding interest items for a user.The columns of this matrix is items of centroid vectorthat is most similarity with the user-session and thevalue of cells is the items that user has shown thatinterested in short session or long sessions. The interestitem for the user is the null value in matrix.. If the data collection procedures in the systeminclude the capability to track users across visits, thenthe recommendation set can represent a longer termview of potentially useful links based on the user'sactivity history within the site. If, on the other hand,profiles are derived from anonymous user sessionscontained in log files, then the recommendationsprovide a "short-term" view of user's navigationalhistory [12]. These recommended objects are added tothe last page in the active session accessed by the userbefore that page is sent to user's browser.

6. Conclusions. Generating an offline model using usage data as anIllformation source will dramatically improve therecommendation performance and the online responseefficiency in personalization systems. On the otherhand, because the behaviour of users changes overtime, the model must regenerated using all usage datathat in term of quality is hard. In this paper a newapproach for personalization system by web usagemining was presented. We applied a data mrnmgtechnique for generating an offline model of user'saccess pattern in visit websites. Then we made anaggregated user profile that can be used as maincomponent of usage-based recommender syst~m. forprediction and recommendation for personahzatlOn.The advantage of our approach is generating a modelIII off-line mode and then changes the model to adynamic model. Because the time consuming part ~fm~del1ing is done by offline mode, and the model ~sadjusted periodically, personalization system won. tneed regenerating of model over time. Alt~ou~h thisapproach consider new users and changing III mtere.stof u ers and solve this problem to a large extend, III

some parts it needs more works. For exampleconsidering user's demographic in clustering and

finding the best number of clusters III k-rneanclustering for our approach.

7. References[I] F. Zhang and H. Chang. "Research and development in web

usage mining system- key issues and proposed solutions: asurvey". In First IEEE Int. Conf on Machine Learning andCybernetics Proceedings, pages 986-990, Nov. 2002.

[2] J. Srivastava, R. Cooley, M. Deshpande, P.-N. Tan, WebUsage Mining: Discovery and Applications of Usage Patternsfrom Web Data, SIGKKD Explorations, 1(2), Jan 2000.

[3] WWW Committee Web Usage Characterization Activity,http://www.w3.org/WCA, Web Characterization Terminology& Definitions Sheet, W3C Working Draft, May 1999.

[4] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, andPang-Ning Tan, Web usage mining: discovery andapplications of usage patterns from web data, SIGKDDExplor., I(12), Jan. 2000.

[5] R. Cooley, B. Mobasher, and J. Srivastava. Data preparationfor mining World Wide Web browsing patterns. Journal ofKnowledge and Information Systems, (1)1, 1999.

[6] B. Mobasher, R. Cooley, and J. Srivastava. Automaticpersonalization based on web usage mining. Communicationsof the ACM, 43(8): 142--151,2000.

[7] Z. Markov, and D. T. Larose, Data mining the Web "uncovering patfe~ns in Web content, structure, and usage,Hoboken, N.J.: Wiley-Interscience/John Wiley & Sons, 2007.

[8] M. Baglionil, U. Ferrara2, A. Romeil, S. Ruggieril, and F.Turinil, Preprocessing and Mining Web Log Data for WebPersonalization, 2003

[9] M. Sweiger, M.R. Madsen, J. Langston, and H. Lombard.Clickstream Data Ware-housing. John Wiley & Sons, 2002.

[10] C. Shahabi, and F. Banaei-Kashani, "A Framework forEfficient and Anonymous Web Usage Mining Based onClient-Side Tracking," WEBKDD 2001 - Mining Web LogData Across All Customers Touch Points, pp. 141-155, 2002.

[II] P. Brusilovsky, A. Kobsa, and W. Nejdl, The adaptive web,'methods and strategies of web personalization, Berlin; NewYork: Springer, 2007.

[12] B. Mobasher, H. Dai, T. Luo et al., "Discovery and evaluationof aggregate usage profiles for Web personalization," DataMining & Knowledge Discovery, vol. 6, no. I, pp. 61-82,2002.G. Karypis, E-H. Han. Concept indexing: a fastdimensionality reduction algorithm with applications todocument retrieval and categorization. Technical Report #00-016, Department of Computer Science and Engineering,University of Minnesota, March 2000.Cooley, R., Mobasher, B., Srivastava, J.: Data preparation formining World Wide Web browsing patterns. Journal ofKnowledge and Information Systems 1(1) (1999) 5-32Spiliopoulou, M., Mobasher, B., Berendt, B., Nakagawa, M.:A framework for the evaluation of session reconstructionheuristics in web usage analysis. INFORMS Journal ofComputing - Special Issue on Mining Web-Based Data for E-Business Applications 15(2) (2003)Za'rane, O.R., Srivastava, 1., Spiliopoulou M. Masand Beds.: Proceedings of WEBKDD 2002 -Mining Web Dat~ fo;Discovering Usage Pal/ems and Profiles. Volume 2703 ofLNCS. Springer Berlin I Heidelberg (2003) 159-179P. Batista, and M. 1. Silva, "Mining Web Access Logs of anOn-line Newspaper."1. B. MacQueen (1967): "Some Methods for classification andAnalysis of Multivariate Observations", Proceedings of 5-thBerkeley Symposium on Mathematical Statistics andProbability, Berkeley, University of California Press, 1:281-297

[13]

[14]

[15]

[16]

[17]

[18]

455

http://www.w3.org/WCA,

dynamic modeling byusage data for personalization systems

Documents