smart miner: a new framework for mining large scale web usage data

Smart Miner: A New Framework forMining Large Scale Web Usage Data

∗

Murat Ali Bayir†

Department of ComputerScience and Engineering

University at Buffalo, SUNY14260, Buffalo, NY, USA

[email protected]

Ismail Hakki TorosluDepartment of ComputerEngineering, METU NCC,Kalkanli, Guzelyurt, TRNC,

Mersin, [email protected]

Ahmet CosarIntelligent Data Analysis

GroupDepartment of Computer

Engineering, METU06531, Ankara, Turkey

[email protected]

Guven FidanAGMLAB Information

TechnologiesCyberPark, Bilkent

06800, Ankara, [email protected]

ABSTRACT

In this paper, we propose a novel framework called Smart-Miner for web usage mining problem which uses link infor-mation for producing accurate user sessions and frequentnavigation patterns. Unlike the simple session concepts inthe time and navigation based approaches, where sessionsare sequences of web pages requested from the server orviewed in the browser, Smart Miner sessions are set of pathstraversed in the web graph that corresponds to users’ naviga-tions among web pages. We have modeled session construc-tion as a new graph problem and utilized a new algorithm,Smart-SRA, to solve this problem efficiently. For the pat-tern discovery phase, we have developed an efficient versionof the Apriori-All technique which uses the structure of webgraph to increase the performance. From the experimentsthat we have performed on both real and simulated data,we have observed that Smart-Miner produces at least 30%more accurate web usage patterns than other approachesincluding previous session construction methods. We havealso studied the effect of having the referrer information inthe web server logs to show that different versions of Smart-SRA produce similar results. Our another contribution isthat we have implemented distributed version of the SmartMiner framework by employing Map/Reduce Paradigm. Weconclude that we can efficiently process terabytes of webserver logs belonging to multiple web sites by our scalableframework.

∗This work is supported by The Scientific and TechnicalResearch Council of Turkey with industrial project grantTEYDEB 7070405.†corresponding author.

Copyright is held by the International World Wide Web Conference Com-mittee (IW3C2). Distribution of these papers is limited to classroom use,and personal use by others.WWW 2009, April 20–24, 2009, Madrid, Spain.ACM 978-1-60558-487-4/09/04.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications—data mining ; D.2.8 [Software Engineering]: Metrics—performance measures

General Terms

Algorithms, Design, Experimentation, Performance

Keywords

Web Usage Mining, Graph Mining, Map/Reduce, ParallelData Mining, Web User Modeling

1. INTRODUCTIONAs in classical data mining, the aim of web mining is to

discover and retrieve useful and interesting patterns fromlarge web dataset [7]. In the last fifteen years, the WWWbecome the largest information sources with its amazinggrowing rate. All of this huge data available in the WorldWide Web can be mined mainly in three different dimen-sions, which are web content mining, web structure miningand web usage mining. This paper is related to the web us-age mining which can be defined as the application of datamining techniques to web log data in order to discover useraccess patterns [9, 26]. Web usage mining has various ap-plications such as link prediction, site reorganization andweb personalization. The success of all of these applicationsis significantly related to the outcomes of web usage min-ing process which includes session construction and frequentnavigation pattern discovery phases.

Producing accurate user sessions and navigation patternsis not an easy task since http protocol is stateless and con-nectionless. Also, in reactive session construction [8, 9, 23],where it is not possible to generate web log information toidentify individual users (like cookies), all users behind aproxy server will have the same IP number and therefore,these users will be seen as a single client on the server side.These problems can be handled by proactive strategies [21,

WWW 2009 MADRID! Track: Data Mining / Session: Web Mining

161

24] by using cookies, which add client specific informationinto each page request by using dynamic server page codes.However, in order to employ proactive strategies, internalstructure of web pages must be changed either by insertingJavaScript codes (called page tagging) or by using dynamicserver pages codes to collect session data. Nowadays, manyweb site owners prefer page tagging method provided by athird party, which works by installing Java script codes intotheir web pages to collect access log data. This data is usu-ally stored and processed on the servers of the third party.

However, for various reasons, such as security and changesin the internal structure of web site, some site owners maynot want to use proactive approaches at all. Instead of that,these site owners prefer to process only their raw server logs,which contain access requests. We had received this kindof request from several customers of AGMLAB InformationTechnologies to process their huge web server logs for webusage mining and employ complex filters over navigationpatterns for path analysis. Therefore, as part of this workwe had focused on reactive approaches and proposed a newframework called Smart-Miner to meet this type of demands.

Smart-Miner is a novel framework including a new sessionconstruction technique and an efficient pattern discovery al-gorithm. Smart-Miner is also distributed framework whichemploys Map/Reduce Paradigm to process very large serverlogs. Our session construction algorithm, Smart-SRA is de-signed in such a way that it can even produce sessions by us-ing the web topology and the server logs containing only thebasic information, not even the referrer information (such asCommon Log format1), from server log files. Furthermore,we have employed Smart-Miner on more comprehensive logformats like Combined Log Format which also contains re-ferrer information and thus eliminates the need for the webtopology.

Regardless of different web server logs, Smart-Miner em-ploys two main steps to obtain frequent patterns from the logfiles. In the first phase, Smart-SRA algorithm is employedto generate user sessions containing set of paths traversedby users on the web graph. In the second phase, an efficientversion of Apriori-All method is used to discover frequentnavigation paths from user sessions. The reason for usingSmart-SRA framework is to produce more correct sessionsthat captures more realistic user behavior as well as pro-viding good coverage for navigated paths. In particular thefollowings are the contributions of this paper:

• We have implemented a full web usage mining frame-work, Smart Miner, as commercial software which is asub module of our Web Analytics Service and works ona distributed architecture. Currently final polishing ofWeb Analytics Service is underway, and tests are per-formed on it before the final deployment. During thedevelopment of Smart-Miner framework, several novelideas were also used, which are listed below.

• We propose a novel session construction method, calledSmart-SRA, which models session construction processas a graph problem and produce maximal paths tra-versed on the web graph efficiently. Furthermore wehave implemented two different version of Smart-SRAthat can generate sessions by using only referrer infor-mation or by using only web topology.

1http://www.w3.org/daemon/user/config/logging.html

• We introduce a strictly sequential version of Apriori-All algorithm which exploits the structure of web graphto improve the performance. In particular, our patterndiscovery method generates candidate patterns onlycorresponding to the paths on the web graph.

• For simulating real user behavior, we have developedan agent simulator which can be used for comparingaccuracies of different Web Usage Mining methods.Our agent simulator is based on the random surfermodel of the page-rank algorithm. Our syntactic webtopology also obeys power law distribution property ofweb graphs.

• For estimating the accuracy of alternative Web UsageMining methods, we have proposed geometric accuracymodel which considers both recall and precision. Thismodel enables us to find the best method in termsof the number of sessions captured correctly and thenumber of sessions generated by session constructionmethods to capture the correct sessions

• Using both simulated and real data, we have showedthat in terms of our accuracy measure the maximal fre-quent patterns discovered by Smart-Miner frameworkis at least 30% much better than the ones obtained byweb usage mining techniques utilizing previous sessionconstruction methods.

• We have also implemented distributed version of Smart-Miner using Map-Reduce framework which enables usto process huge web server logs of multiple sites simul-taneously. Our results show that we have a significantperformance improvement in the distributed version ofthe Smart-Miner. In particular, we have showed thatour run time performance increases linearly and ourframework can be scaled-up easily to process any sizeof data.

This paper is organized as follows. The next section isdedicated to the previous session construction methods andour new method Smart-SRA. Section 3 discusses patterndiscovery problem by introducing strictly sequential Apriori-All technique. After that, we introduce the agent simulatorwhich was used to evaluate different session constructionheuristics. The first subsection of the experimental resultsis dedicated to the description of the accuracy metric forcomparing different web usage mining methods. In the nexttwo subsections, the experimental results on simulated andreal data are presented. The fourth subsection presents ex-perimental results related to the impact of the referrer in-formation on real case study. The fifth subsection discussesthe map/reduce version of the Smart-Miner framework andscalability issues. Finally, we give our conclusions in section6.

2. SESSION CONSTRUCTION

2.1 Previous Heuristics and Related WorkBefore going into details of the previous work, it is better

to inform the reader with the more general definition of webuser session in the literature [8, 9, 12, 21, 23, 24].

Definition (Session): Session is a sequence of requestsmade by a single user with a unique IP address on a Web


162

site during a specified period of time. Each request item inthe session can be provided by either web server or cachesystems from local client or proxies.

The most basic session definition comes with Time Ori-ented Heuristics[8, 12, 23] which are based on time limi-tations on total session time or page-stay time. They aredivided into two categories with respect to the thresholdsthey use:

• In the first one, the duration of a session is limitedwith a predefined upper bound, which is usually ac-cepted as 30 minutes according to [5]. In this type,a new page can be appended to the current session ifthe time difference with the first page doesn’t violatetotal session duration time. Otherwise, a new sessionis assumed to start with the new page request.

• In the second time-oriented heuristic, the time spenton any page is limited with a threshold. This thresh-old value is accepted as 10 minutes according to [5]. Ifthe timestamps of two consecutively accessed pages isgreater than the threshold, the current session is ter-minated after the former page and a new session startswith the latter page.

Navigation-Oriented approach [8, 9] uses the web graph[11, 17] constructed by using hyperlinks among web pages.In this approach, it is not necessary to have a hyperlink be-tween every two consecutive web page requests. Let P =[P1,P2, . . . , Pk, Pk+1, . . . , Pn] be a session containing web pageswith respect to their timestamp orders. In this session, forevery page Pk, except the initial page P1, there must be atleast one page Pj in the session which is referring to Pk andhas a smaller timestamp than Pk. If there are several pagesreferring to Pk with smaller timestamps, then, among thesepages, the one with the largest timestamp is assumed to bethe page visited before Pk is viewed. Therefore, during ses-sion construction, backward browser movements up to theclosest page referring to Pk are appended to the session.

More formally, if P =[P1, P2, . . . , Pk, Pk+1, . . . , Pn] are pagesforming a session, then, ∀k satisfying 1 < k ≤ n, ∃j suchthat, T (Pk) > T (Pj) and Pj refers to Pk. During the con-struction of a new session, if P =[P1, P2, . . . , Pk, Pk+1, . . . , Pn]is the current session and Pn+1 is the next page request,then, the page Pn+1 can be appended to this session as fol-lows:

• If the Link(Pn, Pn+1) = true then the new sessionbecomes: P ∗ = [P1, P2, . . . , Pk, Pk+1, . . . , Pn].

• If the Link(Pn, Pn+1) = false then

– If ∃Pk∈P satisfying following properties such that:

∗ Link(Pk, Pn+1) = true, and

∗ (∀Px ∈P and T (Px) > T (Pk)) ⇒ Link(Px

, Pn+1) = false, that means If Pk is the near-est (with the largest timestamp smaller thanthe timestamp of Pn+1) page that refers toPn+1, then the new session becomes P ∗ =[P1,P2, . . . , Pk, Pk+1, . . . , Pn, Pn−1, Pn−2, . . . , Pk,Pn+1]

– If there ∄Pk ∈P satisfying the properties above,then Pn+1 becomes the first page of a new session.

Although the session construction mentioned above arenot new, most of the recent applications [3, 13, 14, 15, 16,18, 22] use them during session generation. The problemwith these three methods is that the application side sufferfrom low performance due to the incorrect set of user ses-sions generated. It is very important to produce high qualitysessions in session construction since the quality of the usersessions effects the quality of the frequent patterns discov-ered in Web Usage Mining process. Our main goal here is toproduce high quality sessions, so that it leads to more cor-rect frequent patterns that will increase the performance inapplications like recommendation systems, personalization,web user clustering and semantic web usage mining. In thenext subsection, we explain our new session constructionmethod in detail.

2.2 Smart-SRASmart-SRA stands for Smart Session construction Algo-

rithm which is designed to overcome deficiencies of the timeand the navigation oriented heuristics. As it is given in themore general session definition, the session concept is alsoa sequence of page requests in Smart-SRA. Moreover, theultimate aim of web usage mining is to determine frequentuser access paths. Thus, session construction from serverlogs is an intermediate step. In order to determine frequentuser access paths, potential paths should be captured in theuser sessions. Therefore, rather than constructing just userrequest sequences from server logs, Smart-SRA uses a novelapproach to construct user session as a set of paths in theweb graph where each path corresponds to users’ navigationsamong web pages. That is, in Smart-SRA, server request logsequences are processed to reconstruct web user session notas a sequence of page requests, but, as a set of valid naviga-tion paths. The definition of the Smart-SRA session is givenbelow:

Definition (Smart-SRA Session): S = {S1, S2, . . . , Sn}is a session constructed by Smart-SRA having n paths, suchthat each Sx = [P1, . . . , Pi, P i + 1, . . . , Pn]∈S, satisfies thefollowing conditions:

• Timestamp Ordering Rule:

– ∀i : 1 ≤i < n, T (Pi) < T (Pi + 1)

– ∀i : 1 ≤i < n, T (Pi+1) − T (Pi) < σ(page staytime)

– T (Pn) − T (P1) < δ(session duration time)

• Topology Rule:

– ∀i : Link(Pi, Pi+1) = true

• Maximality Rule:

– ∀Sx ∈S ∄Sy such that Sy ∈S and Sx ⊂Sy

The timestamp ordering condition simply represents thestandard user session definition, so that each path in thesession constructed by Smart-SRA satisfies the requirementsof a session. The topology condition is introduced to forceeach user navigation path to correspond to a path in theweb site graph. Even though a user may perform actions likegoing to a previously visited page by using the back button ofthe browser or follow a link from a previously visited and stillretained page, an ultimate goal we are interested in frequentuser access paths, especially access sequence from web server


163

during navigation on web graph. In addition, Smart-SRAonly produces maximal paths in the session set generated.The motivation behind maximal paths are the coverage ofpossible paths they provided. We proved with maximalityproperty that all possible paths that can be generated fromuser access sequence must be subset of at least one maximalpath.

As it is clearly seen from the above rules, Smart-SRAuses three important properties which must be satisfied byall paths of a session generated by Smart-SRA. It is impor-tant that Smart-SRA eliminates the need for inserting thebackward browser movements of navigation-oriented heuris-tic and it preserves the timestamp order of web pages. Thetwo main phases of Smart-SRA are explained below:

• In the first phase, the access data stream of web usersare partitioned into shorter page request sequencescalled candidate sessions, by using session durationtime and page-stay time rules. The page request se-quences obtained in this manner correspond to sessionsobtained by using both of the time-oriented heuristicsmentioned above.

• In the second phase, candidate sessions are dividedinto maximal sub-sessions such that for each consecu-tive page pair in the sequence there exists a link fromprevious one to latter one. At the same time, page staytime rule for consecutive pages is also satisfied. Sincethe total session duration time is checked in the firstphase, there is no need to re-check it. However, pagestay time criteria must be checked again since the con-secutive page pairs cannot match in the sub-sessionswith candidate session obtained after first phase. Thesecond phase also adds topology rule and eliminatesthe need for inserting backward browser movements.This is achieved by repeating the following steps untilall pages in a candidate session obtained after the firstphase have been processed:

P13 P1

P49

P20 P23

P34

Figure 1: An example web site topology graph

The pseudocode for the second phase of Smart-SRA isgiven in Algorithm 1. In phase-2 of Smart-SRA, only max-imal sequences are kept through the iterations. Also, onlyrequired sessions are propagated, and thus, there is no re-dundant session construction. Consider the Table 1, thatshows a sample web page requests sequence of a web userobtained after the first phase of the algorithm, and Figure 1representing the web topology graph. The application of theinner loop (while loop) of the second phase of the sessionconstruction algorithm is given in Table 2. For this exam-ple, when the algorithm is executed, it discovers the three

maximal sessions showed in 4th iteration of NewSessionSetsatisfying three conditions described in 2.2.

Algorithm 1 Second Phase of Smart-SRA

1: ForEach CandSession in CandidateSessionSet2: NewSessionSet := {}3: while CandSession 6= []4: TSessionSet := {}5: TPageSet := {}6: ForEach Pagei in CandSession7: StartPageF lag := TRUE8: ForEach Pagej in CandSession with j > i9: If (Link[Pagei, Pagej ] = true) and

(TimeDiff(Pagej , Pagei) ≤ σ) Then10: StartPageF lag := FALSE11: End For12: If StartPageFlag = TRUE Then13: TPageSet := TPageSet ∪ {Pagei}14: // Remove the selected pages from the current seq.15: CandSession := CandSession − TPageSet16: If NewSessionSet = {} Then17: ForEach Pagei in TPageSet18: TSessionSet := TSessionSet ∪ {[Pagei]}19: Else20: ForEach Pagei in TPageSet21: ForEach Sessionj in NewSessionSet22: If (Link[Last(Sessionj), Pagei] = true) and

(TimeDiff(Last(Sessionj), Pagei) ≤ σ) Then23: TSession := Sessionj

24: TSession.mark := UNEXTENDED25: TSession := TSession • Pagei // Append26: TSessionSet := TSessionSet ∪ {TSession}27: Sessionj .mark := EXTENDED28: End If29: End For30: End For31: End If32: ForEach Sessionj in NewSessionSet33: If Sessionj .mark 6= EXTENDED Then34: TSessionSet := TSessionSet ∪ {Sessionj}35: End If36: End For37: NewSessionSet := TSessionSet38: End While39: End For

Table 1: Example Web Page Request SequencePage P1 P20 P13 P49 P34 P23

TimeStamp 0 6 9 12 14 15

2.3 Extension to Referrer CaseWe have also extended Smart-SRA to work with other

log formats which has referrer information. In order touse this information, Smart-SRA keeps the id of referrerwith the current page id in the candidate session. The ex-tension to referrer case is very easy since we only need tochange the topology check operation in the 9th and 22nd

lines of the Algorithm 1. Instead of checking link existenceif(Link[Pagej , Pagei] = true), referrer check if(Pagei.


164

Table 2: Evaluation of Example Session by Smart-SRA

Iteration 1 2

CandidateSession[P1, P20, P13, [P20, P13, P49,P49, P34, P23] P34, P23]

TempPageSet {P1} {P20, P13}

NewSessionSet[P1, P20]

[P1] [P1, P13]Iteration 3 4

CandidateSession [P49, P34, P23] [P23]TempPageSet {P49, P34} {P23}

NewSessionSet[P1, P13, P34] [P1, P13, P34, P23][P1, P13, P49] [P1, P13, P49, P23]

[P1, P20] [P1, P20, P23]

referrer = Pagej) is used in this new version of Smart-SRA. Obviously referrer case produce less number of ses-sions than original Smart-SRA since each page has exactlyone referrer. However, for original Smart-SRA, we have con-sidered restricted topology containing only pages within thecandidate session instead of using the whole topology. Thisproperty enables original version of Smart-SRA to producesmall number of sessions as the referrer based version also.In the experimental results section, we will show that thereis no significant difference in terms of the number of gener-ated sessions for both versions.

3. DISCOVERING PATTERNSSequential pattern discovery is the next phase of the Web

Usage Mining. In this phase, frequent access patterns aredetermined from reconstructed sessions. There are severalalgorithms in the literature for the sequential pattern min-ing, such as GSP [25], SPADE [28] and PrefixSpan [20] etc.Although it is not the most recent or the most efficient one,we have used a modified version of the AprioriAll [1] tech-nique. AprioriAll is very suitable for our problem since wecan make it very efficient by pruning most of the candi-date sequences generated at each iteration step of the al-gorithm. This pruning can be done because of the topo-logical constraint requirement mentioned above, that is forevery subsequent pair of pages in a sequence the former onemust have a hyperlink to the latter one. We will call thisnew version of AprioriAll as Sequential Apriori Algorithm.In particular, we prune majority of the possible sequencesin the search space during candidate sequence generation.Therefore, our focus is on the quality of discovered web us-age patterns rather than performance of sequential patternmining methods. Another different constraint in our domainis that a string matching constraint should be satisfied be-tween two sequences in order to have support relation. Toillustrate; the sequence < 1, 2, 3 > does not support < 1, 3 >although 3 comes after 1 in both of them. However, sequence< 1, 3, 2 > supports < 1, 3 >. A session S supports a patternP if and only if P is a subsequence of S not violating stringmatching constraint. We call all the sessions supporting apattern as its support set.

Sequential AprioriAll Algorithm (Algorithm 2): Inthe beginning, each page with sufficient support forms alength-1 supported pattern. Then, in the main step, for eachk value greater than 1 and up to the maximum reconstructedsession length, supported patterns (patterns satisfying the

Algorithm 2 Sequential Apriori

1: Input: Minimum support frequency: δ, ReconstructedSessions: S

2: Topology information as matrix: Link, The Set of allWeb Pages: P

3: Output: Set of maximal frequent patterns: Max4: Procedure sequentialApriori (δ, S, Link, P)5: L1 := {} // Set of frequent length-1 patterns6: For i:=1 to |P | do7: L1 := L1 ∪ [Pi] | If Support([Pi],S) > δ8: For k = 1 to N − 1 do9: If Lk = then

10: Halt11: Else12: Lk+1 := {}13: For Each Ii ∈ Lk

14: For Each Pj ∈ P15: If Link[Last(Ii), Pj ] = true then16: T := Ii • Pj // Append Pj to Ii

17: If Support(T, S) > δ then18: T.maximal := TRUE19: Ii.maximal := FALSE // since extended20: V := [T2, T3,. . . , T|T |] // drop first element21: if V ∈ Lk then22: V.maximal := FALSE23: Lk+1 := Lk+1 ∪ {T}24: End For25: End For26: End If27: End For28: Max := {}29: For k := 1 to N − 1 do30: Max := Max ∪ {S | S ∈ Lk and S.maximal =

true }31: End For32: End Procedure

support condition) with length k+1 are constructed by usingthe supported patterns with length k and length 1 as follows:

• If the last page of the length-k pattern has a link tothe page of the length-1 pattern, then by appendingthat page length-k+1 candidate pattern is generated.

• If the support of the length-k+1 pattern is greater thanthe required support, it becomes a supported pattern.In addition, the new length-k+1 pattern becomes max-imal, and the extended length-k pattern and the ap-pended length-1 pattern become non-maximal.

• If the length-k pattern obtained from the new length-(k+1) pattern by dropping its first element was markedas maximal in the previous iteration, it also becomesnon-maximal.

• At some k value, if no new supported pattern is con-structed, the iteration halts.

Notice that in the sequential apriori algorithm, the pat-terns with length-k are joined with the patterns with length-1 by considering the topology rule. This step significantlyeliminates many unnecessary candidate patterns before evencalculating their supports, and thus increases the perfor-mance drastically. In addition, since the definition of the


165

support automatically controls the timestamp ordering rulewith the sub-session check, all discovered patterns will sat-isfy both the topology and the timestamp rules, which arevery important in web usage mining.

Support(I, S) =|{Si|∀i I is substring of Si}|

|S|(1)

An auxiliary function Support(I, S) determines whethera given pattern (I) has sufficient support from the givenset of reconstructed user sessions (S). Support of a patternI is defined as a ratio between the numbers of constructedsessions supporting the pattern I, the number of all sessions.Clearly, the support for each candidate pattern at step-ncan be calculated only by one scan through the transactiondatabase by keeping candidate sessions in hashmap.

4. AGENT SIMULATOROur agent simulator first randomly generates a typical

web page topology which preserves power law property ofweb graphs as it is stated for graph generators [6]. Afterthat, it simulates a user agent that accesses this domainfrom its client site and navigates (randomly) in this domainlike a random surfer in pagerank [19] by using the parame-ters described below. In this way, we will have full knowl-edge about the sessions beforehand, and later we can useany heuristic to process user access log data to discover thesessions. Then, we can evaluate how successful that heuris-tic was in reconstructing the known sessions. While gen-erating a session, our agent simulator eliminates web usernavigations provided via a client’s local cache. Since thesimulator knows both the referrer of each page (i.e., fromwhich page a hyperlink is used to access to this page) andthe full navigation history at the client side, it can determinenavigation requests that are served by the web server, andthose served from the client/proxy cache. Each heuristic isexecuted using the server side log file as input, and producesits own sessions. After all of the sessions are discovered bythe heuristics, sequential Apriori-All technique is applied onthese sessions to determine frequent navigation paths.

The following parameters are used for simulating naviga-tion behaviors of a web user:

Session Termination Probability (STP): STP is theprobability of terminating session.

Link from Previous pages Probability (LPP): LPPis the probability of referring next page from one of the previ-ously accessed pages except the most recently accessed one.This parameter is used to allow the generation of backwardmovements from browser.

Link from Current page Probability (LPC): LPC isthe probability of referring next page from the most recentlyvisited page.

New Initial page Probability (NIP): NIP representsthe probability of selecting one of the starting pages of a website during the navigation, thus starting a new page requestsequence in the same domain. This behavior also representsbrowser actions to view new page without using links on theweb pages like using bookmarks or writing new address tobar.

Our user model is similar to the random surfer model ofthe page rank algorithm [19]. In the original page rank algo-rithm, the web user terminates surfing with closing his/herbrowser or writing a new page URL to the address bar of

browser with a damping factor, d (probability value between0 and 1). Also, with probability 1−d the user continues nav-igation by randomly clicking links on the current page. Ouragent simulator generates web user requests such that withprobability (1− d) a web user can select a link from currentpage (LCP) or press the back button (LPP). If this conditionhappens, the web user selects the next page from the linksof the previous pages or from the links of the current pageaccording to LPP and LPC values. Also with the probabil-ity d, web user can terminate the session (STP) or jump toa new start page (NIP) by typing its URL into the addressbar of the browser. Corresponding actions are simulatedwith probabilities STP and NIP.

5. EXPERIMENTAL RESULTSThe first subsection is dedicated to the introduction of

the accuracy metric for comparing different web usage min-ing methods. In the next two subsections, we compare theaccuracies of the maximal frequent patterns obtained fromsessions of Smart-SRA, navigation oriented and time ori-ented heuristics. In the fourth subsection, we have com-pared the referrer based version and the original version ofSmart-SRA on web server logs of ceng.metu.edu.tr domain(Department of Computer Engineering at Middle East Tech-nical University) which has more than 1500 unique sessionsand nearly 5K web page requests every day. In the last sub-section, we discuss distributed processing of web usage dataon map/reduce framework.

5.1 Accuracy MetricIn the experiments mentioned in section 5.2 and 5.3, we

have compared the most popular 3 heuristics with the Smart-SRA with respect to their accuracy performance. As it ismentioned in previous sections, sessions are processed bythe sequential apriori algorithm after they are constructedby one of these four heuristics. Sequential apriori techniqueis also applied to original sessions generated by the agentsimulator. Since, we know the frequent maximal patterns ofthe agent simulator (MPA), which correspond to the correctfrequent patterns, we can determine the accuracies of differ-ent heuristics (AH stands for the accuracy of a heuristic H)by using the maximal frequent patterns generated by theseheuristics (MPH) as follows:

RECH =|MPA ∩ MPH |

|MPA|PREH =

|MPA ∩ MPH |

|MPH |(2)

AH =√

(RECH ∗ PREH) (3)

Here RECH is the recall for heuristic H namely the ratioof the number of correct patterns captured by heuristic Hover the number of all correct maximal patterns. PREH isthe precision for heuristic H namely the ratio of the numberof correct patterns captured by heuristics H over the numberof all maximal patterns generated by heuristic H. We havedefined accuracy AH of heuristic H as a geometric mean ofrecall and precision for that heuristic.

In order to balance the trade off between recall and pre-cision, it is possible to define arithmetic mean or any lin-ear function of these two variables as our accuracy metric.Since arithmetic mean is defined as (RECH + PREH)/2,it may still produce incorrectly high accuracies wheneverone of the RECH or PREH is very large and the other one


166

0

0,2

0,4

0,6

0,8

1

0,1 0,2 0,5 1 2 5 10

Accura

cy

NIP/STP

Accuracy vs NIP/STP

NO

SSRA

TO

(a) Accuracy vs NIP/STP

0

0,2

0,4

0,6

0,8

1

0,1 0,2 0,5 1 2 5 10

Accura

cy

LPP/LPC

Accuracy vs LPP/LPC

NO

SSRA

TO

(b) Accuracy vs LPP/LPC

0

10

20

30

40

50

60

70

80

Avera

ge A

ccura

cy

Methods

Methods vs Average Accuracy

NO

SSRA

TO

(c) Methods vs Avg. Accuracy

Figure 2: Experiments on Simulated Data

is very small. We are seeking for accuracy to be high ifand only if both of them are high. The accuracy shouldbe small even if one of these values is very large and theother one is very small. In general, due to the problemsmentioned in arithmetic mean, any weighted linear func-tions using d ∗ PREH + (1 − d) ∗ RECH(d ∈ [0, 1]) alsofail. The geometric mean function [2, 4] solves all problemsmentioned above. For obtaining high accuracy both preci-sion and recall must be high. Therefore, we decided to usegeometric mean as our accuracy metric thought the rest ofthis paper.

5.2 Accuracy Comparison on Simulated DataIn the first set of experiments, we have used agent simu-

lator to produce syntactic data. The agent simulator firstcreates a random web topology with the parameters givenin Table 3. There are works in the literature related withthese parameters, and the average number of web pages ina web site is reported as 441 [27]. Also, the average numberof out degrees of any web page is determined as 7.2 in [17].Therefore, we decided to choose the ranges of parameters asin Table 3.

Table 3: Web Site and User ParametersParameter Range

Number of web pages (nodes) in topology [10, 1000]Number of Users [1000, 10000]

After the initializing steps of simulation mentioned above,we have studied the accuracy of alternative web usage min-ing processes (which contains the session construction phasefollowed by application of the sequential version of Apriori-All algorithm for frequent patter discovery). In this work, wecompare the accuracies of the three web usage mining pro-cesses employing Smart-SRA, Navigation Oriented Heuris-tic and Time Oriented Heuristic using total session durationtime threshold. We didn’t plot the results for Time OrientedHeuristics using page stay time threshold since its accuracyis very close to Time Oriented Heuristics using total sessionduration time.

The main parameters used in our experiments are twoprobabilistic behavior metrics, namely the ratios (LPP/LPC)and (NIP/STP ). We used 7 different values of these twoparameters, namely the values 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, and10.0 That is, 0.1 for (LPP/LPC) means, the probability ofLink from Current Page behavior is ten times as the proba-

bility of Link From Previous Page behavior. Similarly, 10 for(LPP/LPC) means, the probability of LPP behavior is tentimes as the probability of LPC behavior. Thus, with these7 values we cover different distributions of LPP/LPC ratios.Since we have the same values for NIP/STP ratio also, intotal we generate 49 different combinations of all of these4 parameters. For each one of these 49 different cases, ouragent simulator is executed 10 times with randomly chosennumber of users, out degrees of web pages, and the numberof web page parameters from the ranges in Table 3. Alsofor each test run of the agent simulator, we have executedpattern discovery phase with the 5 different support valuesfrom 0.001 to 0.010 (which are 0.001, 0.0025, 0.005, 0.0075,and 0.01). Thus, for each of the 49 different combinationsof (NIP/STP, LPP/LPC), we have 50 different test runs(10 random different runs of agent simulator times 5 runsfor each support values).

After that we take the average of 50 runs for each of the(NIP/STP, LPP/LPC) combination. As a result, we ob-tain single accuracy value for each of the 49 combinationswhich can be represented as a 7x7 matrix having varyingLPP/LPC values on rows and varying NIP/STP valueson columns. In order to provide complete visualization ofall possible parameters, we plot the accuracy results of ourexperiments on syntactic data in 3 different views. In Fig-ure 2(a), we plot the average accuracy of the 7 rows whichcorresponds to average accuracies with respect to varyingNIP/STP values. Figure 2(b) contains the reverse resultof the former one, since in this one we take the average ofeach column and plot the accuracies with respect to varyingLPP/LPC values. The last figure (Figure 2(c)) shows theaverage of all 49 values for each session construction methodin terms of accuracy, recall and precision values.

On the average, the maximal patterns obtained from Smart-SRAs sessions are at least 30% more accurate than pat-terns obtained from sessions of Time and Navigation Ori-ented Heuristics. Another observation is that increasingNIP/STP and LPP/LPC decreases the accuracy of allheuristics. This is due to the fact that LPP and NIP causemore complex behaviors in the user sessions than STP andLPC. Remember that LPP is the probability referring nextpage from one of the previously accessed pages except themost recently accessed one. High LPP means that web userstend to choose the next page by using links from previouslyaccessed web pages. Clearly this behavior is more difficultto predict than choosing next page from the most recently


167

0

10

20

30

40

50

60

70

80

90

100

Recall Precision Accuracy

Valu

es

Accuracy Metrics vs Values

NO

SSRA

TO

(a) Accuracy on Real Data

0

20

40

60

80

100

120

10K 50K 100K 250K 500K 750K 1M

Ru

n T

ime

(se

c)

Input size

Input Size vs Run Time (seconds)

Smart-SRA (Org) Smart-SRA (Ref)

(b) Input Size vs RunTime

0

10

20

30

40

50

60

2 3 4 >=5

Ratio

Sequence Length

Sequence Length vs Ratio

Org

Ref

(c) Sequence Length vs Ratio

Figure 3: Experiments on Real Data

accessed page since in the latter case, the order is the samein the web server logs. The same argument also appliesto NIP, since increasing NIP leads to web users to triggernew navigation sequence by writing new URL into the ad-dress bar or using external links from different sources suchas search engines or bookmarks. However, increasing STPleads to web users to generate simple sessions that are termi-nated early with less number of pages. This also decreasesthe probability of having NIP and LPP conditions whichenables heuristics to capture sessions easily.

5.3 Accuracy Comparision on Real DataWe have also evaluated the accuracy performance of both

of the heuristics on real data captured from AGMLAB’scompany web site (www.agmlab.com). We tracked the ac-tions of web users between March 2008 and June 2008. Inthis period of time, the number of visitors is 3801 which iscalculated by using a 30 min session time-out. The web sitefor our case study contains 10 web pages and the link graphis densely connected.

For tracking client side actions of web users, the internalstructure of web pages are changed by inserting an actiontracking program into them. This program captures all ac-tions of web users by using browsers on the client-side. Theseactions are stored as a text in a local cookie. When the userperforms any page view action via browser, the informationin local cookies is sent to server. In the server side, the cookieinformation is recorded to a log file by requested dynamicserver page. By this way, we capture all of the informationabout the user on both the client and the server sides.

The information in our cookies represents all actions ofusers on the client side. Also, in the sequence extracted fromthe cookie, the first instance of each requested page standsfor access log information in the server side. To illustrate; ifthe cookie contains sequential page view as [1, 2, 6, 3, 2, 1,5, 3, 2] it is easily seen that bold ones are the first instancesof requested web pages and these are provided by the webserver and recorded to the access log file. So, the log filecontains the sequence [1, 2, 6, 3, 5]. The access logs filesgenerated from cookies in this manner is used as an inputto each heuristic.

We have also extracted the real session sequences from thecookie information. For the example mentioned above, thesecond visit of page 2 ([1, 2, 6, 3, 2, 1, 5, 3, 2]) is providedvia browser cache, so our first sequence becomes [1, 2, 6, 3].The first instance of page 5 comes after initial sequence [1,2, 6, 3] and it is the page highlighted as bold ([1, 2, 6, 3,

2, 1, 5, 3, 2]). Since the referrer of page 5 is page 1 in thecookies, the second sequence becomes [1, 5]. The browserprovides page 1 from the local cache. Since page 1 is the firstpage of the sequence, there is no referrer of page 1 when it isrequested from the server. Notice that the agent simulatorproduces sessions having forward references which are usedfor requesting web pages from the server. The user clicks alink on page 1 and requests page 5 from the server. However,this is not same for [2, 1] shown as bold, ([1, 2, 6, 3, 2, 1, 5,3, 2]). Finally, our real session contains two sequences {[1, 2,6, 3], [1, 5]}. After all real sessions are extracted; maximalpatterns are discovered by using Sequential Apriori method.After that, we evaluate the accuracy for each heuristic byusing the methods described in the previous section.

For the evaluation of maximal patterns, we have used fivedifferent support values in the range [0.01, 0.1] with thevalues 0.01, 0.025, 0.05, 0.075, and 0.1. Then, we have cal-culated the average accuracy for these five different supportvalues. The average accuracies for each heuristic on realdata are given in Figure 3(a). As it is seen from the figure,Smart-SRA is much better than previous heuristics. WhileSmart-SRA has accuracy near 60%, the navigation orientedheuristic has accuracy near 25% and time oriented heuristichas accuracy near 27%. From the experimental results onour real data we can also observe several drawbacks of pre-vious heuristics, which were also reflected in these results.First, time-oriented heuristics don’t consider link informa-tion. In most of the cases, a web user navigates betweenpages by using hyperlinks in the pages. When the link infor-mation is not used, the sessions become an unrelated set ofaccessed web pages without navigation paths. In navigation-oriented approach, backward browser movements becomes amajor problem, since the rest of the session always corre-sponds to forward movements in web topology graph. Also,Navigation oriented heuristics only consider the nearest pagefor navigating a new page which current one does not refer.

5.4 Experimental Results on Large Web SiteIn this section, we provide experimental results over web

server logs of ceng.metu.edu.tr for analyzing effect of the re-ferrer information, seqeunce length and accuracy. Our testdomain is visited by more than 1500 unique visitors request-ing about 5K web pages daily. We have used six monthserver logs (which contains nearly 1M web page requests) forour experiments and focused on determining the effects ofthe referrer information on the performance of two versionsof Smart-SRA, namely the one that uses the referrer infor-


168

smart miner: a new framework for mining large scale web usage data

Documents