[communications in computer and information science] trends in computer science, engineering and...

D. Nagamalai, E. Renault, and M. Dhanushkodi (Eds.): CCSEIT 2011, CCIS 204, pp. 94–100, 2011. © Springer-Verlag Berlin Heidelberg 2011

Mining Web Path Traversals Based on Generation of FP Tree with Utility

E. Poovammal and Pillai Cigith

Department of Computer Science, SRM University, Kattankulathur, Chennai 603203, Tamil Nadu, India

[email protected],[email protected]

Abstract. Web Mining is the mining of information from the Web. Web Usage Mining extracts user access patterns so that the desired web page can be delivered quickly. While a user accesses the Web, the user’s activities get recorded in a web server log. If mined properly with a data mining algorithm, the web log acts as a gateway to the user’s interests. However, merely considering the presence or absence of a web page in the web log does not gauge its importance to the user. Hence, in this work two parameters, namely the frequency of the access pattern and a utility value, are successfully used to find better access patterns of the user. This paper uses Frequent Pattern (FP) Tree generation to find the frequency of access patterns, and the time spent by the user on each web page is taken as the utility value.

Keywords: FP Tree, Utility, Usage patterns, Web Usage Mining, frequent user access patterns.

1 Introduction

The World Wide Web (WWW) is a huge galaxy of information which acts as a source for a rich and dynamic collection of hyperlink data with usage characteristics. Data mining techniques are widely applied to obtain useful and understandable patterns from these massive sources of data. All the data mining techniques applied to the Web come under the category of Web mining. Web mining can be broadly classified into three parts, i.e. content mining, usage mining, and link structure mining. Web Content Mining is the discovery of useful information from web content, data and documents such as text, image, audio and video. Web Structure Mining is the process of discovering knowledge from the organization of the World Wide Web and the links between web pages. Web Usage Mining is the process of extracting interesting patterns or knowledge from various web access log records.

A Web server usually registers a log entry in a web log for every access of a Web page. Web Usage Mining applies data mining techniques to web log entries to find association patterns, sequential patterns, and trends of Web access. Analyzing and exploring regularities in web log entries can identify potential customers for electronic commerce, improve the quality and performance of Web service systems, and optimize the site architecture to cater to the preferences of end users.


When it comes to mining frequent patterns, most of the studies carried out previously adopted an Apriori [1]-like approach based on the anti-monotone property: if any length k pattern is not frequent in the database, its length (k + 1) super-pattern can never be frequent.

But an Apriori-like algorithm can suffer from two nontrivial costs:

- It is costly to handle a huge number of candidate sets.
- It is tedious to repeatedly scan the database and check a large set of candidates by pattern matching.

Hence, we use another data mining algorithm known as FP Tree [2],[3] generation, which has the following advantages:

- It does not involve creation of candidate sets.
- It makes use of a compact data structure called the FP Tree, which is an extended prefix tree structure storing important information about frequent patterns.
- It makes use of a partitioning-based divide-and-conquer approach instead of Apriori’s level-wise approach.

However, considering only the binary frequency (presence or absence) of each website in a web traversal path is not a good measure for finding interesting web traversal paths, because each user may spend a different amount of time on each website.

This paper uses an FP Tree with a utility value to give more relevant results to the user. The approach first pre-processes the web server log and feeds the resulting data to the FP Tree algorithm, which mines the frequent patterns. After mining, the utility value is calculated, which yields patterns that are useful to the user and help in decision making.

The remainder of this paper is divided as follows. In section 2 we describe related work. In section 3 we describe the problem definition. In section 4 we present the experiments and their results. Finally conclusions are presented in section 5.

2 Related Work

Chen et al. [4] developed two efficient algorithms for determining large reference sequences for web traversal paths. The first one, called the full-scan (FS) algorithm, resolved the discrepancy between traversal patterns and association rules. The second one, called the selective-scan (SS) algorithm, is able to avoid database scans in some passes so as to reduce the disk I/O cost involved. But both algorithms had limitations: they required a large number of database scans, and candidate generation for large databases was difficult.

Sequential pattern mining [5] discovered sequential patterns from customer sequence databases. But sequential pattern mining had difficulty mining closed and maximal sequential patterns as well as approximate sequential patterns. The Two-Phase algorithm [6], [7] was developed, based on the definitions of [8], to find high utility itemsets using the downward closure property of Apriori. The authors of [6], [7] defined the transaction weighted utilization (twu) and proved that it maintains the downward closure property. This algorithm, however, suffers from the same problem as Apriori: the level-wise candidate generation-and-test methodology.


A novel approach of finding frequent patterns without generating candidate sets was proposed in the FP Tree generation algorithm. This method uses a data structure called FP Tree which stores the database in a compressed form. A brief comparison between Apriori algorithm and FP Tree algorithm with its advantages and disadvantages can be studied in [9].

Zhou et al. [10] introduced the concept of utility into the web path traversal mining model. They adopted the definitions of utility from the high utility pattern mining model [6], [7], [8], [11] and expressed the browsing time of a user as the utility of a website. Their algorithm is based on the Two-Phase [6], [7] high utility mining algorithm, which suffers from the level-wise candidate generation-and-test methodology of the Apriori algorithm. Hence, it also generates too many candidate patterns and needs several database scans to find the frequent web traversal paths. In addition, combining both frequency and utility to find web path traversals was still not explored.

So a more effective algorithm, EUWPTM (Effective Utility based Web Path Traversal Mining) [12], was introduced; it modifies the previous work and is based on a sequential pattern approach.

3 Problem Definition

We start with the basics of all the important terms that lead to a formal definition of web path traversal mining using both FP Tree generation and Utility value.

A Web server log file contains requests made to the Web server, recorded in a chronological order. The most popular log file formats are the Common Log Format (CLF) and an extended version of the CLF, the Extended CLF. A line in the ECLF is as shown in Figure 1. Some of the fields it contains are as follows:

- The client's host name or its IP address
- The date and time of the request
- The operation type (GET, POST, HEAD, etc.)
- The status code of the request (200, 404, etc.)
- The referrer link, the type of OS and the type of browser

A sample of weblog data is as follows:

192.168.0.1 - - [29/Sep/2004:17:10:31 +0200] "GET /axis/people.shtml HTTP/1.1" 200 8289 "http://www-sop.inria.fr/axis/table.html" "Mozilla/4.0 (compatible; MSIE6.0; Windows NT 5.2; .NET CLR 1.1.4322)"

Fig. 1. A web request from INRIA’s Web Server Log in the ECLF Format
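As an illustration of how the fields listed above can be extracted, the following is a minimal sketch in Python that parses the sample request of Figure 1, written here with standard double quotes. The regular expression, the group names and the function `parse_eclf_line` are our own assumptions and are not part of the original implementation. The month abbreviation is parsed with `%b`, which assumes an English locale.

```python
import re
from datetime import datetime

# One ECLF line; the group names are illustrative labels, not a standard.
ECLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" ?'
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_eclf_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = ECLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # %z accepts numeric offsets such as +0200
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

sample = ('192.168.0.1 - - [29/Sep/2004:17:10:31 +0200] '
          '"GET /axis/people.shtml HTTP/1.1" 200 8289 '
          '"http://www-sop.inria.fr/axis/table.html" '
          '"Mozilla/4.0 (compatible; MSIE6.0; Windows NT 5.2; .NET CLR 1.1.4322)"')

entry = parse_eclf_line(sample)
print(entry["host"], entry["method"], entry["path"], entry["status"])
```

Lines that do not follow the format return `None`, so malformed entries can be discarded during pre-processing.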

When a user starts a session, he moves from one web page to another. He moves around the Web and leaves behind a trace of his activities in the web server log. Pre-processing the web server log is the most important step in using the log for mining. After pre-processing is done, all the vital information needs to be stored in a database. This database can also simply be a data structure.

A web path traversal is the set of links accessed by the user in one particular session. For example, if ‘a’, ‘b’, ‘c’ and ‘d’ are four websites and the user first goes to web page ‘a’ followed by ‘c’, ‘b’ and ‘d’ in one session, then the path traversal for the user is a-c-b-d.


Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of links, and let a Weblog $W = \{T_1, T_2, \ldots, T_n\}$, where each $T_i$ is a traversal path which contains a set of links in $I$.

The support (or occurrence frequency) of a pattern A, which is a set of links, is the number of traversals in W that contain A. A is a frequent pattern if A’s support is no less than a predefined minimum support threshold S. Given a Weblog W and a minimum support threshold S, the problem of finding the complete set of frequent path traversals is called the frequent path traversal mining problem. These web path traversals are provided as the input to the FP Tree algorithm [2].
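The definition of support can be made concrete with a small sketch. The Weblog `W` below is hypothetical data chosen for illustration, and the function names `support` and `is_frequent` are our own. Each traversal is treated as a set of links, as in the definition above.

```python
def support(pattern, weblog):
    """Number of traversal paths in the Weblog that contain every link of the pattern."""
    pattern = set(pattern)
    return sum(1 for traversal in weblog if pattern <= set(traversal))

def is_frequent(pattern, weblog, min_support):
    """A pattern is frequent if its support is no less than the threshold S."""
    return support(pattern, weblog) >= min_support

# Hypothetical Weblog W with four traversal paths over the links a, b, c, d
W = [
    ["a", "c", "b", "d"],
    ["a", "c"],
    ["b", "d"],
    ["a", "c", "d"],
]

print(support({"a", "c"}, W))          # 3
print(support({"a", "c", "d"}, W))     # 2
print(is_frequent({"b", "d"}, W, 3))   # False, since the support is 2
```

The second call also illustrates the anti-monotone property: extending the pattern {a, c} to {a, c, d} can only keep or lower its support.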

In this algorithm, a data structure called the frequent-pattern tree is constructed first, which stores significant quantitative information about the frequent patterns. Secondly, an FP Tree based pattern-fragment growth mining method is applied. It starts from a length-1 pattern (the initial suffix pattern), examines only its conditional-pattern base (a sub-database which consists of the set of frequent items co-occurring with the suffix pattern), constructs its (conditional) FP Tree and performs mining recursively on that tree.

The pattern growth is obtained by concatenating the suffix pattern with the new patterns generated from the conditional FP Tree. Finally, the search technique used is a partitioning-based, divide-and-conquer method rather than an Apriori-like level-wise generation of combinations of frequent itemsets. It employs the least frequent items as suffixes, which offers good selectivity. All these techniques contribute to a substantial reduction of search costs.
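The two steps described above can be sketched as follows. This is a simplified illustration of FP Tree construction and pattern-fragment growth, not the implementation used in this paper: the class `FPNode`, the functions `build_tree` and `fp_growth`, and the Weblog `W` are our own assumptions, and a header table of node lists is used in place of node-links. Each traversal is treated as a set of links.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_tree(weighted_paths, min_support):
    """Build an FP Tree from (items, count) pairs; return its header table and item supports."""
    counts = defaultdict(int)
    for items, count in weighted_paths:
        for item in set(items):
            counts[item] += count
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    root = FPNode(None, None)
    header = defaultdict(list)  # item -> all tree nodes labelled with that item
    for items, count in weighted_paths:
        # keep frequent items only, most frequent first (ties broken by name)
        ordered = sorted((i for i in set(items) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += count
    return header, frequent

def fp_growth(weighted_paths, min_support, suffix=frozenset()):
    """Yield (pattern, support) for every frequent pattern that extends the suffix."""
    header, frequent = build_tree(weighted_paths, min_support)
    # the least frequent items are taken as suffix patterns first
    for item in sorted(frequent, key=lambda i: (frequent[i], i)):
        pattern = suffix | {item}
        yield pattern, frequent[item]
        # conditional-pattern base: the prefix path of every node carrying this item
        base = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        # mine the conditional FP Tree recursively
        yield from fp_growth(base, min_support, pattern)

# Hypothetical Weblog; every traversal path has an initial count of 1
W = [["a", "c", "b", "d"], ["a", "c"], ["b", "d"], ["a", "c", "d"]]
patterns = dict(fp_growth([(path, 1) for path in W], min_support=2))
for pattern, count in sorted(patterns.items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(pattern), count)
```

No candidate sets are generated: each recursive call only looks at the prefix paths that co-occur with the current suffix pattern.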

Merely considering the presence or absence of a web page does not really judge the importance of that web page. Different web pages mean different things to different users. Hence a parameter is chosen which acts as a measure of the importance of a page to the particular user. Here the parameter taken is the time spent by each user on the web page, known as the utility [10] value.

3.1 The Proposed Approach

Pre-processing the web server log [13] removes useless data such as requests made by bots and spiders, Page Not Found errors (status code 404), etc. This cleaned data is the input to the FP Growth algorithm. The utility value is calculated using the following steps:

i. The identification of users is done based on their IP addresses. Here the assumption made is that each individual user uses the same machine.

ii. The utility value of a web page is calculated using the timestamp values of consecutive requests of the user. For example:

Table 1. A sample server log

Users              Links Traversed
174.129.130.230    A(04:23:56)
83.109.145.234     P(04:24:07)
174.129.130.230    C(04:24:11)
83.109.145.234     S(04:24:13)


Table 2. Transformed one

Users              Links Traversed
174.129.130.230    A(04:23:56), C(04:24:11)

Consider the sample of the pre-processed server log shown in Table 1. Grouping its entries by user gives the result shown in Table 2.

And finally the utility value for page A is calculated as follows:

Utility (A) = Timestamp (C) – Timestamp (A) (1)

So, in this example Utility (A) = 15 secs.

iii. At the end, those links are selected whose utility value satisfies a minimum utility threshold (Ω) value.
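Steps i to iii can be sketched as follows, using the entries of Table 1 as input. The function names `page_utilities` and `select_links` and the threshold of 10 seconds are assumptions made for illustration. The sketch also assumes that all timestamps fall on the same day, since Table 1 gives times without dates.

```python
from collections import defaultdict
from datetime import datetime

def page_utilities(log):
    """Map each user (IP address) to a list of (page, seconds spent) pairs.

    The time spent on a page is the gap to the same user's next request,
    so the last page requested by each user gets no utility value.
    """
    by_user = defaultdict(list)
    for ip, page, stamp in log:
        by_user[ip].append((datetime.strptime(stamp, "%H:%M:%S"), page))
    utilities = {}
    for ip, visits in by_user.items():
        visits.sort()  # chronological order per user
        utilities[ip] = [
            (page, (next_time - time).total_seconds())
            for (time, page), (next_time, _) in zip(visits, visits[1:])
        ]
    return utilities

def select_links(utilities, min_utility):
    """Keep pages whose utility meets the threshold, highest utility first."""
    return {
        ip: sorted((pair for pair in pairs if pair[1] >= min_utility),
                   key=lambda pair: -pair[1])
        for ip, pairs in utilities.items()
    }

# The sample server log of Table 1
log = [
    ("174.129.130.230", "A", "04:23:56"),
    ("83.109.145.234", "P", "04:24:07"),
    ("174.129.130.230", "C", "04:24:11"),
    ("83.109.145.234", "S", "04:24:13"),
]

utilities = page_utilities(log)
print(utilities["174.129.130.230"])   # [('A', 15.0)]
selected = select_links(utilities, min_utility=10)  # hypothetical threshold of 10 secs
print(selected)
```

With this hypothetical threshold, page A (15 secs) is kept while page P (6 secs) is dropped.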

4 Experimental Results

To evaluate the performance of the proposed method, experiments were conducted on the log file access.log of the web server of www.vinrcorp.com, which had 389 entries. For a detailed description of the entries, refer to Section 3 of this paper. Among these entries there were requests from spiders and robots, which are not useful in mining. These were removed, which reduced the number of entries on which the algorithm is applied. After cleaning the data, 362 entries were obtained. These entries were logged for 81 different users who requested 138 unique web pages. Figure 2 shows a snapshot of the output of the parsed server log. The programs were written in Java and the IDE used was NetBeans 6.9.1. The programs were run on the Windows 7 operating system on an Intel Core i3 processor with a 2.26 GHz CPU and 4 GB RAM.
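The cleaning step could look like the following minimal sketch. The entries are hypothetical parsed log records, and the list of bot markers is an assumption: the paper does not state how spiders and robots were recognized. Entries with status code 404 are also dropped, as described in Section 3.1.

```python
# Assumed substrings that identify crawlers in the user agent (not taken from the paper)
BOT_MARKERS = ("bot", "spider", "crawler")

def is_useful(entry):
    """Return False for requests made by crawlers and for Page Not Found errors."""
    agent = entry["agent"].lower()
    if any(marker in agent for marker in BOT_MARKERS):
        return False
    if entry["path"] == "/robots.txt":
        return False
    return entry["status"] != 404

def clean(entries):
    """Keep only the log entries that are useful for mining."""
    return [entry for entry in entries if is_useful(entry)]

# Hypothetical parsed entries
entries = [
    {"path": "/index.html", "status": 200, "agent": "Mozilla/4.0 (compatible; MSIE 6.0)"},
    {"path": "/index.html", "status": 200, "agent": "Googlebot/2.1"},
    {"path": "/robots.txt", "status": 200, "agent": "Mozilla/5.0"},
    {"path": "/missing.html", "status": 404, "agent": "Mozilla/5.0"},
]

cleaned = clean(entries)
print(len(entries), "entries before cleaning,", len(cleaned), "after")
```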

Fig. 2. shows a snapshot of the output of the parsed web server log. The values are separated by tab. To the extreme left is the ip address, in the middle is the time of access and at the end the link accessed.

The BMS-WebView-1 dataset contains several months’ worth of click-stream data from two e-commerce web sites. Each transaction in these datasets is a web session consisting of all the product detail pages viewed in that session. The goal for the dataset is to find associations between products viewed by visitors in a single visit to the web site. This dataset contains 59,602 transactions (with 497 distinct items). Figure 3 shows the runtime performance curve for the BMS-WebView-1 dataset.

Fig. 3. Runtime performance curve for the BMS-WebView-1 dataset, measured against minimum utility threshold on the x-axis and runtime (in secs) on the y-axis.

Another dataset, the kosarak dataset, contains web click-stream data of a Hungarian on-line news portal. It is a big dataset containing almost one million transactions (990,002) and 41,270 distinct items. Figure 4 shows the runtime performance curve for the kosarak dataset.

Fig. 4. Runtime performance curve for the kosarak dataset, measured against minimum utility threshold on the x-axis and runtime (in secs) on the y-axis.

5 Conclusion

Discovering frequent Web access sequences from Weblog databases is not only useful in improving website design but also leads to better marketing decisions. However, considering only the frequency of web pages does not reflect the importance of a web page to the user, because the user might not visit an important web page as frequently as, say, web pages for daily news or personal e-mail. Taking web pages into account only on the basis of frequency will therefore easily neglect a web page which is not frequent but is important to the user. The quality of the web pages thus obtained has a big impact on the time taken to search effectively.


In this paper, an attempt has been made to combine two measures, namely the frequency of the access pattern and the utility value. The FP Growth algorithm finds the frequently accessed patterns, while the utility mining technique incorporates the ‘utility value’, which measures the significance of a web page for the user.

The proposed method reduces the processing time. For a particular user, this method gives a set of links which satisfy a minimum support threshold and also arranges all the links traversed by the user in descending order of utility value.

The links obtained from the proposed method are checked for both ‘frequency’ and ‘interestingness’, and the results are found to be significant for the decision-making process. Links which were important but were previously discarded because of their low frequency are also included in the final result.

References

1. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules in Large Databases. In: 20th International Conference on Very Large Data Bases, pp. 487–499 (1994)

2. Han, J., Pei, J., Yin, J.: Mining Frequent Patterns without Candidate Generation. In: ACM-SIGMOD International Conference Management of Data, Dallas, TX, pp. 1–12 (2000)

3. Aiman, M.S., Dominic, P.D.D., Azween, B.A.: A Comparative Study of FP-growth Variations. J. Computer Science and Network Security 09 (2009)

4. Chen, M., Park, J., Yu, P.: Efficient data mining for path traversal patterns. IEEE Transactions on Knowledge and Data Engineering, 209–221 (1998)

5. Yao, H., Hamilton, H.: Mining itemset utilities from transaction databases. Data & Knowledge Engineering 59, 603–626 (2006)

6. Liu, Y., Liao, W., Choudhary, A.: A fast high utility itemsets mining algorithm. In: 1st International Conference on Utility-Based Data Mining, pp. 90–99 (2005)

7. Liu, Y., Liao, W., Choudhary, A.: A Two Phase algorithm for fast discovery of High Utility of Itemsets. In: Ninth Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 689–695 (2005)

8. Yao, H., Hamilton, H., Butz, C.: A Foundational Approach to Mining Itemset Utilities from Databases. In: Third SIAM International Conference on Data Mining, pp. 482–486 (2004)

9. Santosh Kumar, B., Rukmani, K.: Implementation of Web Usage Mining Using APRIORI and FP Growth Algorithms. J. Advanced Networking and Applications 400(01), 400–404 (2010)

10. Zhou, L., Liu, Y., Wang, J., Shi, Y.: Utility-based Web Path Traversal Pattern Mining. In: Seventh IEEE International conference on Data Mining Workshops, pp. 373–378 (2007)

11. Pei, J., Han, J., Asl, B., Wang, J., Pinto, H., Chen, Q., Dayal, U., Hsu, M.: Mining sequential patterns by pattern-growth: The PrefixSpan Approach. IEEE Transactions on Knowledge and Data Engineering 16, 1424–1440 (2004)

12. Chowdhury, F., Syed, T., Byeong-Soo, J., Young-Koo, L.: Efficient Mining of Utility Based Web Path Traversals. In: International Conference on Advanced Communication Technology 2009 (2009) ISBN 978-89-5519-139-4

13. Mohd, W., Mohd, H.M., Hafizul, F.H., Mohamad, F.M.M.: Data Pre-processing on Web Server Logs for Generalized Association Rules Mining Algorithm. World Academy of Science, Engineering and Technology 48 (2008)
