
Journal of Mathematics and Technology, ISSN: 2078-0257, No.3, August, 2010


AN EFFICIENT METHOD FOR EXTRACTING THE USER PROFILE FROM DYNAMIC WEB SITE USING HUNC ALGORITHM

P. Boopathi 1, Prof. M. Sadish Sendil 2, Dr. S. Karthik 3

1 Final Year M.E-CSE, 2 Associate Professor, 3 Prof. and Head Department of Computer Science and Engineering, SNS College of Technology, Tamil Nadu (INDIA)

E-mail: [email protected]

ABSTRACT

Most companies now maintain web sites for their business, and most of their customers register their details as user profiles. The Web holds a huge collection of documents, and sophisticated knowledge extraction methods are required to access the information they contain effectively. Such methods include machine learning and data mining techniques for information categorization, extraction, and search, as well as for adapting to the interests of users. Mining user profiles makes it possible to improve customer relationships, for example by sending offer messages and announcements of new products and services, and to change the web site structure according to customer requirements. The attractiveness of a web site, in terms of both content and structure, is crucial to many applications, e.g. a product catalog for e-commerce. Web usage mining provides detailed feedback on user behavior, giving the web designer information on which to base redesign decisions.

Key words: web usage mining; user profiles; clustering; data stream.

1. INTRODUCTION

Organizations that provide online services, ranging from e-commerce transactions to document and multimedia browsing repositories, are in continuous competition with each other to keep their existing customers and to lure new ones. Furthermore, this competition is increasing because of the relative ease of starting an online business and the fact that a competitor's website is often just one or two clicks away. The challenging nature of the online market has motivated online companies to start monitoring their online usage activity data to better understand and satisfy their website users. However, the tremendous number of click streams generated every day results in a huge amount of data, so analyzing it with conventional methods is neither possible nor cost-effective. As a result, it has become imperative to use automated and effective data mining methods to turn this raw data into knowledge that can help online organizations better understand their users.

Web usage mining is the type of Web mining activity that involves the automatic discovery of user access patterns from one or more Web servers. As more organizations rely on the Internet and the World Wide Web to conduct business, the traditional strategies and techniques for market analysis need to be revisited in this context. Organizations often generate and collect large volumes of data in their daily operations. Most of this information is generated automatically by Web servers and collected in server access logs. Other sources of user information include referrer logs, which contain information about the referring pages for each page reference, and user registration or survey data gathered via tools such as CGI scripts.

The goal of Web usage mining is to capture and model the behavioral patterns and profiles of users interacting with a Web site. The discovered patterns are usually represented as collections of pages that are frequently accessed by groups of users with common needs or interests. Such patterns can be used to better understand the behavioral characteristics of visitor or user segments, improve the organization and structure of the site, and create a personalized experience for visitors by providing dynamic recommendations. This approach allows the system to recommend pages to a user based not only on a matching usage profile, but also on the content similarity of these pages to the pages the user has already visited.

Most of the existing Web analysis tools provide mechanisms for reporting user activity in the servers and various forms of data filtering. Using such tools, it is possible, for example, to determine the number of accesses to the server and to the individual files within the organization's Web space, the times or time intervals of visits, and the domain names and URLs of users of the Web server. In general, however, these tools are designed to handle low to moderate traffic servers, and they usually provide little or no analysis of data relationships among the accessed files and directories within the Web space.

2. RELATED WORK

A. Overview of Web Usage Mining

Web usage mining is the process of applying data mining techniques to extract useful knowledge, such as typical usage patterns, from web log data. The analysis of the discovered usage patterns can help online organizations gain a variety of business benefits, such as developing cross-product marketing strategies, enhancing promotional campaigns, and web personalization. Early research in web usage mining [1,3,8,9] addressed the challenges of preprocessing, usage data analysis and modeling, and data mining, which ranged from clustering to association and sequential rule mining. Several tools have thus been developed [1,2,9,10,11] to infer usage patterns from web usage data (pattern discovery) and then to interpret these usage patterns (pattern analysis).

B. Web Usage Mining Process

Typically, discovering web usage patterns, such as profiles or prediction models, consists of three steps: preprocessing the raw usage data, discovering patterns from the preprocessed data, and analyzing the discovered patterns.

Fig. 1. Web Usage Mining Process

There are two primary tasks in preprocessing: data cleaning and transaction identification, also known as sessionization. Data cleaning eliminates irrelevant items, such as image requests and search engine requests, from the server log. The transaction identification process groups the sequences of page requests into logical units called sessions; a session is the set of pages visited by a single user within a predefined period of time. After preprocessing, the web sessions are used as input to pattern discovery methods that are typically rooted in areas such as data mining, artificial intelligence, or statistics. These discovery methods may include statistical analysis, sequential pattern mining [16], path analysis [11,14,15,17], association rule mining [18,19], classification, and clustering [7,12,13]. After discovery, the usage patterns are analyzed to better understand and interpret them, using a variety of analysis tools from the fields of statistics, graphics, visualization, or database querying. Examples of analysis tools can be found in refs. [10, 15, 18].
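The data cleaning task can be illustrated with a minimal Python sketch. The log entry layout (a dictionary with `url`, `status`, and `agent` keys), the list of ignored file extensions, and the crawler markers are all assumptions made for this example; they are not taken from the paper or from any specific server log format.

```python
from urllib.parse import urlparse

# Assumed for illustration: extensions treated as embedded resources and
# user-agent substrings treated as crawlers or search engine robots.
IGNORED_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
CRAWLER_MARKERS = ("bot", "crawler", "spider")


def is_relevant(entry):
    """Return True if a log entry should survive data cleaning."""
    path = urlparse(entry["url"]).path.lower()
    if path.endswith(IGNORED_EXTENSIONS):
        return False  # image or other embedded-resource request
    agent = entry.get("agent", "").lower()
    if any(marker in agent for marker in CRAWLER_MARKERS):
        return False  # request issued by a crawling agent
    if not 200 <= entry["status"] < 400:
        return False  # unsuccessful request
    return True


def clean_log(entries):
    """Keep only the relevant entries, preserving their original order."""
    return [entry for entry in entries if is_relevant(entry)]


# Hypothetical sample entries.
sample_log = [
    {"ip": "10.0.0.1", "url": "/products.html", "status": 200, "agent": "Mozilla/5.0"},
    {"ip": "10.0.0.1", "url": "/img/logo.gif", "status": 200, "agent": "Mozilla/5.0"},
    {"ip": "10.0.0.2", "url": "/index.html", "status": 200, "agent": "ExampleBot/1.0"},
    {"ip": "10.0.0.3", "url": "/missing.html", "status": 404, "agent": "Mozilla/5.0"},
]
cleaned = clean_log(sample_log)
```

In this sample only the request for `/products.html` remains after cleaning. A real deployment would need filter lists tuned to the site in question.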

C. The Hierarchical Unsupervised Niche Clustering Algorithm

HUNC is a hierarchical version of the unsupervised niche clustering (UNC) algorithm. UNC is an evolutionary approach to clustering proposed by Nasraoui and Krishnapuram in Ref. [8] that uses a genetic algorithm (GA) to evolve a population of cluster prototypes through generations of competition and reproduction. UNC has proven to be robust to noise, but was formulated based on a Euclidean metric space representation of the data. Later, the HUNC algorithm [10] was proposed to generate a hierarchy of clusters, which gives more insight into the web mining process and makes it more efficient in terms of speed. HUNC does not assume the number of clusters in advance, can provide profiles to match any desired level of detail, and requires no analytical derivation of the prototypes. This allows HUNC to use specialized similarity measures based only on the user access patterns. The HUNC algorithm is shown in Algorithm 1.
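As an illustration of a similarity measure based only on user access patterns, the following Python sketch encodes each session as a binary vector over the site's URLs and compares two sessions with cosine similarity. The choice of cosine similarity, the function names, and the sample URLs are assumptions for this example; the paper does not state which measure HUNC uses.

```python
import math


def session_vector(session_urls, all_urls):
    """Encode a session as a binary vector: 1 if the URL was visited, else 0."""
    visited = set(session_urls)
    return [1 if url in visited else 0 for url in all_urls]


def cosine_similarity(a, b):
    """Cosine of the angle between two binary session vectors."""
    shared = sum(x * y for x, y in zip(a, b))
    # For 0/1 vectors the squared norm equals the number of visited URLs.
    norm = math.sqrt(sum(a)) * math.sqrt(sum(b))
    return shared / norm if norm else 0.0


# Hypothetical site with four pages and two sessions sharing one page.
all_urls = ["/index.html", "/products.html", "/cart.html", "/contact.html"]
s1 = session_vector(["/index.html", "/products.html"], all_urls)
s2 = session_vector(["/products.html", "/cart.html"], all_urls)
similarity = cosine_similarity(s1, s2)
```

Because the measure needs only the sets of visited pages, it can be computed without placing sessions in a Euclidean space, which is the property the text attributes to HUNC.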

D. Tracking User Interests through Prior-Learning of Context

Recently, many systems have been developed that recommend information, products, and other items. These systems try to help users find pieces of information or other objects in which they could be interested. In a similar way, adaptive hypermedia systems build a model of the goals and preferences of each user and use this model to adapt the interaction to the needs of the user. Many of those systems use machine learning methods for learning from observations about the user. However, user interests and preferences can change over time. Some of the systems are provided with mechanisms that are able to track drifting user interests. The problem of learning drifting user interests is relevant to the problem known as concept drift in the area of machine learning.

It is assumed that the user interests do not only change, but also possibly recur. The user interests can be quite wide and the user can currently focus her attention on a small subset of her broad interests. For example, the whole set of user interests in the case of Internet browsing can include interests that are relevant to her job, as well as her hobbies, etc. Even the user's job related interests could be quite extensive and interdisciplinary. A system that assists the user in web browsing should be flexible enough to recognize what her current interests are and provide her with relevant recommendations. A possible approach is to learn about current user interests from a time window that includes recent relevant observations only. However, if the current user interests often change, a precise user profile cannot be learned from a small set of relevant recent observations only. Hence, the system can search for past episodes where the user has demonstrated a similar set of interests and try to learn a more precise description of the current user interests, remembering relevant and forgetting irrelevant observations.
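The time-window approach mentioned above can be sketched in a few lines of Python. The observation format (timestamp and topic pairs), the window length, and the function name `recent_interests` are hypothetical choices for this example, not part of the algorithm the paper describes.

```python
from collections import Counter


def recent_interests(observations, now, window, top_n=3):
    """Return the most frequent topics seen in the last `window` time units.

    observations: iterable of (timestamp, topic) pairs.
    """
    counts = Counter(
        topic
        for timestamp, topic in observations
        if now - window <= timestamp <= now
    )
    return [topic for topic, _ in counts.most_common(top_n)]


# Hypothetical browsing history: an older interest followed by newer ones.
observations = [
    (1, "gardening"), (2, "gardening"), (3, "gardening"),
    (8, "databases"), (9, "databases"), (10, "clustering"),
]
current = recent_interests(observations, now=10, window=3)
```

With a window of three time units the older interest in gardening is forgotten entirely, which shows the weakness noted in the text: a small window yields few observations, so a recurring interest has to be recovered from past episodes rather than from the window alone.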

This system presents such an algorithm for tracking changing user interests and preferences in the presence of changing and recurring context. First, the algorithm learns about current context. Subsequently, it selects past episodes that are relevant to this context and eventually it learns concept descriptions from the selected sample profiles.



Web usage mining applies data mining techniques to the usage of Web resources, as recorded in Web server logs or other logs of requested URLs. Web logs were initially designed to help site administrators identify traffic and possible bandwidth problems, broken links, etc., and were analyzed using simple statistics such as hit and page view counts. Increasingly, their value for understanding site users' behavior is also being recognized, and techniques such as association rule mining, clustering, and sequential pattern discovery are being used to identify co-occurring items in browsing and shopping histories, different user segments, navigation strategies, etc.

3. DESIGN GOALS

A. Profile Discovery Based On Web Usage Mining

The framework for Web usage mining follows a road map that starts with the integration and preprocessing of Web server logs and server content databases, including data cleaning and sessionization, and then continues with data mining/pattern discovery via clustering. This is followed by postprocessing of the clustering results to obtain Web user profiles, and finally ends with tracking profile evolution. The automatic identification of user profiles is a knowledge discovery task that consists of periodically mining new contents of the user access log files. It is summarized in the following steps:

1. Preprocess the Web log file to extract user sessions.
2. Cluster the user sessions or profiles by using Hierarchical Unsupervised Niche Clustering (H-UNC).
3. Summarize session clusters/categories into user profiles.
4. Enrich the user profiles with additional facets by using additional Web log data and external domain knowledge.
5. Track current profiles against existing profiles.

1) Pre-processing the Logs

The first step in preprocessing is data cleaning, where all irrelevant elements such as images, requests from crawling agents, and unsuccessful requests are removed. The next step in preprocessing is sessionization, where consecutive page requests with the same originating IP address, session, or cookie ID are grouped into units called sessions. Each session represents all the pages visited by a single user within a predefined period of time.
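Sessionization can be sketched as follows in Python. The request layout (visitor identifier, timestamp in seconds, URL) and the 30-minute inactivity timeout are assumptions for this example; the paper does not specify a timeout value or a log format.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed value, not from the paper


def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Group (visitor_id, timestamp, url) requests into per-visitor sessions.

    A new session starts whenever the gap between two consecutive requests
    from the same visitor exceeds `timeout` seconds.
    """
    by_visitor = defaultdict(list)
    for visitor_id, timestamp, url in requests:
        by_visitor[visitor_id].append((timestamp, url))

    sessions = []
    for visitor_id, hits in by_visitor.items():
        hits.sort()  # chronological order within one visitor
        current = [hits[0][1]]
        for (prev_time, _), (time, url) in zip(hits, hits[1:]):
            if time - prev_time > timeout:
                sessions.append((visitor_id, current))
                current = []
            current.append(url)
        sessions.append((visitor_id, current))
    return sessions


# Hypothetical cleaned requests; the visitor id stands in for an IP address
# or cookie ID.
requests = [
    ("10.0.0.1", 0, "/index.html"),
    ("10.0.0.1", 60, "/products.html"),
    ("10.0.0.2", 30, "/index.html"),
    ("10.0.0.1", 4000, "/contact.html"),
]
sessions = sessionize(requests)
```

Here the first visitor produces two sessions, because the last request arrives more than 30 minutes after the previous one, and the second visitor produces one. Using the IP address alone as the visitor identifier can merge users behind a shared proxy, which is one reason a session or cookie ID is preferable when it is available.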

Fig. 2. Flow chart for clustering of user profiles

2) Clustering Sessions into an Optimal Number of Categories

The main outline of the H-UNC algorithm was given in the previous section. H-UNC is used instead of other clustering algorithms because, unlike most of them, it can handle noise in the data and automatically determines the number of clusters.

3) Post processing and Enrichment of Session Clusters into Multifaceted User Profiles

After the sessions are grouped into clusters, the session categories are summarized in terms of user profile vectors. The profile properties include the following facets:

1. Search queries. These are queries submitted to search engines before visiting the Web site for sessions that belong to this profile.
2. Inquiring companies. These are companies/organizations of registered users or unregistered users whose IP addresses can be mapped.
3. Inquired companies. These are companies/organizations that have been inquired about during the sessions belonging to this profile.

4) Tracking Evolving User Profiles

Tracking different profile events across different time periods can generate a better understanding of the evolution of user access patterns and seasonality. Note that both profiles and click streams are typically evolving, since the profiles are nothing more than summaries of the click streams, which are themselves evolving. Each profile p_i is discovered along with an automatically determined measure of scale σ_i that represents the amount of variance or dispersion of the user sessions in a given cluster around the cluster representative. This measure is used to determine the boundary around each cluster and thus makes it possible to determine automatically whether two profiles are compatible.

Two profiles are compatible if their boundaries overlap. The notion of compatibility between profiles is essential for tracking evolving profiles. After the Web log of a given period has been mined, all the profiles discovered in the current batch are automatically compared with the profiles discovered in the previous batch. The comparison is performed by a sequence of SQL queries on the profiles that have been stored in a database, as shown in the Track Profiles Algorithm. A typical query for retrieving corresponding profiles between Periods T1 and T11 is: SELECT ThisProfile, TothisProfile FROM ProfileTrail WHERE Period = T1.
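One possible reading of the compatibility test is sketched below in Python. It assumes that a profile is a dictionary mapping URLs to weights, that the distance between two profiles is one minus their cosine similarity, and that boundaries overlap when this distance does not exceed the sum of the two scale values. These are illustrative assumptions; the paper does not give the exact distance or overlap criterion.

```python
import math


def dissimilarity(p, q):
    """1 - cosine similarity between two profiles (URL -> weight dicts)."""
    dot = sum(weight * q.get(url, 0.0) for url, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)


def compatible(p, sigma_p, q, sigma_q):
    """Assumed overlap test: distance no larger than the sum of the scales."""
    return dissimilarity(p, q) <= sigma_p + sigma_q


# Hypothetical profiles from two periods, each with an assumed scale of 0.2.
old_profile = {"/index.html": 1.0, "/products.html": 0.8}
new_profile = {"/index.html": 0.9, "/products.html": 0.7}
other_profile = {"/contact.html": 1.0}

same_interest = compatible(old_profile, 0.2, new_profile, 0.2)
different_interest = compatible(old_profile, 0.2, other_profile, 0.2)
```

The first pair is judged compatible because the two profiles weight the same pages almost identically, while the second pair shares no pages and is not.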

A profile evolution event is defined as a coarse categorization of the possible real evolution scenarios that describe how profiles discovered during one period relate to profiles discovered in another period.
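The following Python sketch shows how the birth, persistence, and atavism events named in this paper might be assigned. The definitions used here are assumptions for illustration: persistence means the profile is compatible with a profile from the immediately preceding period, atavism means it is compatible only with a profile from an older period, and birth means it is compatible with nothing seen before. The `shares_url` test is a deliberately simple stand-in for a real compatibility check.

```python
def classify_event(profile, previous_period, earlier_periods, compatible):
    """Label a newly discovered profile with an assumed evolution event."""
    if any(compatible(profile, old) for old in previous_period):
        return "persistence"
    if any(compatible(profile, old)
           for period in earlier_periods
           for old in period):
        return "atavism"
    return "birth"


def shares_url(a, b):
    """Stand-in compatibility test for this demo: any URL in common."""
    return bool(a & b)


# Hypothetical profiles, each represented as a set of URLs.
earlier = [[{"/sale.html", "/gifts.html"}]]
previous = [{"/index.html", "/products.html"}]
current = [
    {"/products.html", "/cart.html"},  # overlaps the previous period
    {"/gifts.html"},                   # overlaps only an older period
    {"/careers.html"},                 # overlaps nothing seen so far
]
events = [classify_event(p, previous, earlier, shares_url) for p in current]
```

Checking the previous period before older periods means a profile that matches both is reported as persistence rather than atavism, which is a design choice of this sketch.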

Graph 1. Profile Evolution for Birth, persistence and Atavism

Table 1 shows the profile evolution for birth, persistence, and atavism.

Table 1. Profile Evolution for Birth, persistence and Atavism

4. CONCLUSION

Analyzing user profiles reveals interesting usage patterns. In the future, e-commerce will play a crucial role in all sectors of developing countries, and the analysis of user profiles will help retain satisfied customers over the long term. On a dynamic web site, predicting user interests and changing the site design accordingly will encourage customers to visit again and again. Designing adaptive web sites according to user interests will benefit portal-type websites and improve advertisement revenue. Clustering user profiles with enriched web log data can improve the sales of seasonal products, such as products sold at Christmas, Ramjan, and Diwali. With the growth of Web-based applications, specifically electronic commerce, there is significant interest in analyzing Web usage data to better understand Web usage and to apply this knowledge to better serve users. This has led to a number of commercial offerings for web usage analysis. This type of analysis will benefit customer relationship management (CRM). Repeating the analysis on a daily, weekly, or bi-weekly basis will make it more effective for customer satisfaction.

5. ACKNOWLEDGEMENTS

The authors would like to thank the Director cum Secretary, Correspondent, and Principal of SNS College of Technology, Coimbatore, for their motivation and constant encouragement. The authors would also like to thank the Faculty Members of the Department of Computer Science and Engineering for their critical review of this manuscript and for their valuable input and fruitful discussions. The authors also extend their gratitude to their family members and friends, who rendered their support throughout this research work.



REFERENCES

1. Cooley R, Mobasher B, and Srivastava J (1997), Web Mining: Information and Pattern Discovery on the World Wide Web, Proc. Ninth IEEE Intl Conf. Tools with AI (ICTAI 97), pp. 558-567.

2. Dai J and Mobasher B (2002), Using Ontologies to Discover Domain-Level Web Usage Profiles, Proc. Second ECML/PKDD Semantic Web Mining Workshop.

3. Crabtree I and Soltysiak S (2004), Identifying and Tracking Changing Interests, Intl J. Digital Libraries, vol. 2, pp. 38-53.

4. Karthik S, Arunachalam V.P, and Ravichandran T (2008), Multi directional geographical Traceback with in directions generalization, J. Comput. Sci., vol. 4, pp. 646-651. http://www.scipub.org/fulltext/jcs/jcs48646-651.pdf

5. Karthik S, Arunachalam V.P, and Ravichandran T (2009), Analyzing interaction between denial of service (dos) attacks and threats, Int. J. Soft Computing, vol. 4, pp. 68-75. DOI: 10.3923/ijscomp.2009.68.75

6. Karthik S, Arunachalam V.P, and Bhavdharini R.M (2008), Analyzing interaction between denial of service (dos) attacks and threats, Proceeding of the IEEE International Conference on Computing, Communication and Networking, Dec. 18-20, IEEE Xplore Press, USA, pp. 1-9. DOI: 10.1109/ICCCNET.2008.4787663

7. Koychev J (2000), Gradual Forgetting for Adaptation to Concept Drift, Proc. ECAI Workshop Current Issues in Spatio-Temporal Reasoning, pp. 101.

8. Maloof A and Michalski R.S (1995), Learning Evolving Concepts Using Partial Memory Approach, Working Notes AAAI Fall Symp. Active Learning, pp. 70-73.

9. Nasraoui O, Krishnapuram R, Frigui H, and Joshi A (2000), Extracting Web User Profiles Using Relational Competitive Fuzzy Clustering, Intl J. Artificial Intelligence Tools, vol. 9, no. 4, pp. 509.

10. Nasraoui O and Krishnapuram R (2002), A New Evolutionary Approach to Web Usage and Context Sensitive Associations Mining, Intl J. Computational Intelligence and Applications, special issue on Internet intelligent systems, vol. 2, no. 3, pp. 339-348.

11. Nasraoui O, Cardona C, Rojas C, and Gonzalez F (2003), Mining Evolving User Profiles in Noisy Web Clickstream Data with a Scalable Immune System Clustering Algorithm, Proc. Workshop Web Mining as a Premise to Effective and Intelligent Web Applications (WebKDD), pp. 71-81.

12. Nasraoui O and Krishnapuram R (2000), A Novel Approach to Unsupervised Robust Clustering Using Genetic Niching, Proc. Ninth IEEE Intl Conf. Fuzzy Systems (FUZZ 00), pp. 170-175.

13. Nasraoui O and Goswami S (2006), Mining and Validating Localized Frequent Itemsets with Dynamic Tolerance, Proc. Sixth SIAM Intl Conf. Data Mining (SDM 06), pp. 578-582.

14. Oberle D, Berendt B, Hotho A, and Gonzalez J (2003), Conceptual User Tracking, Proc. First Intl Atlantic Web Intelligence Conf. (AWIC).

15. Oberle D, Berendt B, Hotho A, and Gonzalez J (2003), Conceptual User Tracking, Proc. First Intl Atlantic Web Intelligence Conf. (AWIC).

16. Zaïane O.R (2002), Building a Recommender Agent for e-Learning Systems, Proc. Intl Conf. Computers in Education.

17. Srivastava J, Cooley R, Deshpande M, and Tan P.N (2000), Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, SIGKDD Explorations, vol. 1, no. 2, pp. 1-12.

18. Spiliopoulou M and Faulstich L.C (1998), WUM: A Web Utilization Miner, Proc. First Intl Workshop Web and Databases (WebDB).

19. Wang W and Zaïane O.R (2001), Clustering Web Sessions by Sequence Alignment, Proc. Workshop on Web Mining, First Intl SIAM Conf. Data Mining, pages 4150.

20. Yan T, Jacobsen M, Garcia-Molina H, and Dayal U (1996), From User Access Patterns to Dynamic Hypertext Linking, Proc. Fifth Intl World Wide Web Conf.
