

International Journal of Advanced Research in Computer Science and Software Engineering, Volume 6, Issue 7, July 2016, ISSN: 2277 128X. Research Paper. Available online at: www.ijarcsse.com

Real-time Data Cleaning and Semantic Enrichment of Web Server Logs

Seema Nimbalkar
Assistant Professor, Computer Science Department, Abeda Inamdar Senior College, Savitribai Phule Pune University, Pune, Maharashtra, India

Abstract— Web Usage Mining is the application of data mining techniques to discover usage patterns from the Web, in order to better serve the needs of web-based applications. Web usage mining has become the subject of intensive research, as its potential for personalized services, web-site improvement and modification, and usage characterization is recognised. Several pre-processing tasks must be performed before data mining algorithms can be applied to the data collected from server logs, and the successful application of data mining techniques to Web usage data depends heavily on the correct application of these tasks. Data pre-processing is often the most time-consuming and computationally intensive step in Web usage mining. The quantity of Web usage data to be analysed and its low quality are the principal problems in Web Usage Mining, and further research is needed to improve the performance and effectiveness of data pre-processing. The effectiveness and efficiency of the pre-processing step depend largely on the quality of the data in the log files. This paper discusses a technique for real-time data cleaning and semantic enrichment of web logs to improve the quality of the data in log files.

Keywords— Data Pre-processing, Data cleaning, Semantic, Web Usage Mining, Web log, WUM.

I. INTRODUCTION
The World Wide Web continues to grow at an astounding rate, both in the sheer volume of traffic and in the size and complexity of web sites. As a result, it is a challenging task for web masters to understand the needs of users and keep their attention on the web site. Analyzing users' navigational behaviour can help to improve web-site design and performance and to achieve personalization for users. Web Usage Mining applies data mining procedures to analyze user access to web sites. As with any KDD (Knowledge Discovery from Databases) process, Web Usage Mining consists of three main steps: pre-processing, knowledge discovery and analysis. The quantity of web usage data to be analyzed and its low quality are the principal problems in Web Usage Mining. When applied to these data, classic algorithms do not give proper results in terms of the behaviour of the web site's users. According to researchers, two-thirds of data mining analysts consider that data cleaning and data preparation consume more than 60% of the total analysis time [2]. The quality of the data pre-processing results influences the result of pattern discovery and analysis [1]. Thus data pre-processing is particularly important for the whole web usage mining process and is key to its quality.

A. Web Mining
Web mining is the area of Data Mining which deals with the extraction of interesting knowledge from the World Wide Web.

B. Web Content Mining
This part of Web Mining focuses on the raw information available in web pages. The source data mainly consist of the textual data in web pages.

C. Web Structure Mining
This part of Web Mining focuses on the structure of web sites. The source data mainly consist of the structural information in web pages (e.g., links to other pages). Researchers have suggested various heuristics, which still do not give accurate results.

D. Web Usage Mining
This part of Web Mining deals with the extraction of interesting knowledge from the logging information produced by web servers. Web usage mining generally includes the following steps: data collection, data pre-processing, knowledge discovery and pattern analysis.

1) Data collection: The data sources for web usage mining can include web server logs, agent or proxy logs, and browser logs. Web server logs are in the Common Log Format or the Combined Log Format recommended by the World Wide Web Consortium [2, 6].
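For illustration, a single hypothetical entry in the Combined Log Format might look like the line below; the fields are the remote host, identd, authenticated user, timestamp, request line, status code, bytes sent, referer and user agent (all values here are invented for the example):

192.168.1.15 - - [12/Jul/2016:10:23:41 +0530] "GET /tomatocart/index.php HTTP/1.1" 200 5124 "http://www.example.com/" "Mozilla/5.0 (Windows NT 6.1; rv:47.0) Gecko/20100101 Firefox/47.0"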


Fig. 1 Web usage mining

2) Data pre-processing: Data pre-processing consists of data cleaning, user identification, session identification, path completion and transaction identification. Data cleaning involves removing requests for irrelevant resources and can result in more than a 50% reduction in the size of the log file. For web sites such as Yahoo or Amazon, the log files can reach hundreds of GB per hour, so cleaning such files is a complex and time-consuming process. User and session identification are complicated by the existence of proxy servers, browser caches, and security and privacy issues.
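To give a sense of what conventional offline data cleaning involves (the step that the technique proposed in this paper moves into the logging phase itself), the following is a minimal sketch that filters static-resource and error requests out of an existing Combined Log Format file. The file names and filtering rules are illustrative only, not part of the proposed system.

#!/usr/bin/perl
# offline_clean.pl - illustrative sketch of conventional (post-hoc) log cleaning
use strict;
use warnings;

open(my $in,  '<', 'access.log')       or die "Cannot open access.log: $!";
open(my $out, '>', 'access_clean.log') or die "Cannot open access_clean.log: $!";

while (my $line = <$in>) {
    # Extract the requested URI and the response status from the log entry
    next unless $line =~ /"(?:GET|POST|HEAD) (\S+) [^"]*" (\d{3})/;
    my ($uri, $status) = ($1, $2);

    next if $uri =~ /\.(gif|jpg|png|bmp|js|css|ico)$/i;  # drop static resources
    next if $status =~ /^[45]/;                          # drop client/server errors

    print $out $line;                                    # keep the essential entry
}
close $in;
close $out;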

3) Knowledge Discovery & Pattern analysis: Data-mining techniques are applied to the pre-processed data to discover and analyze patterns and obtain the required information.

II. METHODOLOGY
The objective of this paper is to propose a technique for real-time data cleaning and semantic enrichment of web server logs that will help to improve Web usage mining. The paper proposes an improved web log solution that incorporates data cleaning and semantic enrichment of the logs. The new web logging system will perform the following tasks:
1) The irrelevant or redundant web requests will be logged in a separate access log file.
2) The remaining web requests that are essential for web usage mining will be logged into the original access log file.
3) The log entries will be enriched with semantic information about the user actions behind the web requests.
The effectiveness and efficiency of the data pre-processing step are largely dependent on the quality of the data in the log files. So can we improve the quality of the web data that goes into the log file? Can we have a better web server logging system? With such an effective logging mechanism it will be possible to reduce the time and effort spent in data pre-processing, and we can even perform real-time Web usage analysis on the data collected.

III. SYSTEM DESIGN

Fig 2. Proposed Web Log System


A. Data Cleaning & Semantic enrichment
1) Algorithm:
Step 1: Read the input HTTP request data.
Step 2: If the request is for an image, icon or any other multimedia file, log the request in the redundant access log file.
Step 3: If there is a client or server error response for the request (e.g. 404 or 500), log the request in the redundant access log file.
Step 4: If it is an 'internal dummy connection' request, log the request in the redundant access log file.
Step 5: Look up the request URI in the semantic model and get the semantic attributes for the web request.
Step 6: Read the semantic attribute values from the HTTP request and enrich the log entry for the request with the semantic information.
Step 7: Enrich the request log entry with the user session id and the application user id.
Step 8: Repeat the above steps for each HTTP request.
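The redundancy test in Steps 2 to 4 can be summarised as a single predicate. The sketch below is only a compact restatement of those steps; the function name and its arguments are illustrative, and the actual implementation described in Section IV is realised through Apache configuration directives rather than a standalone routine.

# Illustrative restatement of Steps 2-4: decide whether a request is redundant
sub is_redundant {
    my ($uri, $status, $remote_addr, $request_line) = @_;

    # Step 2: requests for images, icons or other static/multimedia files
    return 1 if $uri =~ /\.(gif|jpg|png|bmp|js|css|ico)$/i;

    # Step 3: client or server error responses (4xx / 5xx)
    return 1 if $status =~ /^[45]/;

    # Step 4: Apache 'internal dummy connection' requests
    return 1 if $remote_addr eq '::1' || $request_line =~ /internal dummy connection/i;

    return 0;   # essential request: log it in the original access log
}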

IV. IMPLEMENTATION
The new web log system is an enhancement to the existing logging mechanism in the Apache HTTP Server, an open-source web server platform. As per the design of the web log system, the irrelevant or redundant requests are logged in a separate log file, while the remaining web requests that are essential for Web Usage Mining are logged into the original web log file. The solution is implemented using various Apache server configuration directives and the mod_perl Apache module. The mod_perl module embeds a Perl interpreter in the Apache server and provides complete access to the Apache server API.

A. Access Log Format
The LogFormat for the access.log file is defined in the main Apache server configuration file and is modified as below:

# Added "%{semantics}e" to the LogFormat to include the semantic information of the request in the logs
# Added "%{sid}C" to the LogFormat to include the session id from the 'sid' cookie in the logs
# Added "%{userid}e" to the LogFormat to include the user id in the logs
LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\" \"%{semantics}e\" \"%{sid}C\" \"%{userid}e\"" combined

B. Apache Host configuration file
<Directory />
    PerlOptions +GlobalRequest
    Options +ExecCGI
    Options FollowSymLinks
    AllowOverride None

    # Set the environment variable 'redundant' for HTTP requests with an error response (4xx and 5xx)
    # using the ResponseSetEnvIfPlus directive, as in the configuration snippet below
    ResponseSetEnvIfPlus RESPONSE_STATUS ([4-5][0-9]+[0-9]+) redundant

    # The SemanticEnrichment handler gets the semantic information of the HTTP request
    # and sets it in the request environment. The handler is registered as a PerlInitHandler
    # in the request lifecycle phase
    PerlInitHandler WebLogSystem::SemanticEnrichmentHandler
</Directory>

# Set the environment variable 'redundant' for 'internal dummy connection' log entries of
# internal requests within the server, using the SetEnvIf directive
SetEnvIf Remote_Addr "::1" redundant

# Set the environment variable 'redundant' for HTTP requests for various image and static
# resource types, using the SetEnvIfNoCase directive
SetEnvIfNoCase Request_URI "\.gif$" redundant
SetEnvIfNoCase Request_URI "\.jpg$" redundant
SetEnvIfNoCase Request_URI "\.png$" redundant
SetEnvIfNoCase Request_URI "\.bmp$" redundant
SetEnvIfNoCase Request_URI "\.js$" redundant
SetEnvIfNoCase Request_URI "\.css$" redundant
SetEnvIfNoCase Request_URI "\.ico$" redundant

# No change in the error.log configuration
ErrorLog ${APACHE_LOG_DIR}/error.log

# Log all the essential entries to the original access.log file using conditional logging,
# i.e. only when the 'redundant' environment variable is not set
CustomLog ${APACHE_LOG_DIR}/access.log combined env=!redundant

# Log format for the access_redundant.log file
LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" combined_redundant

# Log all the non-essential entries to the access_redundant.log file using conditional logging,
# i.e. only when the 'redundant' environment variable is set
CustomLog ${APACHE_LOG_DIR}/access_redundant.log combined_redundant env=redundant

C. Semantic Data model
mysql> use mysemanticDB;
mysql> show tables;
+------------------------+
| Tables_in_mysemanticDB |
+------------------------+
| semantic_data_model    |
+------------------------+
mysql> select * from semantic_data_model;
+--------------------------+------------+
| request_url              | semantics  |
+--------------------------+------------+
| /tomatocart/checkout.php | action,pID |
| /tomatocart/json.php     | action,pID |
+--------------------------+------------+

We use TomatoCart, a free and open-source shopping cart e-commerce application. The semantic_data_model table above holds the list of semantic attributes for each of the HTTP request URIs in the TomatoCart application for which we want to extract semantic information from the requests.
In the case of the TomatoCart application, the 'action' attribute will be:
add_product – for a product added to the user's shopping cart
remove_product – for a product removed from the user's shopping cart
success – for checkout and order confirmation by the user
and the 'pID' attribute holds the product id that was added to or removed from the cart.
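As a concrete, hypothetical example of this mapping, a request such as the one below would yield the semantic string that the handler described next writes into the log entry (the product id 27 is invented for illustration):

Request  : /tomatocart/json.php?action=add_product&pID=27
Semantics: action=add_product,pID=27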

D. Semantic Enrichment Perl Handler
1) Apache mod_perl: The Apache mod_perl module integrates Perl with the Apache web server and makes it possible to write Apache modules entirely in Perl. mod_perl embeds a Perl interpreter within each Apache httpd process, giving the ability to write handlers in Perl instead of C, and it provides various handlers that can be hooked into the HTTP request processing phases.
2) mod_perl Handler: We implement a mod_perl handler to enrich the semantic information in the web access log file. The Perl handler is configured in the host configuration file of the Apache web server as a PerlInitHandler. Below is the implementation of the SemanticEnrichmentHandler, which is registered as a PerlInitHandler in the Apache host configuration file.

#file:WebLogSystem/SemanticEnrichmentHandler.pm
#---------------------------------------------
package WebLogSystem::SemanticEnrichmentHandler;

use strict;
use warnings;

use Apache2::RequestRec ();
use Apache2::Request ();
use APR::Request ();
use APR::Request::Apache2 ();    # required for APR::Request::Apache2->handle()
use Apache2::Cookie ();
use Apache2::Const -compile => qw(OK);
use DBI;

sub handler {
    my $r = shift;

    # Load the semantic data model: request URI -> comma-separated semantic attributes
    my $dbh = DBI->connect('dbi:mysql:mysemanticDB', 'root', 'root')
        or die "Connection Error: $DBI::errstr\n";
    my $sql = "select request_url, semantics from semantic_data_model";
    my $sth = $dbh->prepare($sql);
    $sth->execute or die "SQL Error: $DBI::errstr\n";
    my %semanticsMap;
    while (my @row = $sth->fetchrow_array) {
        $semanticsMap{$row[0]} = $row[1];
    }

    # Semantic attributes configured for the requested URI (none for unmodelled URIs)
    my $requestURI = $r->uri;
    my $semanticKeys = $semanticsMap{$requestURI};
    my @semanticKeysArr = defined $semanticKeys ? split(',', $semanticKeys) : ();

    # Collect the request parameters as key-value pairs
    my $req = APR::Request::Apache2->handle($r);
    my @reqParams = $req->param;
    my %params;
    for my $param (@reqParams) {
        $params{$param} = $req->param($param);
    }

    my $semantics;
    my @keys = keys %params;
    my $action;

    # Extract the 'action' attribute from the request parameters
    foreach my $key (@semanticKeysArr) {
        if ($key eq 'action') {
            my $param = $params{$key};
            if (!$param && scalar(@keys) >= 1) {
                # Fall back to the first parameter name when no explicit value is present
                $param = $keys[0];
            }
            if (defined $param && ($param eq 'add_product' || $param eq 'cart_remove' ||
                                   $param eq 'remove_product' || $param eq 'success')) {
                $action = $param;
                $semantics = "action=$action";
            }
            delete $params{$key};
        }
    }

    # For cart actions, also extract the product id (and quantity) attributes
    if ($action && ($action eq 'add_product' || $action eq 'cart_remove' ||
                    $action eq 'remove_product')) {
        foreach my $key (@semanticKeysArr) {
            if ($key eq 'pID' || $key eq 'pQty') {
                my $param = $params{$key};
                if ($param) {
                    $semantics = "$semantics,$key=$param";
                } elsif (scalar(@keys) >= 1) {
                    $semantics = "$semantics,$key=$keys[0]";
                }
                delete $params{$key};
            }
        }
    }

    # Expose the semantic information to the Apache logging routine
    $r->subprocess_env(semantics => $semantics);

    # Read the 'sid' session cookie and look up the application user id
    my $cookiesJar = Apache2::Cookie::Jar->new($r);
    my $sidCookie = $cookiesJar->cookies("sid");
    my $userId;
    if ($sidCookie) {
        my $sessionId = $sidCookie->value();
        my $toc_dbh = DBI->connect('dbi:mysql:tomatocart', 'root', 'root')
            or die "Connection Error: $DBI::errstr\n";
        my $toc_sql = "select customer_id from toc_whos_online where session_id = ?";
        my $toc_sth = $toc_dbh->prepare($toc_sql);
        $toc_sth->execute($sessionId) or die "SQL Error: $DBI::errstr\n";
        if (my @toc_row = $toc_sth->fetchrow_array) {
            $userId = $toc_row[0];
        }
    }
    $r->subprocess_env(userid => $userId);

    return Apache2::Const::OK;
}

1;


The implementation of the above handler can be described as follows:
1) Connect to the mysemanticDB database and read the list of semantic attributes for the request URI from the semantic_data_model table.
2) Read all the request parameters from the HTTP request and store them as key-value pairs in a hash.
3) For each semantic attribute:
   get the value of the semantic attribute from the request parameters;
   add the attribute name-value pair to the semantic info variable.
4) Set the semantic information in a server environment variable named 'semantics'.
5) Read the 'sid' session cookie from the HTTP request and get the user session id.
6) Connect to the 'tomatocart' application database.
7) Get the user id by querying the application database using the session id.
8) Set the user id in the 'userid' environment variable.
The userid and the session id are read by the Apache logging routine and recorded in the access log entry.
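To illustrate how the enriched entries could later be consumed by a mining step, the following minimal sketch extracts the action, product id, session id and user id from each enriched access.log line. The script name and the assumption that the semantics, sid and userid values are the last three quoted fields follow from the LogFormat in Section IV-A; this reader is not part of the proposed logging system itself.

#!/usr/bin/perl
# parse_enriched_log.pl - illustrative reader for the enriched access.log
use strict;
use warnings;

open(my $log, '<', 'access.log') or die "Cannot open access.log: $!";
while (my $line = <$log>) {
    # The enriched format appends "semantics" "sid" "userid" as the last three quoted fields
    my @quoted = $line =~ /"([^"]*)"/g;
    next unless @quoted >= 3;
    my ($semantics, $sid, $userid) = @quoted[-3, -2, -1];
    next if !$semantics || $semantics eq '-';   # skip entries without semantic information

    # Turn "action=add_product,pID=27" into a hash of attribute => value
    my %attrs = map { split(/=/, $_, 2) } split(/,/, $semantics);
    printf "user=%s session=%s action=%s pID=%s\n",
           $userid // '-', $sid // '-', $attrs{action} // '-', $attrs{pID} // '-';
}
close $log;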

V. RESULT
A. TomatoCart application
We use the TomatoCart e-commerce shopping cart application, a free and open-source application, for demonstration purposes in this research. The application has been installed on the Apache web server.
We logged in to the application as the user '[email protected]' and added products to the shopping cart.

Fig 3. TomatoCart – Customer login

We then checked out the products from the cart and completed the order.

Fig 4. TomatoCart – Customer shopping cart


The user successfully completed the order.

Fig 5. TomatoCart – Customer Order

For each of the requests corresponding to the user actions of adding a product and confirming the order, the Perl handler enriches the log entry with the semantic information, namely the action and the product id. The session id and user id are also added to the log entries, as seen in the screenshot of the access.log file below.

Fig 6. TomatoCart – Apache access.log

All the other non-essential requests, such as images, error responses and internal dummy connections, are logged in the access_redundant.log file, as in the screenshot below.

Fig 7. TomatoCart – Apache access_redundant.log


VI. CONCLUSION
Web Usage Mining is a relatively recent research field and corresponds to the process of knowledge discovery from databases (KDD) applied to web usage data. However, the pre-processing step, which is the most complex and time-consuming, has received less attention from the research community. More research on the data pre-processing task is needed to obtain better-quality pre-processing results in optimal time. This paper has discussed a data pre-processing technique based on an effective web server logging system. The new logging mechanism incorporates some of the steps of data pre-processing and thus helps to bring down the time required for data pre-processing.

ACKNOWLEDGEMENT
I would like to take this opportunity to express my profound gratitude and deep regard to my guide, Prof. Dr. Suhas Patil, for his exemplary guidance, valuable feedback and constant encouragement throughout the duration of my research. His valuable suggestions were of immense help throughout my research work, and his perceptive criticism kept me working to make this research much better. Working under him was an extremely enriching experience for me.

REFERENCES
[1] Li Chaofeng, "Research and Development of Data Preprocessing in Web Usage Mining", International Conference on Management Science and Engineering, 2006.
[2] Doru Tanasa and Brigitte Trousse, "Advanced Data Preprocessing for Intersites Web Usage Mining", IEEE Computer Society, March/April 2004.
[3] Robert Cooley, Bamshad Mobasher and Jaideep Srivastava, "Data Preparation for Mining World Wide Web Browsing Patterns".
[4] Myra Spiliopoulou, Bamshad Mobasher, Bettina Berendt and Miki Nakagawa, "A Framework for the Evaluation of Session Reconstruction Heuristics in Web Usage Analysis".
[5] Yan Li, Bo-qin Feng and Yan Li, "The Construction of Transactions for Web Usage Mining", IEEE Computer Society, 2009.
[6] K. R. Suneetha and Dr. R. Krishnamoorthi, "Identifying User Behavior by Analyzing Web Server Access Log File", International Journal of Computer Science and Network Security, Vol. 9, No. 4, April 2009.
[7] Ms. Dipa Dixit and Mr. Jayant Gadge, "Automatic Recommendation for Online Users Using Web Usage Mining", International Journal of Managing Information Technology (IJMIT), Vol. 2, No. 3, August 2010.
[8] V. Sathiyamoorthi and Dr. V. Murali Bhaskaran, "Data Preparation Techniques for Web Usage Mining in World Wide Web – An Approach", International Journal of Recent Trends in Engineering, Vol. 2, No. 4, November 2009.
[9] Yan Li, Boqin Feng and Qinjiao Mao, "Research on Path Completion Technique in Web Usage Mining", International Symposium on Computer Science and Computational Technology, 2008.
[10] Bettina Berendt, "Semantic Web Usage Mining – Overview and Case Studies", Humboldt University, Germany.
[11] W3C Common Log Format. http://www.w3.org/Daemon/User/Config/Logging.html#common-logfile-format
[12] W3C Extended Log Format. http://www.w3.org/TR/WD-logfile.html
[13] P. Nithya and Dr. P. Sumathi, "Novel Pre-Processing Technique for Web Log Mining by Removing Global Noise and Web Robots", IEEE Conference Publications, 2012.
[14] Chintan R. Varnagar, Nirali N. Madhak, Trupti M. Kodinariya and Jayesh N. Rathod, "Web Usage Mining: A Review on Process, Methods and Techniques", IEEE Conference Publications, pp. 40-46, 2013.
[15] Ma Shuyue, Liu Wencai and Wang Shuo, "The Study on the Preprocessing in Web Log Mining", IEEE Conference Publications, pp. 1-5, 2011.
[16] Liu Jian and Wang Yan-Qing, "Web Log Data Mining Based on Association Rule", Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp. 1855-1859, 2011.
[17] Izwan Nizal Mohd Shaharanee, Fedja Hadzic and Tharam S. Dillon, "Interestingness Measures for Association Rules Based on Statistical Validity", Knowledge-Based Systems, pp. 386-392, 2011.
[18] P. Nithya and Dr. P. Sumathi, "Novel Pre-Processing Technique for Web Log Mining by Removing Global Noise and Web Robots", IEEE Conference Publications, 2012.
[19] Chintan R. Varnagar, Nirali N. Madhak, Trupti M. Kodinariya and Jayesh N. Rathod, "Web Usage Mining: A Review on Process, Methods and Techniques", IEEE Conference Publications, pp. 40-46, 2013.
[20] Johannes K. Chiang and Rui-Han Yang, "Multidimensional Data Mining for Discovering Association Rules in Various Granularities", IEEE Conference Publications, pp. 1-6, 2013.