International Journal of Advanced Research in Computer Science and Software Engineering
Research Paper, Volume 6, Issue 7, July 2016, ISSN: 2277 128X
Available online at: www.ijarcsse.com
© 2016, IJARCSSE All Rights Reserved Page | 283
Real-time Data Cleaning and Semantic Enrichment of Web Server Logs
Seema Nimbalkar
Assistant Professor, Computer Science Department, Abeda Inamdar Senior College,
Savitribai Phule Pune University, Pune, Maharashtra, India
Abstract— Web usage mining is the application of data mining techniques to discover usage patterns from the Web, in order to better serve the needs of web-based applications. Web usage mining has become the subject of intensive research, as its potential for personalized services, web-site improvement and modification, and usage characterization is recognised. Several pre-processing tasks must be performed before data mining algorithms can be applied to the data collected from server logs, and the successful application of data mining techniques to Web usage data depends heavily on the correct application of these pre-processing tasks. Data pre-processing is often the most time-consuming and computationally intensive step in Web usage mining, and the quantity of Web usage data to be analysed and its low quality are the principal problems in the field. Further research is needed to improve the performance and effectiveness of data pre-processing in Web usage mining. The effectiveness and efficiency of the pre-processing step depend largely on the quality of the data in the log files. This paper discusses a technique of real-time data cleaning and semantic enrichment of web logs to improve the quality of the data in log files.
Keywords— Data Pre-processing, Data cleaning, Semantic, Web Usage Mining, Web log, WUM.
I. INTRODUCTION
The World Wide Web continues to grow at an astounding rate, both in the sheer volume of traffic and in the size and complexity of web sites. As a result, it is a challenging task for web masters to understand the needs of users and keep their attention on the web site. Analyzing users' navigational behaviour can help to improve web-site design and performance and to achieve personalization for users. Web Usage Mining applies data mining procedures to analyze user access to web sites. As with any KDD (Knowledge Discovery from Databases) process, Web Usage Mining consists of three main steps: pre-processing, knowledge discovery and analysis. The quantity of web usage data to be analyzed and its low quality are the principal problems in Web Usage Mining. When applied to these data, classic algorithms do not give proper results in terms of the behaviour of a web site's users. According to researchers, two-thirds of data mining analysts consider that data cleaning and data preparation consume more than 60% of the total analysis time [2]. The quality of the results of data pre-processing influences the results of pattern discovery and analysis [1]. Thus data pre-processing is particularly important for the whole web usage mining process and is the key to its quality.
A. Web Mining
Web mining is an area of Data Mining which deals with the extraction of interesting knowledge from the World Wide Web.
B. Web Content Mining
It is the part of Web Mining which focuses on the raw information available in web pages. The source data mainly consist of textual data in web pages.
C. Web Structure Mining
It is the part of Web Mining which focuses on the structure of web sites. The source data mainly consist of the structural information in web pages (e.g., links to other pages). Researchers have suggested various heuristics, which still do not give accurate results.
D. Web Usage Mining
It is the part of Web Mining which deals with the extraction of interesting knowledge from the logging information produced by web servers. Web usage mining generally includes the following steps: data collection, data pre-processing, knowledge discovery and pattern analysis.
1) Data collection: The data sources for web usage mining can include web server logs, agent or proxy logs, and browser logs. Web server logs are in the Common Log Format or the Combined Log Format recommended by the World Wide Web Consortium [2, 6].
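As a hedged illustration (not part of the paper's implementation), a Common Log Format entry can be parsed with a short Python sketch; the sample log line below is hypothetical:

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)$'
)

def parse_clf(line):
    """Parse one Common Log Format line into a dict of fields, or return None."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None

# Hypothetical sample entry
entry = parse_clf('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
                  '"GET /apache_pb.gif HTTP/1.0" 200 2326')
print(entry['status'])   # -> 200 (as a string)
print(entry['request'])  # -> GET /apache_pb.gif HTTP/1.0
```

Each group in the pattern corresponds to one field of the format referenced in [11]; the Combined Log Format adds quoted Referer and User-Agent fields at the end.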
Nimbalkar, International Journal of Advanced Research in Computer Science and Software Engineering 6(7), July 2016, pp. 283-290
Fig. 1 Web usage mining
2) Data pre-processing: Data pre-processing consists of data cleaning, user identification, session identification, path completion and transaction identification. Data cleaning involves removing requests for irrelevant resources and can reduce the size of a log file by more than 50%. For web sites such as Yahoo or Amazon, log files can reach hundreds of GB per hour, so cleaning such files is a complex and time-consuming process. User and session identification are complicated by proxy servers, browser caches, and security and privacy issues.
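A common session identification heuristic from the web usage mining literature (not part of this paper's implementation) groups requests by (IP, user agent) and starts a new session after a 30-minute inactivity gap. An illustrative Python sketch, with hypothetical request tuples:

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # common 30-minute inactivity threshold, in seconds

def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Split (ip, user_agent, timestamp) tuples into sessions.

    A new session starts whenever the gap between consecutive requests
    from the same (ip, user_agent) pair exceeds the timeout.
    """
    by_user = defaultdict(list)
    for ip, agent, ts in sorted(requests, key=lambda r: r[2]):
        by_user[(ip, agent)].append(ts)

    sessions = []
    for user, times in by_user.items():
        current = [times[0]]
        for ts in times[1:]:
            if ts - current[-1] > timeout:
                sessions.append((user, current))
                current = []
            current.append(ts)
        sessions.append((user, current))
    return sessions

reqs = [('1.2.3.4', 'Firefox', 0), ('1.2.3.4', 'Firefox', 600),
        ('1.2.3.4', 'Firefox', 4000)]  # 4000 - 600 > 1800, so a new session begins
print(len(sessionize(reqs)))  # -> 2
```

This timeout heuristic is only an approximation, which is why session identification remains complicated in the presence of proxies and caches.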
3) Knowledge Discovery & Pattern Analysis: Data-mining techniques are applied to the pre-processed data to discover and analyze patterns and extract the required information.
II. METHODOLOGY
The objective of this paper is to propose a technique for real-time data cleaning and semantic enrichment of web server logs that will help to improve Web usage mining. The paper proposes an improved Web log solution that incorporates data cleaning and semantic enrichment of logs. The new Web logging system will perform the following tasks:
1) Irrelevant or redundant web requests will be logged in a separate access log file.
2) The remaining web requests, which are essential for web usage mining, will be logged into the original access log file.
3) The log entries will be enriched with semantic information about the user actions corresponding to the web requests.
The effectiveness and efficiency of the data pre-processing step depend largely on the quality of the data in the log files. So can we improve the quality of the web data that goes into the log file? Can we have a better web server logging system? With such an effective logging mechanism it becomes possible to reduce the time and effort spent on data pre-processing, and even to perform real-time Web usage analysis on the collected data.
III. SYSTEM DESIGN
Fig 2. Proposed Web Log System
A. Data Cleaning & Semantic Enrichment
1) Algorithm:
Step 1: Read the input HTTP request data.
Step 2: If the request is for an image, icon or any multimedia file, log the request in the redundant access log file.
Step 3: If there is a client or server error response for the request (e.g., 404 or 500), log the request in the redundant access log file.
Step 4: If it is an 'internal dummy connection' request, log the request in the redundant access log file.
Step 5: Look up the request URI in the semantic model and get the semantic attributes for the web request.
Step 6: Read the semantic attribute values from the HTTP request and enrich the log entry for the request with the semantic information.
Step 7: Enrich the request log entry with the user session id and the application user id.
Step 8: Repeat the above steps for each HTTP request.
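The cleaning decision in Steps 2-4 can be sketched in Python (illustrative only; the actual implementation described below uses Apache configuration directives and mod_perl):

```python
# Extensions treated as irrelevant for usage mining (Step 2)
REDUNDANT_EXTENSIONS = ('.gif', '.jpg', '.png', '.bmp', '.js', '.css', '.ico')

def is_redundant(uri, status, user_agent):
    """Return True when a request should go to the redundant access log."""
    # Step 2: image, icon and other static multimedia resources
    if uri.lower().endswith(REDUNDANT_EXTENSIONS):
        return True
    # Step 3: client (4xx) and server (5xx) error responses
    if 400 <= status <= 599:
        return True
    # Step 4: Apache's own internal dummy connection requests
    if 'internal dummy connection' in user_agent.lower():
        return True
    return False

print(is_redundant('/img/logo.PNG', 200, 'Mozilla'))        # -> True
print(is_redundant('/tomatocart/json.php', 200, 'Mozilla'))  # -> False
print(is_redundant('/missing.php', 404, 'Mozilla'))          # -> True
```

Requests for which `is_redundant` is false would then continue to the semantic-enrichment steps (Steps 5-7).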
IV. IMPLEMENTATION
The new Web log system will be an enhancement to the existing logging mechanism in the Apache HTTP Server, an open-source web server platform. As per the design of the web log system, irrelevant or redundant requests will be logged in a separate log file, while the remaining web requests, which are essential for Web Usage Mining, will be logged into the main web log file. The solution will be implemented using various Apache server configuration directives and the mod_perl Apache module. The mod_perl module embeds a Perl interpreter in the Apache server and provides complete access to the Apache server API.
A. Access Log Format
The LogFormat for the access.log file is defined in the main Apache server configuration file and is modified as below -
# Added "%{semantics}e" to the LogFormat to include the semantic information of the request in the logs
# Added "%{sid}C" to the LogFormat to include the session id from the 'sid' cookie in the logs
# Added "%{userid}e" to the LogFormat to include the user id in the logs
LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\" \"%{semantics}e\" \"%{sid}C\" \"%{userid}e\"" combined
B. Apache Host configuration file
<Directory />
    PerlOptions +GlobalRequest
    Options +ExecCGI
    Options FollowSymLinks
    AllowOverride None
    # Set the environment variable to 'redundant' for HTTP requests with an error response (4xx and 5xx)
    # using the ResponseSetEnvIfPlus directive as in the configuration snippet below -
    ResponseSetEnvIfPlus RESPONSE_STATUS ([45][0-9][0-9]) redundant
    # The SemanticEnrichment handler gets the semantic information of the HTTP request
    # and sets it in the request environment. The handler is registered as a PerlInitHandler
    # in the request lifecycle phase
    PerlInitHandler WebLogSystem::SemanticEnrichmentHandler
</Directory>
# Set the environment variable to 'redundant' for the 'internal dummy connection' log entries of internal
# requests within the server, using the SetEnvIf directive as below -
SetEnvIf Remote_Addr "::1" redundant
# Set the environment variable to 'redundant' for HTTP requests for various image types using the
# SetEnvIfNoCase directive as below -
SetEnvIfNoCase Request_URI "\.gif$" redundant
SetEnvIfNoCase Request_URI "\.jpg$" redundant
SetEnvIfNoCase Request_URI "\.png$" redundant
SetEnvIfNoCase Request_URI "\.bmp$" redundant
SetEnvIfNoCase Request_URI "\.js$" redundant
SetEnvIfNoCase Request_URI "\.css$" redundant
SetEnvIfNoCase Request_URI "\.ico$" redundant
# No change in the error.log configuration
ErrorLog ${APACHE_LOG_DIR}/error.log
# Log all the essential log entries to the original access.log file using conditional logging, i.e.
# log only when the 'redundant' environment variable is not set
CustomLog ${APACHE_LOG_DIR}/access.log combined env=!redundant
# Log all the non-essential log entries to the access_redundant.log file using conditional logging, i.e.
# log only when the 'redundant' environment variable is set
CustomLog ${APACHE_LOG_DIR}/access_redundant.log combined_redundant env=redundant
# Log format for the access_redundant.log file
LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" combined_redundant
C. Semantic Data model
mysql> use mysemanticDB;
mysql> show tables;
+------------------------+
| Tables_in_mysemanticDB |
+------------------------+
| semantic_data_model |
+------------------------+
mysql> select * from semantic_data_model;
+--------------------------+------------+
| request_url | semantics |
+--------------------------+------------+
| /tomatocart/checkout.php | action,pID |
| /tomatocart/json.php | action,pID |
+--------------------------+------------+
We are using TomatoCart, a free and open-source shopping cart e-commerce application. The semantic_data_model table above holds the list of semantic attributes for each of the HTTP request URIs in the TomatoCart application for which we want to extract semantic information from the requests.
In the TomatoCart application the 'action' attribute takes the following values:
add_product – a product was added to the user's shopping cart
remove_product – a product was removed from the user's shopping cart
success – the user checked out and confirmed the order
The 'pID' attribute carries the productId of the product added to or removed from the cart.
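For illustration, the enrichment output implied by the table and attributes above can be sketched in Python (the URI and attribute names mirror the paper's semantic model; this is not the actual handler, which is implemented in Perl below):

```python
# Semantic model, as in the semantic_data_model table above
SEMANTIC_MODEL = {
    '/tomatocart/checkout.php': ['action', 'pID'],
    '/tomatocart/json.php': ['action', 'pID'],
}
# Cart-related values of the 'action' attribute
CART_ACTIONS = {'add_product', 'remove_product', 'success'}

def enrich(uri, params):
    """Build the 'semantics' log field, e.g. 'action=add_product,pID=27'."""
    keys = SEMANTIC_MODEL.get(uri)
    if not keys:
        return ''  # URI not in the semantic model: nothing to enrich
    pairs = []
    for key in keys:
        value = params.get(key)
        if key == 'action' and value not in CART_ACTIONS:
            return ''  # not a cart-related action: nothing to enrich
        if value is not None:
            pairs.append(f'{key}={value}')
    return ','.join(pairs)

print(enrich('/tomatocart/json.php', {'action': 'add_product', 'pID': '27'}))
# -> action=add_product,pID=27
print(enrich('/index.php', {'x': '1'}))  # -> '' (URI not in the model)
```

The resulting string is what would appear in the "%{semantics}e" field of the extended access log format.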
D. Semantic Enrichment Perl Handler
1) Apache mod_perl: The Apache mod_perl module integrates Perl with the Apache web server, making it possible to write Apache modules entirely in Perl. mod_perl embeds a Perl interpreter within each Apache httpd process, giving the ability to write handlers in Perl instead of C, and provides various handlers that can be hooked into the HTTP request processing phases.
2) mod_perl Handler: We implement a mod_perl handler to enrich the semantic information in the web access log file. The Perl handler is configured in the host configuration file of the Apache web server as a PerlInitHandler. Below is the implementation of the SemanticEnrichmentHandler registered as a PerlInitHandler in the Apache host configuration file.
#file:WebLogSystem/SemanticEnrichmentHandler.pm
#----------------------------------------------
package WebLogSystem::SemanticEnrichmentHandler;

use strict;
use warnings;

use Apache2::RequestRec ();
use Apache2::Request ();
use APR::Request::Apache2 ();
use Apache2::Cookie ();
use Apache2::Const -compile => qw(OK);
use DBI;

sub handler {
    my $r = shift;

    # Load the semantic attribute list for each request URI from the semantic model
    my $dbh = DBI->connect('dbi:mysql:mysemanticDB', 'root', 'root')
        or die "Connection Error: $DBI::errstr\n";
    my $sth = $dbh->prepare('select request_url, semantics from semantic_data_model');
    $sth->execute or die "SQL Error: $DBI::errstr\n";

    my %semanticsMap;
    while (my @row = $sth->fetchrow_array) {
        $semanticsMap{$row[0]} = $row[1];
    }

    my $requestURI = $r->uri;
    my $semanticKeys = $semanticsMap{$requestURI} || '';
    my @semanticKeysArr = split(',', $semanticKeys);

    # Read all the request parameters into a hash
    my $req = APR::Request::Apache2->handle($r);
    my %params;
    for my $param ($req->param) {
        $params{$param} = $req->param($param);
    }

    my $semantics = '';
    my @keys = keys %params;
    my $action;
    foreach my $key (@semanticKeysArr) {
        next unless $key eq 'action';
        my $param = $params{$key};
        # Fall back to the first parameter name when there is no 'action' parameter
        $param = $keys[0] if !$param && @keys;
        if ($param && ($param eq 'add_product' || $param eq 'cart_remove'
                || $param eq 'remove_product' || $param eq 'success')) {
            $action = $param;
            $semantics = "action=$action";
        }
        delete $params{$key};
    }

    # For cart add/remove actions, also record the product id and quantity
    if ($action && ($action eq 'add_product' || $action eq 'cart_remove'
            || $action eq 'remove_product')) {
        foreach my $key (@semanticKeysArr) {
            next unless $key eq 'pID' || $key eq 'pQty';
            my $param = $params{$key};
            if ($param) {
                $semantics = "$semantics,$key=$param";
            } elsif (@keys) {
                $semantics = "$semantics,$key=$keys[0]";
            }
            delete $params{$key};
        }
    }
    $r->subprocess_env(semantics => $semantics);

    # Map the 'sid' session cookie to the application user id
    my $userId = '';
    my $cookieJar = Apache2::Cookie::Jar->new($r);
    my $sidCookie = $cookieJar->cookies('sid');
    if ($sidCookie) {
        my $sessionId = $sidCookie->value();
        my $toc_dbh = DBI->connect('dbi:mysql:tomatocart', 'root', 'root')
            or die "Connection Error: $DBI::errstr\n";
        # Use a bind parameter to avoid SQL injection through the session id
        my $toc_sth = $toc_dbh->prepare(
            'select customer_id from toc_whos_online where session_id = ?');
        $toc_sth->execute($sessionId) or die "SQL Error: $DBI::errstr\n";
        if (my @toc_row = $toc_sth->fetchrow_array) {
            $userId = $toc_row[0];
        }
    }
    $r->subprocess_env(userid => $userId);

    return Apache2::Const::OK;
}
1;
The implementation of the above handler can be described as follows:
1) Connect to the mysemanticDB database and read the list of semantic attributes for the request URI from the semantic_data_model table.
2) Read all the request parameters from the HTTP request and store them as key-value pairs in a hash.
3) For each semantic attribute, get the value of the attribute from the request parameters and add the attribute name-value pair to the semantic info variable.
4) Set the semantic information in a server environment variable named 'semantics'.
5) Read the 'sid' session cookie from the HTTP request to get the user session id.
6) Connect to the 'tomatocart' application database.
7) Get the userId by querying the application database using the session id.
8) Set the user id in the 'userid' environment variable.
The userid and the session id are read by the Apache logging routine and recorded in the access log entry.
V. RESULT
A. TomatoCart application
We are using the TomatoCart e-commerce shopping cart application, a free and open-source application, for demonstration purposes in this research. The application has been installed in the Apache web server.
We logged in to the application as the user '[email protected]' and added products to the shopping cart.
Fig 3. TomatoCart – Customer login
We then checked out the products from the cart and completed the order.
Fig 4. TomatoCart – Customer shopping cart
The user successfully completed the order.
Fig 5. Tomatocart – Customer Order
For each of the requests corresponding to the user actions of adding a product and confirming the order, the Perl handler enriches the log entry with the semantic information, namely the action and product id. The session id and user id are also added to the log entries, as seen in the screenshot of the access.log file below -
Fig 6. TomatoCart – Apache access.log
All the other non-essential requests, such as images, error responses and internal dummy connections, are logged in the access_redundant.log file, as in the screenshot below -
Fig 7. TomatoCart – Apache redundant access.log
VI. CONCLUSION
Web usage mining is a rather recent research field corresponding to the process of knowledge discovery from databases (KDD) applied to web usage data. However, the pre-processing step, which is the most complex and time-consuming, has received less attention from the research community. More research is needed on the data pre-processing task to obtain better-quality pre-processing results in optimal time. This paper discusses a data pre-processing technique based on an effective web server logging system. The new logging mechanism incorporates some of the steps of data pre-processing and thus helps to bring down the time required for data pre-processing.
ACKNOWLEDGEMENT
I would like to take this opportunity to express my profound gratitude and deep regard to my guide, Prof. Dr. Suhas Patil, for his exemplary guidance, valuable feedback and constant encouragement throughout the duration of my research. His valuable suggestions were of immense help throughout my research work, and his perceptive criticism kept me working to make this research much better. Working under him was an extremely enriching experience for me.
REFERENCES
[1] Li Chaofeng, "Research and Development of Data Preprocessing in Web Usage Mining", International Conference on Management Science and Engineering, 2006.
[2] Doru Tanasa and Brigitte Trousse, "Advanced Data Preprocessing for Intersites Web Usage Mining", IEEE Computer Society, March/April 2004.
[3] Robert Cooley, Bamshad Mobasher and Jaideep Srivastava, "Data Preparation for Mining World Wide Web Browsing Patterns".
[4] Myra Spiliopoulou, Bamshad Mobasher, Bettina Berendt and Miki Nakagawa, "A Framework for the Evaluation of Session Reconstruction Heuristics in Web Usage Analysis".
[5] Yan Li, Bo-qin Feng and Yan Li, "The Construction of Transactions for Web Usage Mining", IEEE Computer Society, 2009.
[6] K. R. Suneetha and Dr. R. Krishnamoorthi, "Identifying User Behavior by Analyzing Web Server Access Log File", International Journal of Computer Science and Network Security, Vol. 9, No. 4, April 2009.
[7] Dipa Dixit and Jayant Gadge, "Automatic Recommendation for Online Users Using Web Usage Mining", International Journal of Managing Information Technology (IJMIT), Vol. 2, No. 3, August 2010.
[8] V. Sathiyamoorthi and Dr. V. Murali Bhaskaran, "Data Preparation Techniques for Web Usage Mining in World Wide Web – An Approach", International Journal of Recent Trends in Engineering, Vol. 2, No. 4, Nov 2009.
[9] Yan Li, Boqin Feng and Qinjiao Mao, "Research on Path Completion Technique in Web Usage Mining", International Symposium on Computer Science and Computational Technology, 2008.
[10] Bettina Berendt, "Semantic Web Usage Mining – Overview and Case Studies", Humboldt University, Germany.
[11] W3C Common Log Format. http://www.w3.org/Daemon/User/Config/Logging.html#common-logfile-format
[12] W3C Extended Log Format. http://www.w3.org/TR/WD-logfile.html
[13] P. Nithya and Dr. P. Sumathi, "Novel Pre-Processing Technique for Web Log Mining by Removing Global Noise and Web Robots", IEEE Conference Publications, 2012.
[14] Chintan R. Varnagar, Nirali N. Madhak, Trupti M. Kodinariya and Jayesh N. Rathod, "Web Usage Mining: A Review on Process, Methods and Techniques", IEEE Conference Publications, pp. 40-46, 2013.
[15] Ma Shuyue, Liu Wencai and Wang Shuo, "The Study on the Preprocessing in Web Log Mining", IEEE Conference Publications, pp. 1-5, 2011.
[16] Liu Jian and Wang Yan-Qing, "Web Log Data Mining Based on Association Rule", Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp. 1855-1859, 2011.
[17] Izwan Nizal Mohd Shaharanee, Fedja Hadzic and Tharam S. Dillon, "Interestingness Measures for Association Rules Based on Statistical Validity", Knowledge-Based Systems, pp. 386-392, 2011.
[18] P. Nithya and Dr. P. Sumathi, "Novel Pre-Processing Technique for Web Log Mining by Removing Global Noise and Web Robots", IEEE Conference Publications, 2012.
[19] Chintan R. Varnagar, Nirali N. Madhak, Trupti M. Kodinariya and Jayesh N. Rathod, "Web Usage Mining: A Review on Process, Methods and Techniques", IEEE Conference Publications, pp. 40-46, 2013.
[20] Johannes K. Chiang and Rui-Han Yang, "Multidimensional Data Mining for Discovering Association Rules in Various Granularities", IEEE Conference Publications, pp. 1-6, 2013.