
Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

S Saravanan, B Uma Maheswari
Department of Computer Science and Engineering,
Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Bangalore, India.
[email protected], [email protected]

Abstract

Analysing web log files has become an important task for E-Commerce companies that want to predict their customers' behaviour and improve their business. Each click on an E-Commerce web page creates roughly 100 bytes of data. Large E-Commerce websites such as flipkart.com, amazon.in and ebay.in are visited by millions of customers simultaneously, and these customers generate petabytes of data in the web log files. Because the web log files are so large, both parallel processing and a reliable data storage system are required to process them, and the Hadoop framework provides both. Hadoop offers the Hadoop Distributed File System (HDFS) and the MapReduce programming model for processing huge datasets efficiently and effectively. In this paper, a NASA web log file is analysed with the Hadoop framework: the total number of hits received by each web page of the website and the total number of hits received by the website in each hour are calculated, and it is shown that the Hadoop framework produces accurate results with a short response time.

Keywords - Hadoop, MapReduce, Log Files, Parallel Processing, Hadoop Distributed File System, E-Commerce

1. Introduction

E-Commerce is a rapidly growing industry all over the world. The biggest challenge for most E-Commerce businesses is to collect, store, analyse and organize data from multiple data sources. There is certainly a lot of data waiting to be analysed, and it is a daunting task for some E-Commerce businesses to make sense of it all [1]. One kind of data that has to be analysed in an E-Commerce business is the web log file. A web log file contains the following details: the IP address of the computer making the request (i.e. the visitor), the date and time of the hit, the request method, the location and name of the requested file, the HTTP status code, the size of the requested file, and so on. Mining the web log file, known as Web Usage Mining, helps E-Commerce companies increase their profits because it allows them to predict the behaviour of their online customers. Using these predictions, E-Commerce companies can offer an online customer a personalized experience, including content and promotions, and can provide product recommendations based on the customer's browsing behaviour. E-Commerce companies can do a lot more by mining the web log file.

As the number of customers visiting E-Commerce web sites increases, the size of the web log file also increases, and nowadays web log files reach petabytes in size. Pattern-discovery data mining techniques are already available to analyse web log files, but these techniques store the web log file in a traditional DBMS and analyse it there. In the current scenario, the number of online customers increases day by day, and each click on a web page creates on the order of 100 bytes of data in a typical website log file [2]. Consequently, large websites handling millions of simultaneous visitors generate enormous volumes of log data every day; eBay, for example, processes petabytes of web log data to create a better shopping experience. To analyse such large web log files efficiently and effectively, we need to develop faster, parallel and scalable data mining algorithms. We also need a cluster of storage devices to hold petabytes of web log data and a parallel computing model to analyse them. The Hadoop framework provides reliable cluster storage that keeps large web log files in a distributed manner, and a parallel processing facility that processes them efficiently and effectively. The remainder of this paper is organized as follows.



Section 2 summarizes the related work, Section 3 discusses the system architecture, Section 4 presents the proposed scheme, Section 5 reports the experimental results, and Section 6 concludes the paper.

2. Related work

In [3], SQL DBMSs and Hadoop MapReduce are compared, and it is suggested that Hadoop MapReduce performs better than the SQL DBMS. In [4], it is noted that a traditional DBMS cannot handle very large datasets, so Big Data technologies such as the Hadoop framework are needed. Hadoop MapReduce [4][5][6] is used in many areas for big data analysis, and Hadoop is a good platform for analysing web log files, whose size keeps increasing [7][8]. Apache Hadoop is an open-source project created by Doug Cutting and developed by the Apache Software Foundation. The Hadoop platform allows large-scale data to be stored across thousands of nodes and analysed there. As described in [5], a Hadoop cluster generally has thousands of nodes which store multiple blocks of log files. Hadoop fragments log files into blocks, distributes these blocks evenly over the nodes of the cluster, and replicates them across multiple nodes to achieve reliability and fault tolerance. MapReduce achieves parallel computation by breaking the analysis job into a number of tasks.

3. System architecture

Figure 1. Two-node Hadoop cluster system architecture: the master node (192.168.2.1) and the slave node (192.168.2.2) each run a TaskTracker and a DataNode, while the master node also hosts the JobTracker and the NameNode; the JobTracker and TaskTrackers form the MapReduce layer, and the NameNode and DataNodes form the HDFS layer.

Figure 1 shows the configuration of the Hadoop cluster implemented in this paper. There are two nodes in the cluster: one master node and one slave node. The architecture is divided into two layers, the HDFS layer and the MapReduce layer. The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage and is designed to span large clusters of commodity servers [9]. The MapReduce layer reads data from and writes data to HDFS storage and processes the data in parallel. The NameNode keeps track of how the web log file is broken into blocks and of which nodes store those blocks. The Secondary NameNode periodically reads the HDFS file system change log and applies it to the fsimage file. The DataNodes store the replicated blocks of the web log file. The JobTracker determines the execution plan by deciding which files to process, assigns nodes to different tasks, and keeps track of all tasks as they run. The TaskTracker is responsible for executing the individual tasks on each slave node.
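The paper does not show how the log file is placed into this architecture, so the following is only a minimal illustrative sketch: a Java client pointed at the master node copies the NASA access log into HDFS. The Hadoop 1.x configuration keys are real, but the ports (9000 for HDFS, 9001 for the JobTracker) and the local and HDFS paths are assumptions, not details from the paper.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadWebLogIntoHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hadoop 1.x client settings; the master node address comes from Figure 1,
        // the ports are assumed defaults.
        conf.set("fs.default.name", "hdfs://192.168.2.1:9000");
        conf.set("mapred.job.tracker", "192.168.2.1:9001");

        FileSystem fs = FileSystem.get(conf);
        // Copy the NASA access log from local disk into HDFS (hypothetical paths).
        fs.copyFromLocalFile(new Path("/home/hduser/NASA_access_log_Jul95"),
                             new Path("/user/hduser/weblog/"));
        fs.close();
    }
}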

4. Proposed scheme

4.1. Calculating the total number of hits received by each URL

Figure 2. Calculating the total number of hits received by each URL
[Figure 2 data flow: the input web log file is split into blocks (Block 1, Block 2, ..., Block N); each map task emits (URL, 1) pairs; the shuffle phase groups the pairs by URL; the reduce task sums each group and outputs (URL1, Sum), (URL2, Sum), ..., (URLn, Sum).]


Figure 2 depicts the MapReduce processing of the web log file to calculate the total number of hits received by each URL. The input to this job is the web log file. For each hit on the web site, a line is appended to the web log file. Each line contains the following fields: client IP address, user name, server name, date, time, request method, requested resource, HTTP version, HTTP status and bytes sent. An example line from the NASA web log file is: "in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0 200 1839". The web log file is split into blocks by the Hadoop framework and stored on the two-node cluster. Each block of the web log file is given as input to a map function, which parses each line using a regular expression and emits the URL as the key together with the value 1: (URL1,1), (URL2,1), (URL3,1), ..., (URLn,1). After mapping, the shuffle phase collects all (key, value) pairs that share the same URL from the different map functions and forms groups: Group 1 contains (URL1,1), (URL1,1), (URL1,1) and so on, Group 2 contains (URL2,1), (URL2,1) and so on. The reduce function then calculates the sum for each URL group, producing (URL1,SUM), (URL2,SUM), ..., (URLn,SUM).
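The paper does not list its MapReduce source code, so the following is a minimal sketch of how the map and reduce functions described above could look against the Hadoop 1.2.1 mapreduce API. It assumes the NASA log lines follow the Common Log Format; the class names, the regular expression and the choice of capture group are illustrative assumptions rather than the authors' implementation.

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: parses one log line and emits (URL, 1).
public class UrlHitMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Common Log Format: host ident user [timestamp] "method resource protocol" status bytes
    private static final Pattern LOG_LINE = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+)[^\"]*\" (\\d{3}) (\\S+)");

    private static final IntWritable ONE = new IntWritable(1);
    private final Text url = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Matcher m = LOG_LINE.matcher(value.toString());
        if (m.find()) {                  // skip malformed lines
            url.set(m.group(6));         // group 6 = requested resource (the URL)
            context.write(url, ONE);     // emit (URL, 1)
        }
    }
}

// Reducer: sums the 1s of each URL group and emits (URL, Sum).
class UrlHitReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}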

4.2. Calculating the total number of hits received by a website in each hour

Figure 3. Calculating the total number of hits received in every hour
[Figure 3 data flow: as in Figure 2, the web log file is split into blocks; each map task emits (hour, 1) pairs; the shuffle phase groups the pairs by hour; the reduce task sums each group and outputs (hour0, Sum), (hour1, Sum), ..., (hour23, Sum).]

Figure 3 depicts the MapReduce processing of the web log file to calculate the total number of hits received in every hour. The input to this job is again the web log file, which is split into blocks. Each block is given as input to a map function, which parses each line using a regular expression and emits the hour as the key together with the value 1: (hour0,1), (hour1,1), (hour2,1), ..., (hour23,1). After mapping, the shuffle phase collects all (key, value) pairs that share the same hour from the different map functions and forms groups: Group 1 contains (hour0,1), (hour0,1), (hour0,1) and so on, Group 2 contains (hour1,1), (hour1,1) and so on. The reduce function calculates the sum for each hour group, producing (hour0, SUM), (hour1, SUM), ..., (hour23, SUM).
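Again only as an illustrative sketch (not the authors' code), the map function for this job differs from the one in Section 4.1 only in the key it emits: it extracts the hour from the fixed-width timestamp between the square brackets. The summing reducer sketched above can be reused unchanged.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper: extracts the hour of the request and emits (hour, 1).
public class HourHitMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text hour = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        int open = line.indexOf('[');
        // The timestamp is fixed width, [dd/MMM/yyyy:HH:mm:ss -zzzz],
        // so the hour is the two characters following "dd/MMM/yyyy:".
        if (open >= 0 && line.length() >= open + 15) {
            hour.set("hour" + line.substring(open + 13, open + 15));
            context.write(hour, ONE);    // emit (hourHH, 1)
        }
    }
}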

5. Experimental results

This section discusses the results obtained from the experiments.

5.1. Experimental setup

To calculate the total number of hits received by each URL and by the web site in each hour, a two-node Hadoop cluster is set up with the configuration shown in Table 1.

Table 1. System configuration

Operating System               Ubuntu 14.04
Hadoop Version                 Hadoop 1.2.1
Number of nodes in the cluster 2 (192.168.2.1, 192.168.2.2)
Dataset                        NASA access log (July 1 – July 31, 1995)
Dataset Size                   195 MB
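For completeness, a hypothetical driver for the hits-per-URL job is sketched below. The output directory name no_of_hits_by_URL is taken from Section 5.2 and the mapper and reducer classes are the sketches from Section 4.1; the input path and the remaining settings are assumptions rather than the authors' configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UrlHitDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "hits-per-URL");       // Hadoop 1.x style job creation
        job.setJarByClass(UrlHitDriver.class);

        job.setMapperClass(UrlHitMapper.class);        // sketched in Section 4.1
        job.setCombinerClass(UrlHitReducer.class);     // local pre-aggregation of (URL, 1) pairs
        job.setReducerClass(UrlHitReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Web log already loaded into HDFS (hypothetical path) and the
        // output directory reported in Section 5.2.
        FileInputFormat.addInputPath(job, new Path("/user/hduser/weblog"));
        FileOutputFormat.setOutputPath(job, new Path("no_of_hits_by_URL"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}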

5.2. Results of calculating the total number of hits received by each URL

Before executing the MapReduce code in the two-node cluster environment, the web log file is loaded into HDFS. The total number of hits recorded in the web log file is 1891715. The log was collected from 00:00:00 July 1, 1995 through 23:59:59 July 31, 1995, a total of 31 days [10]. Figure 4 shows the contents of the output directory named no_of_hits_by_URL in HDFS; the output is stored in a file called part_r_00000.



Figure 5 shows a chunk of the output file that is generated when the MapReduce code for calculating the number of hits received by each URL is executed on the input web log file.

Figure 4. no_of_hits_by_URL output directory in HDFS

When the MapReduce job that calculates the total number of hits received by each URL is executed, the CPU time spent is 42420 milliseconds. Three map tasks and one reduce task are launched; the map tasks take 32 seconds and the reduce task takes 44 seconds.

Figure 5. A chunk of the output file in HDFS showing the number of hits received by each URL

5.3. Results of calculating the total number of hits received by the website in each hour

Figure 6. no_of_hits_by_Hour output directory in HDFS

When the MapReduce job that calculates the total number of hits received by the website in each hour is executed, the CPU time spent is 48390 milliseconds. Three map tasks and one reduce task are launched to process the dataset; the map tasks take 38 seconds and the reduce task takes 23 seconds.

Figure 7. Output: Number of hits received in each hour


Figure 6 shows the contents of the output directory named no_of_hits_by_Hour in HDFS; the output is stored in a file called part_r_00000. Figure 7 shows the number of hits received by the web site in each hour. This output is generated in HDFS storage after executing the MapReduce code on the input web log file.

Figure 8. Pictorial representation of the number of hits received in each hour

Figure 8 shows a pictorial representation of the number of hits received by the web site in each hour. From the graph, it can be seen that the maximum number of hits is received during the 9th hour.

6. Conclusion

A web log file is stored in a two-node Hadoop distributed cluster environment and analysed. The response time taken to analyse the web log file is low because the file is broken into blocks, stored across the two-node cluster and analysed in parallel. The MapReduce programming model of the Hadoop framework is used to analyse the web log file in parallel. In this paper, the total number of hits received by each URL and the total number of hits received by the website in each hour are calculated. In the future, the number of nodes in the cluster can be increased, and data mining techniques such as recommendation, clustering and classification can be applied to the web log file stored in the Hadoop file system to extract useful patterns, so that E-Commerce companies can provide a better shopping experience to their online customers and increase their profits.

7. References

[1] Jerry Jao (CEO of Retention Science), "Why Big Data is a Must in E-Commerce", guest post. http://www.bigdatalandscape.com/news/why-big-data-is-a-must-in-ecommerce
[2] Dave Jaffe, "3 Approaches to Big Data Analysis with Apache Hadoop". http://www.dell.com/learn/us/en/19/power/ps1q14-20140158-jaffe
[3] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden and Michael Stonebraker, (2009) "A Comparison of Approaches to Large-Scale Data Analysis", ACM SIGMOD '09.
[4] Yogesh Pingle, Vaibhav Kohli, Shruti Kamat and Nimesh Poladia, (2012) "Big Data Processing using Apache Hadoop in Cloud System", National Conference on Emerging Trends in Engineering & Technology.
[5] Tom White, (2009) "Hadoop: The Definitive Guide", O'Reilly, Sebastopol, California.
[6] Apache Hadoop, http://Hadoop.apache.org
[7] Jeffrey Dean and Sanjay Ghemawat, (2004) "MapReduce: Simplified Data Processing on Large Clusters", Google Research Publication.
[8] Sayalee Narkhede and Tripti Baraskar, (2013) "HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce", International Journal of UbiComp (IJU), Vol. 4, No. 3, July 2013.
[9] Hortonworks, "Apache Hadoop HDFS", http://hortonworks.com/hadoop/hdfs/
[10] NASA-HTTP web server logs, http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html
