term paper on weka

Upload: kanishka-chakraborty

Post on 29-Nov-2014

DESCRIPTION

A term paper demonstrating the use of WEKA to conduct cluster analysis and regression analysis.

TRANSCRIPT

Page 1: Term Paper on WEKA

Regression Analysis and Cluster Analysis Using WEKA

2012

Kanishka Chakraborty (10BM60036) VGSoM, IIT Kharagpur

2010-2012

Table of Contents

Introduction

Scope of this term paper

Data Used

Analysis Done

Analysis

Regression Analysis

Cluster Analysis

References

Introduction

The amount of data generated is huge and is growing at an exponential rate. But data is not of much use in itself: it must be converted into information that can be interpreted and used. There are multiple methods to convert data into information. Data mining is one such method; it helps in deducing meaningful patterns and facts from data and has applications in every walk of life. Organizations rely on data mining to obtain the insights on which their decisions are based. Many data mining tools are available in the market. WEKA (Waikato Environment for Knowledge Analysis) is one such tool, and it is among the most widely used toolkits of its kind.

It is a free, Java-based tool available under the GNU General Public License. Its many features have made it a popular data mining tool: it offers visualization tools, algorithms, and preprocessing and modeling techniques for conducting data mining. It provides the user with both a GUI (Graphical User Interface) and a CLI (Command Line Interface).

The applications available are:

Explorer: An environment for analyzing data in WEKA

Experimenter: An environment for running experiments and conducting statistical tests

KnowledgeFlow: Offers the same functionality as Explorer through a drag-and-drop interface

Simple CLI: Provides a command line interface for WEKA

The tool works with data in its native .arff format. ARFF stands for Attribute-Relation File Format. It is an ASCII text file that declares the relation name and the attributes, followed by the values of each attribute for every instance. It consists of three parts: Relation, Attribute and Data.

Scope of this term-paper

This paper analyzes telecom customers' usage patterns of, and experience with, Value Added Services (VAS). The analysis is carried out to identify customers who are likely to adopt a service like 3G. The paper also tries to identify which factors are important in assessing whether a customer will adopt 3G. This information plays a major role in creating the marketing strategy for 3G.

DATA USED

The data used for this paper was collected through a survey conducted in Guwahati, Assam. The survey was carried out to identify the important factors that differentiate potential 3G customers from non-potential ones. The sample size used for this analysis is 206, and the sample consists of the following demographic segments:

Students

Young Professionals (<35 years of age)

Working Professionals (>35 years of age)

Housewives

Defense personnel

Low Income Group (Rickshaw drivers, Auto rickshaw drivers, Shopkeepers etc.)

Variable | Description | Categories

Monthly expenditure on VAS | How much the customer spends on VAS in a month | <100, 100-300, 300-500, >500

Mobile Internet | Whether the customer uses internet on their mobile | Yes, No

Internet speed experience | What has been the mobile internet usage satisfaction level of the customer | Satisfied, Neither Satisfied nor Dissatisfied, Dissatisfied, Not used

3G Awareness | How aware the customer is of the 3G services | Using 3G, Fully Aware, Partially Aware, Not Aware

Handset Price | What is the price of the handset the customer is using | <3000, 3000-5000, 5000-7000, 7000-10000, 10000-15000, 15000-20000, 20000-30000, >30000

3G usage plan | Whether the customer is planning to use 3G in the near future | Yes, No

Demography | The age-occupation combination of the customer | Low income group, Housewives, Defense, Young Professionals, Working Professionals, Students

To be usable in WEKA, the data was first converted to .arff format. This is done by defining the following:

Relation: The name of the data set is declared under the relation header.

Attribute: Each variable is defined as an attribute. The data type (numeric, string etc.) is also defined for each attribute.

Data: The instances are entered under the data header, which holds the value of each attribute for every instance.
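As an illustration of the conversion described above, the following minimal Python sketch builds ARFF text from a list of attributes and instances. The relation name, the attribute names and the two instances are made up for the example; only the category lists come from the variable table. The attributes are declared as nominal here, which is an assumption: the paper does not show its actual .arff file, and the later analysis suggests that numeric codes were used.

```python
def quote(value):
    """Quote a nominal value so that spaces and symbols such as '<' are safe."""
    return "'" + value.replace("'", "\\'") + "'"


def to_arff(relation, attributes, rows):
    """Build ARFF text from nominal attributes.

    attributes: list of (name, [allowed values]) pairs
    rows: list of value lists, one per instance, in attribute order
    """
    lines = ["@relation " + relation, ""]
    for name, values in attributes:
        allowed = ",".join(quote(v) for v in values)
        lines.append("@attribute " + name + " {" + allowed + "}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(quote(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical attribute names; the categories follow the variable table.
attributes = [
    ("mobile_internet", ["Yes", "No"]),
    ("vas_expenditure", ["<100", "100-300", "300-500", ">500"]),
    ("plan_3g", ["Yes", "No"]),
]
# Made-up instances, not taken from the survey data.
rows = [["Yes", "100-300", "Yes"], ["No", "<100", "No"]]

arff_text = to_arff("telecom_vas_survey", attributes, rows)
print(arff_text)
```

The printed text can be saved with an .arff extension and opened from the Explorer.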

ANALYSIS DONE

The following analysis will be conducted using the tool:

Regression

Clustering

Regression is carried out to understand the relationships among the variables in the data, so as to predict how one variable varies with respect to one or more others. Clustering is a technique that forms groups and assigns each instance to one of them, such that the instances within a group are similar to each other. It is widely used for segmenting customers according to their characteristics and preferences.

Analysis

Regression Analysis

Regression analysis is used to understand the relation that a particular variable (the dependent variable) shares with others (the independent variables). For this paper the factors studied are as follows:

Dependent Variable: Plan to use 3G

Independent Variables:

o Internet mobile user

o 3G awareness

o Price of the handset used

STEPS TO FOLLOW

I. Select Classify tab

II. Click on the Choose button

III. Go to functions

IV. Select LinearRegression from the list

V. Under Test options, select Percentage split and enter the % of the data to be used for training (the rest will be used for testing)

VI. Click on Start to perform the analysis
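A percentage split divides the instances into two parts, one used to fit the model and one used to evaluate it. The sketch below shows the idea in plain Python; it is not WEKA's implementation. The 66% figure and the seed are assumptions chosen for the example, since the paper does not state the percentage that was entered.

```python
import random


def percentage_split(instances, train_percent, seed=1):
    """Shuffle the instances and split them into training and test parts."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


# 206 placeholder instance ids, matching the sample size of the survey.
train, test = percentage_split(range(206), 66)
print(len(train), len(test))  # prints "136 70"
```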

OUTPUT

The regression analysis conducted on the data gives us the following equation:

3G Planner = 0.4599 * (Internet Mobile User) + 0.0891 * (3G awareness) - 0.1325 * (Handset price) + 0.9421

ANALYSIS OF THE OUTPUT

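The output leads to the following interpretations. The variables enter the equation as numeric codes; judging from the cluster centroids and cluster descriptions reported under Cluster Analysis, "Yes" appears to be coded lower than "No" for both the 3G plan and mobile internet use, and a lower 3G awareness code appears to indicate higher awareness. Under that coding, a smaller value of 3G Planner means a stronger intention to use 3G.

Whether a person plans to use 3G depends to a great extent on whether that person uses internet on their mobile (coefficient 0.4599). A person who uses internet on their mobile is more likely to try 3G.

The 3G trial plan also depends on the price of the handset the respondent currently uses (coefficient -0.1325). The higher the price, the higher the likelihood that the person will try 3G.

The plan for 3G usage also depends on the 3G awareness level, but the dependence is weak (coefficient 0.0891). According to the output, the higher the awareness about 3G, the more likely it is that the person will try 3G.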
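To make the reading of the equation concrete, the sketch below evaluates it for two made-up respondents. The numeric coding is an assumption inferred from the cluster tables, not something the paper states: 1 = Yes and 2 = No for mobile internet use and for the 3G plan, 1 to 4 for awareness (1 = Using 3G, 4 = Not Aware), and 1 to 8 for the handset price bands in increasing order.

```python
def predict_3g_plan(internet_user, awareness, handset_price):
    """Evaluate the regression equation reported in the output."""
    return (0.4599 * internet_user
            + 0.0891 * awareness
            - 0.1325 * handset_price
            + 0.9421)


# Assumed coding: a result near 1 reads as "Yes", a result near 2 as "No".
# Respondent A: uses mobile internet, fully aware, handset in band 4.
likely = predict_3g_plan(internet_user=1, awareness=2, handset_price=4)
# Respondent B: no mobile internet, not aware, handset in band 1.
unlikely = predict_3g_plan(internet_user=2, awareness=4, handset_price=1)

print(round(likely, 4), round(unlikely, 4))  # prints "1.0502 2.0858"
```

Under the assumed coding, respondent A lands close to "Yes" and respondent B close to "No", which matches the interpretation above.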

Cluster Analysis

Before creating a marketing strategy for any product, it is important to identify the segments present in the market. These segments can then be studied in order to select the one best suited for targeting. Clustering can be used to identify these segments. For this paper, K-Means clustering has been used.

STEPS TO FOLLOW

I. Select Cluster tab

II. Click on the Choose button

III. Select SimpleKMeans from the list

IV. Click on the text box beside the Choose button and enter the desired number of clusters in numClusters

V. Click on Start to perform the analysis
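For readers who want to see what a K-Means run does with the number of clusters entered above, here is a minimal plain-Python sketch of the standard algorithm (assign each point to its nearest centroid, then recompute the centroids, until the assignment stops changing). It is an illustration on made-up two-dimensional points, not WEKA's SimpleKMeans implementation, and the function names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, seed=0, max_iterations=100):
    """Cluster points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(column) / len(members) for column in zip(*members)
                )
    return centroids, labels


# Made-up data: two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, labels = k_means(points, k=2)
print(sorted(centroids))  # prints "[(0.0, 0.5), (10.0, 10.5)]"
```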

OUTPUT

The outputs obtained are as follows:

Cluster centroids

The centroids obtained by clustering help in understanding the characteristics of each segment. They describe each cluster in terms of the various variables.

Attribute | Cluster 0 | Cluster 1 | Cluster 2 | Cluster 3

Membership | (65) | (61) | (41) | (39)

Monthly Expense on VAS | 1.1692 | 1 | 1.1951 | 1.3333

Mobile Internet user | 0.6923 | 2 | 1.7317 | 0.9487

Satisfaction level of mobile internet usage | 1.9385 | 0 | 0 | 1.2462

3G Awareness | 2.4769 | 2.6885 | 2.6098 | 2.1538

Demography | 4.18154 | 2.3607 | 4.6829 | 5.2821

Price of Handset used | 2.3385 | 2.1311 | 3.3415 | 3.8769

3G usage plan | 2 | 2 | 1.9268 | 0.8974

Clustered Instances

The clustered instances give the number of instances that belong to each cluster. This helps in estimating what percentage of the total population is likely to belong to each cluster.

Cluster 0: 65 (32%)

Cluster 1: 61 (30%)

Cluster 2: 41 (20%)

Cluster 3: 39 (19%)
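The percentages follow directly from the cluster sizes and the sample size of 206. A short Python check, using only the counts reported above:

```python
# Cluster sizes as reported in the output.
sizes = {0: 65, 1: 61, 2: 41, 3: 39}
total = sum(sizes.values())

# Share of each cluster, rounded to the nearest whole percent.
shares = {cluster: round(100 * n / total) for cluster, n in sizes.items()}

print(total)   # prints "206"
print(shares)  # prints "{0: 32, 1: 30, 2: 20, 3: 19}"
```

The rounded shares add up to 101 rather than 100, which is only an effect of rounding each value separately.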

ANALYSIS OF THE OUTPUT

In K-Means clustering the number of clusters to be formed is entered by the user. Here the number of clusters has been set to 4. WEKA describes each cluster in terms of its centroid, that is, the central value of each variable within the cluster. The cluster descriptions are as follows:

Attribute | Cluster 0 | Cluster 1 | Cluster 2 | Cluster 3

Membership | (65) | (61) | (41) | (39)

Monthly Expense on VAS | <100 | <100 | <100 | 0-300

Mobile Internet user | Yes | No | No | Yes

Satisfaction level of mobile internet usage | Not Satisfied | Haven’t used | Haven’t used | Satisfied

3G Awareness | Low awareness | Low awareness | Low awareness | Fully Aware

Demography | Working Professionals | Housewives | Working Professionals | Young Professionals & Students

Price of Handset used | 3000-5000 | 3000-5000 | 5000-7000 | 7000-10000

3G usage plan | No | No | No | Yes
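The step from a numeric centroid to a category label can be illustrated for the handset price. The sketch assumes that the price bands were coded 1 to 8 in increasing order and that a centroid is read as the band with the nearest code; the paper does not state its coding, so both points are assumptions, and the rule is not claimed to hold for the other attributes.

```python
# Handset price bands from the variable table, assumed to be coded 1 to 8.
HANDSET_BANDS = [
    "<3000", "3000-5000", "5000-7000", "7000-10000",
    "10000-15000", "15000-20000", "20000-30000", ">30000",
]


def nearest_band(centroid):
    """Map a centroid value to the band whose code is nearest to it."""
    code = min(max(round(centroid), 1), len(HANDSET_BANDS))
    return HANDSET_BANDS[code - 1]


# Handset price centroids of clusters 0 to 3, as reported in the output.
handset_centroids = [2.3385, 2.1311, 3.3415, 3.8769]
bands = [nearest_band(c) for c in handset_centroids]
print(bands)  # prints "['3000-5000', '3000-5000', '5000-7000', '7000-10000']"
```

Under these assumptions the result agrees with the handset price row of the cluster descriptions.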

Thus the segment to be targeted initially is cluster 3. It consists of young professionals (<35 years of age) and students. This segment is the most likely to go for 3G services. The awareness level of this segment is fairly high. The handsets used by the members of this segment are in the price band of 7000-10000. The members of this segment are satisfied with the speed of internet they receive on their handsets. The cluster membership of this segment is 19%. Thus the analysis suggests that around 19% of the total population consists of customers who are likely to go for a service like 3G.

References

http://www.ibm.com/developerworks/opensource/library/os-weka1/index.html

http://en.wikipedia.org/wiki/Weka_%28machine_learning%29

http://sourceforge.net/projects/weka/files/documentation/3.6.x/WekaManual-3-6-2.pdf/download