
Department of Information Systems

DATA MINING (WEKA) LAB MANUAL

Table of Contents

S.No Week Topic

1. Week-1 How to Create Attribute-Relation File Format (.arff) Files
2. Week-2 How to Create CSV (Comma-Separated Values) Files
3. Week-3 Data Pre-Processing in Weka
4. Week-4 Data Discretization in Weka
5. Week-5 & 6 Association Rule Mining in Weka
6. Week-7 & 8 Classification via Decision Trees in Weka
7. Week-9 & 10 K-Means Clustering in Weka
8. Week-11 & 12 Using Visualization in Weka
9. Week-13 & 14 Using the Command Line


Attribute-Relation File Format (ARFF)

An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of

instances sharing a set of attributes. ARFF files were developed by the Machine Learning Project

at the Department of Computer Science of The University of Waikato for use with the Weka

machine learning software. This document describes the version of ARFF used with Weka

versions 3.2 to 3.3; this is an extension of the ARFF format as described in the data mining book

written by Ian H. Witten and Eibe Frank (the new additions are string attributes, date attributes,

and sparse instances).

Overview

ARFF files have two distinct sections. The first section is the Header information, which is followed by the Data information.

The Header of the ARFF file contains the name of the relation, a list of the attributes (the

columns in the data), and their types. An example header on the standard IRIS dataset looks like

this:

% 1. Title: Iris Plants Database

%

% 2. Sources:

% (a) Creator: R.A. Fisher

% (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)

% (c) Date: July, 1988

%

@RELATION iris

@ATTRIBUTE sepallength NUMERIC

@ATTRIBUTE sepalwidth NUMERIC

@ATTRIBUTE petallength NUMERIC

@ATTRIBUTE petalwidth NUMERIC

@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}


The Data section of the ARFF file looks like the following:

@DATA

5.1,3.5,1.4,0.2,Iris-setosa

4.9,3.0,1.4,0.2,Iris-setosa

4.7,3.2,1.3,0.2,Iris-setosa

4.6,3.1,1.5,0.2,Iris-setosa

5.0,3.6,1.4,0.2,Iris-setosa

5.4,3.9,1.7,0.4,Iris-setosa

4.6,3.4,1.4,0.3,Iris-setosa

5.0,3.4,1.5,0.2,Iris-setosa

4.4,2.9,1.4,0.2,Iris-setosa

4.9,3.1,1.5,0.1,Iris-setosa

Lines that begin with a % are comments.

The @RELATION, @ATTRIBUTE and @DATA declarations are case insensitive.

Example:

@relation LCCvsLCSH

@attribute LCC string

@attribute LCSH string

@data

AG5, 'Encyclopedias and dictionaries.;Twentieth century.'

AS262, 'Science -- Soviet Union -- History.'

AE5, 'Encyclopedias and dictionaries.'

AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Phases.'

AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Tables.'
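Although this manual works mainly through the Explorer GUI, an ARFF file can also be read programmatically through WEKA's Java API. The following is a minimal sketch (not part of the original exercises), assuming weka.jar is on the classpath and a local file named iris.arff exists; the class name LoadArff is just illustrative:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadArff {
    public static void main(String[] args) throws Exception {
        // DataSource selects a loader based on the file extension (.arff here)
        Instances data = DataSource.read("iris.arff");
        System.out.println("Relation:   " + data.relationName());
        System.out.println("Attributes: " + data.numAttributes());
        System.out.println("Instances:  " + data.numInstances());
    }
}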


HOW TO CREATE CSV FILE

A comma-separated values (CSV) file is a text file in which values are separated by commas (or, in some variants, by another delimiter character). Creating a CSV file is as simple as creating any text file: it can be written in any text editor, although it is often created in a spreadsheet program such as Microsoft Excel or OpenOffice Calc. Below are the steps for creating a CSV file in each of the following:

1. Notepad (or any text editor)

2. Microsoft Excel

Notepad (or any text editor)

To create a CSV file in a text editor, open a new file in a program such as Notepad. Then write the data you wish the file to contain, separating each field (column) of data with a comma and each row with a new line.

Title1,Title2,Title3

one,two,three

example1,example2,example3

As an example, if you were to create a text file with the above data and save it as a CSV file, each comma would mark a column boundary and each new line would start a row. Therefore, if the above data were opened in a spreadsheet program such as Microsoft Excel, it would produce a table similar to the example below.


Title1 Title2 Title3

one two three

example1 example2 example3

If the data you plan to use in your CSV file already contains commas, such as an address, it is easier to use a different delimiter to separate the values. For example, in the CSV file below we create names and addresses for labels that will be printed, separating each name-and-address record with a tilde character. Alternatively, a better solution would be to put the address, city, and state in their own columns.

Name

Street Address

City, State ZIP Code

~Mr John Smith

123 Fake address

Salt Lake City, Utah 89110

~Mrs Jane Doe

586 Another fake

Delta, Utah 84624

~Bill White

123 N Fake Street

St Anthony, Idaho 83445

Microsoft Excel

Open Microsoft Excel and the file you wish to save as a CSV file. For example, below is the data contained in our example Excel worksheet:

Item Cost Sold Profit

Keyboard $10.00 $16.00 $6.00

Monitor $80.00 $120.00 $40.00

Mouse $5.00 $7.00 $2.00

Total $48.00


Once open, click File, choose the Save As option, and for "Save as type:" select the "CSV (Comma delimited) (*.csv)" option.

Once saved, if you were to open the CSV file in a text editor, such as Notepad, the CSV file

should resemble the below example.

Item,Cost,Sold,Profit

Keyboard,$10.00,$16.00,$6.00

Monitor,$80.00,$120.00,$40.00

Mouse,$5.00,$7.00,$2.00

,,Total,$48.00


Data Preprocessing in WEKA

This program illustrates some of the basic data preprocessing operations that can be performed using WEKA. The sample data set used for this example, unless otherwise indicated, is the "bank data" available in comma-separated format (bank-data.csv).

The data contains the following fields:

id           a unique identification number
age          age of customer in years (numeric)
sex          MALE / FEMALE
region       inner_city / rural / suburban / town
income       income of customer (numeric)
married      is the customer married (YES/NO)
children     number of children (numeric)
car          does the customer own a car (YES/NO)
save_acct    does the customer have a saving account (YES/NO)
current_acct does the customer have a current account (YES/NO)
mortgage     does the customer have a mortgage (YES/NO)
pep          did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)


Loading the Data

In addition to the native ARFF data file format, WEKA has the capability to read in ".csv"

format files. This is fortunate since many databases or spreadsheet applications can save or

export data into flat files in this format. As can be seen in the sample data file, the first row

contains the attribute names (separated by commas) followed by each data row with attribute

values listed in the same order (also separated by commas). In fact, once loaded into WEKA, the

data set can be saved into ARFF format. If, however, you are interested in converting a ".csv" file

into WEKA's native ARFF using the command line, this can be accomplished using the

following command:

java weka.core.converters.CSVLoader filename.csv > filename.arff
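The same conversion can also be done programmatically with WEKA's converter classes. Below is a minimal sketch of this approach, assuming weka.jar is on the classpath and bank-data.csv is in the working directory (the class name CsvToArff is illustrative):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Read the CSV file; the first row is interpreted as the attribute names
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("bank-data.csv"));
        Instances data = loader.getDataSet();

        // Write the same instances back out in ARFF format
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("bank-data.arff"));
        saver.writeBatch();
    }
}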

In this example, we load the data set into WEKA, perform a series of operations using WEKA's

attribute and discretization filters, and then perform association rule mining on the resulting data

set. While all of these operations can be performed from the command line, we use the GUI

interface for WEKA Explorer.

Initially (in the Preprocess tab) click "Open file..." and navigate to the directory containing the data file (.csv or .arff). In this case we will open the above data file. This is shown in Figure p1.


Since the data is not in ARFF format, a dialog box will prompt you to use the converter, as in Figure p2. You can click on the "Use Converter" button, and click OK in the next dialog box that appears (see Figure p3).


Once the data is loaded, WEKA will recognize the attributes and during the scan of the data will

compute some basic statistics on each attribute. The left panel in Figure p4 shows the list of

recognized attributes, while the top panels indicate the names of the base relation (or table) and

the current working relation (which are the same initially).

Clicking on any attribute in the left panel will show the basic statistics on that attribute. For

categorical attributes, the frequency for each attribute value is shown, while for continuous

attributes we can obtain min, max, mean, standard deviation, etc. As an example, see Figures p5

and p6 below which show the results of selecting the "age" and "married" attributes,

respectively.


Note that the visualization in the right bottom panel is a form of cross-tabulation across two

attributes. For example, in Figure p6 above, the default visualization panel cross-tabulates

"married" with the "pep" attribute (by default the second attribute is the last column of the data

file). You can select another attribute using the drop down list.

Selecting or Filtering Attributes

In our sample data file, each record is uniquely identified by a customer id (the "id" attribute).

We need to remove this attribute before the data mining step. We can do this by using the

Attribute filters in WEKA. In the "Filter" panel, click on the "Choose" button. This will show a

popup window with a list of available filters. Scroll down the list and select the

"weka.filters.unsupervised.attribute.Remove" filter as shown in Figure p7.


Next, click on the text box immediately to the right of the "Choose" button. In the resulting dialog

box enter the index of the attribute to be filtered out (this can be a range or a list separated by

commas). In this case, we enter 1 which is the index of the "id" attribute (see the left panel).

Make sure that the "invertSelection" option is set to false (otherwise everything except attribute 1

will be filtered). Then click "OK" (See Figure p8). Now, in the filter box you will see "Remove -

R 1" (see Figure p9).


Click the "Apply" button to apply this filter to the data. This will remove the "id" attribute and

create a new working relation (whose name now includes the details of the filter that was

applied). The result is depicted in Figure p10:


It is possible now to apply additional filters to the new working relation. In this example,

however, we will save our intermediate results as separate data files and treat each step as a

separate WEKA session. To save the new working relation as an ARFF file, click on the "Save" button

in the top panel. Here, as shown in the "save" dialog box (see Figure p11), we will save the new

relation in the file "bank-data-R1.arff".

Figure p12 shows the top portion of the new generated ARFF file (in TextPad).


Note that in the new data set, the "id" attribute and all the corresponding values in the records

have been removed. Also, note that Weka has automatically determined the correct types and

values associated with the attributes, as listed in the Attributes section of the ARFF file.
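For reference, the same attribute removal can be scripted against the WEKA API. A minimal sketch, assuming the data has already been converted to bank-data.arff in the working directory (the class name RemoveId is illustrative):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RemoveId {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank-data.arff");

        // Same settings as the GUI example: remove attribute 1 ("id"),
        // with invertSelection left as false
        Remove remove = new Remove();
        remove.setAttributeIndices("1");
        remove.setInvertSelection(false);
        remove.setInputFormat(data);

        Instances filtered = Filter.useFilter(data, remove);
        System.out.println(filtered.numAttributes() + " attributes remain");
    }
}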


Discretization

Some techniques, such as association rule mining, can only be performed on categorical data.

This requires performing discretization on numeric or continuous attributes. There are 3 such

attributes in this data set: "age", "income", and "children". In the case of the "children" attribute, the range of possible values is only 0, 1, 2, and 3. In this case, we have opted for keeping all of

these values in the data. This means we can simply discretize by removing the keyword

"numeric" as the type for the "children" attribute in the ARFF file, and replacing it with the set of

discrete values. We do this directly in our text editor as seen in Figure p13. In this case, we have

saved the resulting relation in a separate file "bank-data2.arff".

We will rely on WEKA to perform discretization on the "age" and "income" attributes. In this

example, we divide each of these into 3 bins (intervals). The WEKA discretization filter can divide the ranges blindly, or use various statistical techniques to automatically determine the best way of partitioning the data. In this case, we will perform simple binning.


First we will load our filtered data set into WEKA by opening the file "bank-data2.arff". The

"open" dialog box in depicted in Figure p14.

If we select the "children" attribute in this new data set, we see that it is now a categorical

attribute with four possible discrete values. This is depicted in Figure p15.


Now, once again we activate the Filter dialog box, but this time, we will select

"weka.filters.unsupervised.attribute.Discretize" from the list (see Figure p16).


Next, to change the defaults for this filter, click on the box immediately to the right of the

"Choose" button. This will open the Discretize Filter dialog box. We enter the index for the the

attributes to be discretized. In this case we enter 1 corresponding to attribute "age". We also enter

3 as the number of bins (note that it is possible to discretize more than one attribute at the same

time (by using a list of attribute indeces). Since we are doing simple binning, all of the other

available options are set to "false". The dialog box is depicted in Figure p17.
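The equivalent filter call through the Java API looks as follows; this is a minimal sketch, assuming bank-data2.arff is in the working directory, mirroring the choices above (3 bins, attribute index 1; the class name BinAge is illustrative):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class BinAge {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank-data2.arff");

        // Simple (equal-width) binning of attribute 1 ("age") into 3 bins
        Discretize discretize = new Discretize();
        discretize.setBins(3);
        discretize.setAttributeIndices("1");
        discretize.setInputFormat(data);

        Instances binned = Filter.useFilter(data, discretize);
        System.out.println(binned.attribute(0));  // shows the generated range labels
    }
}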

Click "Apply" in the Filter panel. This will result in a new working relation with the selected

attribute partitioned into 3 bins (see Figure p18). To examine the results, we save the new

working relation in the file "bank-data3.arff" as depicted in Figure p19.


Let us now examine the new data set using our text editor (in this case, TextPad). The top portion

of the data is shown in Figure p19. You can observe that WEKA has assigned its own labels to

each of the value ranges for the discretized attribute. For example, the lower range in the "age"


attribute is labeled "(-inf-34.333333]" (enclosed in single quotes and escape characters), while

the middle range is labeled "(34.333333-50.666667]", and so on. These labels now also appear in

the data records where the original age value was in the corresponding range.

Next, we apply the same process to discretize the "income" attribute into 3 bins. Again, Weka

automatically performs the binning and replaces the values in the "income" column with the

appropriate automatically generated labels. We save the new file into "bank-data3.arff",

replacing the older version.

Clearly, the WEKA labels, while readable, leave much to be desired as far as naming

conventions go. We will thus use the global search/replace functions in TextPad to replace these

labels with more succinct and readable ones. Fortunately, TextPad has a powerful regular

expression pattern matching capability which allows us to do this efficiently. The TextPad

search/replace dialog box is used to replace the age label "(-inf-34.333333]" with the label "0_34". Note that the "regular expression" option is selected. In the "Find what" box we have entered the full label '\'(-inf-34.333333]\'' (including the back-slashes and single quotes). Furthermore, the back-slashes are escaped with another back-slash so that the regular expression pattern matching treats them as literals (resulting in: '\\'(-inf-34.333333]\\''). In the "Replace with" box we enter "0_34".

Now we click on the "Replace All" button to replace all instances of the old patterns with the

new one. The result of this operation is depicted in Figure p20.


Note that the new label now appears in place of the old one both in the attribute section of the

ARFF file as well as in the relevant data records. We repeat this manual re-labeling process with

all of the WEKA-assigned labels for the "age" and the "income" attributes. Figure p21 shows the

final result of the transformation and the newly assigned labels for these attribute values.

We now also change the relation name in the ARFF file to "bank-data-final" and save the file as

"bank-data-final.arff".


Association Rule Mining with WEKA

This program illustrates some of the basic elements of association rule mining using WEKA. The

sample data set used for this example, unless otherwise indicated, is the "bank data" described in

(Data Preprocessing in WEKA). In this case, our starting point is the discretized data obtained

after performing the preprocessing tasks. Figure a1 shows the WEKA explorer interface after

opening this data file ("bank-data-final.arff").

Clicking on the "Associate" tab will bring up the interface for the association rule algorithms.

The Apriori algorithm, which we will use, is the default algorithm selected. However, in order to

change the parameters for this run (e.g., support, confidence, etc.) we click on the text box

immediately to the right of the "Choose" button. Note that this box, at any given time, shows the

specific command-line arguments that are to be used for the algorithm. The dialog box for


changing the parameters is depicted in Figure a2. Here, you can specify various parameters

associated with Apriori. Click on the "More" button to see the synopsis for the different

parameters.

WEKA allows the resulting rules to be sorted according to different metrics such as confidence,

leverage, and lift. In this example, we have selected lift as the criterion and have entered 1.5 as its minimum value. Lift (or improvement) is computed as the confidence of the rule divided by the support of the right-hand side (RHS). In a simplified form, given a rule L => R, lift is the ratio of the probability that L and R occur together to the product of the two individual probabilities for L and R, i.e.,

lift = Pr(L,R) / (Pr(L).Pr(R)).

If this value is 1, then L and R are independent. The higher this value, the more likely that the

existence of L and R together in a transaction is not just a random occurrence, but because of

some relationship between them.
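For example, with the hypothetical values Pr(L,R) = 0.2, Pr(L) = 0.4, and Pr(R) = 0.25, lift = 0.2 / ((0.4)(0.25)) = 0.2 / 0.1 = 2, i.e., L and R occur together twice as often as would be expected if they were independent.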


Here we also change the default number of rules (10) to 100; this indicates that the program will

report no more than the top 100 rules (in this case sorted according to their lift values). The

upper bound for minimum support is set to 1.0 (100%) and the lower bound to 0.1 (10%).

Apriori in WEKA starts with the upper bound support and incrementally decreases support (by

delta increments which by default is set to 0.05 or 5%). The algorithm halts when either the

specified number of rules are generated, or the lower bound for min. support is reached. The

significance testing option is only applicable in the case of confidence and is by default not used

(-1.0).

The final selection of parameters for our current run is depicted in Figure a3:

Once the parameters have been set, the command-line text box will show the new command line.

We now click on start to run the program. This results in a set of rules as depicted in Figure a4.


The panel on the left ("Result list") now shows an item indicating the algorithm that was run and

the time of the run. You can perform multiple runs in the same session each time with different

parameters. Each run will appear as an item in the Result list panel. Clicking on one of the

results in this list will bring up the details of the run, including the discovered rules in the right

panel. In addition, right-clicking on the result set allows us to save the result buffer into a

separate file. In this case, we save the output in the file bank-data-ar1.txt. A portion of this file is

depicted in Figure a5:


Note that the rules were discovered based on the specified threshold values for support and lift.

For each rule, the frequency counts for the LHS and RHS are given, as well as the values for confidence, lift, leverage, and conviction. Note that leverage and lift measure similar things, except that leverage measures the difference between the probability of co-occurrence of L and R (see the example above) and the product of the independent probabilities of L and R, i.e.,

leverage = Pr(L,R) - Pr(L).Pr(R).

In other words, leverage measures the proportion of additional cases covered by both L and R

above those expected if L and R were independent of each other. Thus, for leverage, values


above 0 are desirable, whereas for lift, we want to see values greater than 1. Finally, conviction is similar to lift, but it measures the effect of the right-hand side not being true. It also inverts the ratio. So, conviction is measured as:

conviction = Pr(L).Pr(not R) / Pr(L, not R).

Thus, conviction, in contrast to lift, is not symmetric (and also has no upper bound).
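Continuing the hypothetical numbers from the lift example above (Pr(L,R) = 0.2, Pr(L) = 0.4, Pr(R) = 0.25): leverage = 0.2 - (0.4)(0.25) = 0.1, meaning L and R co-occur in 10% more of the cases than independence would predict; and since Pr(L, not R) = Pr(L) - Pr(L,R) = 0.2, conviction = (0.4)(0.75) / 0.2 = 1.5.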

In most cases, it is sufficient to focus on a combination of support, confidence, and either lift or

leverage to quantitatively measure the "quality" of the rule. However, the real value of a rule, in

terms of usefulness and actionability, is subjective and depends heavily on the particular domain

and business objectives.

Using the Command Line

In general, using WEKA from the command line provides more flexibility than using the GUI

version (we will discuss this more in the context of classification). In the case of association

rules, the GUI version does not provide the ability to save the frequent itemsets (independently

of the generated rules). We can do this using the command line. If we look at the output of the

association rule mining from the above example (the file bank-data-ar1.txt), the actual command

line options are given under the "Run information" at the top. In the example, this command line

is:

weka.associations.Apriori -N 100 -T 1 -C 1.5 -D 0.05 -U 1.0 -M 0.1 -S -1.0

We can use this directly using the "Simple CLI" interface.

In the main WEKA interface, click "Simple CLI" button to start the command line interface. The

main command for generating the rules as we did above is:


java weka.associations.Apriori options -t directory-path\bank-data-final.arff

where the word options is replaced with the command line options, which for the above example

are:

-N 100 -T 1 -C 1.5 -D 0.05 -U 1.0 -M 0.1 -S -1.0

The additional "-t directory-path\bank-data-final.arff" option tells WEKA to use the file "bank-

data-final.arff" as the input file (located in the specified directory). This command will produce

exactly the same output as the previous GUI example. However, we can add an additional option

("-I") which results in the generation of all frequent itemsets:

java weka.associations.Apriori options -I -t directory-path\bank-data-final.arff

This command as it is used in the SimpleCLI interface is depicted in Figure a6:


When ready, press enter to run the program with the indicated options. The result of this

command will be displayed in the top panel of the Simple CLI interface. Here, the results have

been saved into a file bank-data-ar2.txt. You will notice that before the rules, the output includes itemsets of various sizes generated at different iterations of the Apriori algorithm (in this case, L1

through L5) along with the support count for each itemset. In the case of L1, these are simply the

individual items (attributes) that meet the minimum support threshold.
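The same run can also be scripted through the Java API. A minimal sketch, assuming bank-data-final.arff is in the working directory; the option string mirrors the command line above, with -I added to list the frequent itemsets (the class name MineRules is illustrative):

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class MineRules {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank-data-final.arff");

        // -N 100 rules, ranked by lift (-T 1), minimum lift 1.5 (-C),
        // delta 0.05 (-D), support bounds 1.0/0.1 (-U/-M),
        // no significance testing (-S -1.0), report itemsets (-I)
        Apriori apriori = new Apriori();
        apriori.setOptions(Utils.splitOptions(
                "-N 100 -T 1 -C 1.5 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -I"));
        apriori.buildAssociations(data);

        // Printing the model lists the frequent itemsets and the discovered rules
        System.out.println(apriori);
    }
}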


Classification via Decision Trees in WEKA

This example illustrates the use of the C4.5 (J48) classifier in WEKA. The sample data set used for this example, unless otherwise indicated, is the bank data available in comma-separated format (bank-data.csv). This document assumes that appropriate data preprocessing has been performed; in this case, the ID field has been removed. Since the C4.5 algorithm can handle numeric attributes, there

is no need to discretize any of the attributes. For the purposes of this example, however, the

"Children" attribute has been converted into a categorical attribute with values "YES" or "NO".

WEKA has implementations of numerous classification and prediction algorithms. The basic

ideas behind using all of these are similar. In this example we will use the modified version of

the bank data to classify new instances using the C4.5 algorithm (note that the C4.5 is

implemented in WEKA by the classifier class: weka.classifiers.trees.J48). The modified (and

smaller) version of the bank data can be found in the file "bank.arff" and the new unclassified

instances are in the file "bank-new.arff".

As usual, we begin by loading the data into WEKA, as seen in Figure 1:


Figure: 1

Next, we select the "Classify" tab and click the "Choose" button to select the J48 classifier, as

depicted in Figures 2-a and 2-b. Note that J48 (an implementation of the C4.5 algorithm) does not

require discretization of numeric attributes, in contrast to the ID3 algorithm from which C4.5 has

evolved.

Figure 2-a Figure 2-b


Now, we can specify the various parameters. These can be specified by clicking in the text box

to the right of the "Choose" button, as depicted in Figure 3. In this example we accept the default

values. The default version does perform some pruning (using the subtree raising approach), but

does not perform error pruning. The selected parameters are depicted in Figure 3.

Figure 3

Under the "Test options" in the main panel we select 10-fold cross-validation as our evaluation

approach. Since we do not have a separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model. We now click "Start" to generate the model. The ASCII version of the tree as well as the evaluation statistics will appear in the right panel when the model

construction is completed (see Figure 4).


Figure 4
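The same train-and-evaluate step can be reproduced through the Java API. A minimal sketch, assuming bank.arff is in the working directory with "pep" as the last attribute (the class name TrainJ48 is illustrative):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("bank.arff");
        data.setClassIndex(data.numAttributes() - 1);  // class attribute: "pep"

        // Default J48 settings, as accepted in the GUI example
        J48 tree = new J48();
        tree.buildClassifier(data);

        // 10-fold cross-validation, matching the "Test options" choice above
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        System.out.println(tree);                    // ASCII version of the tree
        System.out.println(eval.toSummaryString());  // accuracy statistics
    }
}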

We can view this information in a separate window by right clicking the last result set (inside the

"Result list" panel on the left) and selecting "View in separate window" from the pop-up menu.

These steps and the resulting window containing the classification results are depicted in Figures

5-a and 5-b.

Figure 5-a Figure 5-b


Note that the classification accuracy of our model is only about 69%. This indicates that we may need to do more work (either in preprocessing or in selecting the correct parameters for

classification), before building another model. In this example, however, we will continue with

this model despite its inaccuracy.

WEKA also lets us view a graphical rendition of the classification tree. This can be done by

right clicking the last result set (as before) and selecting "Visualize tree" from the pop-up menu.

The tree for this example is depicted in Figure 6. Note that by resizing the window and selecting

various menu items from inside the tree view (using the right mouse button), we can adjust the

tree view to make it more readable.

Figure 6

We will now use our model to classify the new instances. A portion of the new instances ARFF

file is depicted in Figure 7. Note that the attribute section is identical to the training data (bank


data we used for building our model). However, in the data section, the value of the "pep"

attribute is "?" (or unknown).

Figure 7

In the main panel, under "Test options" click the "Supplied test set" radio button, and then click

the "Set..." button. This will pop up a window which allows you to open the file containing test

instances, as in Figures 8-a and 8-b.

Figure 8-a Figure 8-b

In this case, we open the file "bank-new.arff" and upon returning to the main window, we click

the "start" button. This, once again generates the models from our training data, but this time it


applies the model to the new unclassified instances in the "bank-new.arff" file in order to predict

the value of "pep" attribute. The result is depicted in Figure 9. Note that the summary of the

results in the right panel does not show any statistics. This is because in our test instances the

value of the class attribute ("pep") was left as "?", thus WEKA has no actual values to which it

can compare the predicted values of new instances.

Figure 9

Of course, in this example we are interested in knowing how our model managed to classify the

new instances. To do so we need to create a file containing all the new instances along with their

predicted class value resulting from the application of the model. Doing this is much simpler

using the command line version of WEKA classifier application. However, it is possible to do so

in the GUI version using an "indirect" approach, as follows.


First, right-click the most recent result set in the left "Result list" panel. In the resulting pop-up

window select the menu item "Visualize classifier errors". This brings up a separate window

containing a two-dimensional graph. These steps and the resulting window are shown in Figures

9 and 10.

Figure 10

For now, we are not interested in what this graph represents. Rather, we would like to "save" the

classification results from which the graph is generated. In the new window, we click on the

"Save" button and save the result as the file: "bank-predicted.arff", as shown in Figure 11.


Figure 11

This file contains a copy of the new instances along with an additional column for the predicted

value of "pep". The top portion of the file can be seen in Figure 12.

Figure 12

Note that two attributes have been added to the original new instances data: "Instance_number"

and "predictedpep". These correspond to new columns in the data portion. The "predictedpep"


value for each new instance is the last value before the "?", which is the actual "pep" class value. For

example, the predicted value of the "pep" attribute for instance 0 is "YES" according to our

model, while the predicted class value for instance 4 is "NO".

Using the Command Line (Recommended)

While the GUI version of WEKA is nice for visualizing the results and setting the parameters

using forms, when it comes to building a classification (or predictions) model and then applying

it to new instances, the most direct and flexible approach is to use the command line. In fact, you

can use the GUI to create the list of parameters (for example in case of the J48 class) and then

use those parameters in the command line.

In the main WEKA interface, click "Simple CLI" button to start the command line interface. The

main command for generating the classification model as we did above is:

java weka.classifiers.trees.J48 -C 0.25 -M 2 -t directory-path\bank.arff -d directory-path\bank.model

The options -C 0.25 and -M 2 in the above command are the same options that we selected for

J48 classifier in the previous GUI example (see Figure 3). The -t option in the command

specifies that the next string is the full directory path to the training file (in this case "bank.arff").

In the above command directory-path should be replaced with the full directory path where the

training file resides. Finally, the -d option specifies the name (and location) where the model will

be stored. After executing this command inside the "Simple CLI" interface, you should see the

tree and stats about the model in the top window (See Figure 13).


Figure 13

Based on the above command, our classification model has been stored in the file "bank.model"

and placed in the directory we specified. We can now apply this model to the new instances. The

advantage of building a model and storing it is that it can be applied at any time to different sets

of unclassified instances. The command for doing so is:

java weka.classifiers.trees.J48 -p 9 -l directory-path\bank.model -T directory-path\bank-new.arff


In the above command, the option -p 9 indicates that we want to predict a value for attribute

number 9 (which is "pep"). The -l options specifies the directory path and name of the model file

(this is what was created in the previous step). Finally, the -T option specifies the name (and

path) of the test data. In our example, the test data is our new instances file "bank-new.arff").

This command results in a 4-column output similar to the following:

0 YES 0.75 ?

1 NO 0.7272727272727273 ?

2 YES 0.95 ?

3 YES 0.8813559322033898 ?

4 NO 0.8421052631578947 ?

The first column is the instance number assigned to the new instances in "bank-new.arff" by

WEKA. The 2nd column is the predicted value of the "pep" attribute for the new instance. The

3rd column is the confidence (prediction accuracy) for that instance. Finally, the 4th column is the actual "pep" value in the test data (in this case, we did not have a value for "pep" in "bank-new.arff", thus this value is "?"). For example, in the above output, the predicted value of "pep" in instance 2 is "YES" with a confidence of 95%. A portion of the final result is depicted in

Figure 14.


Figure 14

The above output is preferable to the output derived from the GUI version of WEKA. First, this is a more direct approach which allows us to save the classification model. This model can be applied to new instances later without having to regenerate the model. Secondly (and more importantly), in contrast to the final output of the GUI version, in this case we have independent confidence (accuracy) values for each of the new instances. This means that we can focus only on those predictions with which we are more confident. For example, in the above output, we could filter out any instance whose predicted value has an accuracy of less than 85%.
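The same build-save-predict cycle can be written against the Java API. A minimal sketch, assuming bank.arff and bank-new.arff are in the working directory with "pep" as the last attribute (the class name SavePredict is illustrative):

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class SavePredict {
    public static void main(String[] args) throws Exception {
        // Build the model on the training data and store it (like the -d option)
        Instances train = DataSource.read("bank.arff");
        train.setClassIndex(train.numAttributes() - 1);
        J48 tree = new J48();
        tree.buildClassifier(train);
        SerializationHelper.write("bank.model", tree);

        // Later: reload the stored model and classify the new instances
        // (like the -l and -T options)
        J48 loaded = (J48) SerializationHelper.read("bank.model");
        Instances unlabeled = DataSource.read("bank-new.arff");
        unlabeled.setClassIndex(unlabeled.numAttributes() - 1);

        for (int i = 0; i < unlabeled.numInstances(); i++) {
            double pred = loaded.classifyInstance(unlabeled.instance(i));
            System.out.println(i + " " + unlabeled.classAttribute().value((int) pred));
        }
    }
}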


K-Means Clustering in WEKA

This example illustrates the use of k-means clustering with WEKA. The sample data set used for this example is based on the "bank data" available in comma-separated format (bank-data.csv). This document assumes that appropriate data preprocessing has been performed. In this case, a version of the initial data set has been created in which the ID field has been removed and the "children" attribute has been converted to categorical (this, however, is not necessary for clustering).

The resulting data file is "bank.arff" and includes 600 instances. As an illustration of performing

clustering in WEKA, we will use its implementation of the K-means algorithm to cluster the

customers in this bank data set, and to characterize the resulting customer segments. Figure 1

shows the main WEKA Explorer interface with the data file loaded.

Figure 1


Some implementations of K-means only allow numerical values for attributes. In that case, it

may be necessary to convert the data set into the standard spreadsheet format and convert

categorical attributes to binary. It may also be necessary to normalize the values of attributes that

are measured on substantially different scales (e.g., "age" and "income"). While WEKA provides

filters to accomplish all of these preprocessing tasks, they are not necessary for clustering in WEKA. This is because WEKA's SimpleKMeans algorithm automatically handles a mixture of categorical and numerical attributes. Furthermore, the algorithm automatically normalizes numerical attributes when doing distance computations. The WEKA SimpleKMeans algorithm uses the Euclidean distance measure to compute distances between instances and clusters.

To perform clustering, select the "Cluster" tab in the Explorer and click on the "Choose" button.

This results in a drop down list of available clustering algorithms. In this case we select

"SimpleKMeans". Next, click on the text box to the right of the "Choose" button to get the pop-

up window shown in Figure 2, for editing the clustering parameters.

Figure 2


In the pop-up window we enter 6 as the number of clusters (instead of the default value of 2)

and we leave the value of "seed" as is. The seed value is used in generating a random number

which is, in turn, used for making the initial assignment of instances to clusters. Note that, in

general, K-means is quite sensitive to how clusters are initially assigned. Thus, it is often

necessary to try different values and evaluate the results.

Once the options have been specified, we can run the clustering algorithm. Here we make sure

that in the "Cluster Mode" panel, the "Use training set" option is selected, and we click "Start".

We can right click the result set in the "Result list" panel and view the results of clustering in a

separate window. This process and the resulting window are shown in Figures 3 and 4.

Figure 3


Figure 4
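The same clustering run can be reproduced through the Java API. A minimal sketch, assuming bank.arff is in the working directory, with 6 clusters and the default seed of 10, as above (the class name ClusterBank is illustrative):

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClusterBank {
    public static void main(String[] args) throws Exception {
        // No class index is set: all attributes take part in the clustering
        Instances data = DataSource.read("bank.arff");

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(6);  // as entered in the pop-up window
        kmeans.setSeed(10);        // the default seed value
        kmeans.buildClusterer(data);

        // Printing the model shows the cluster centroids and sizes
        System.out.println(kmeans);
    }
}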

The result window shows the centroid of each cluster as well as statistics on the number and

percentage of instances assigned to different clusters. Cluster centroids are the mean vectors for

each cluster (so, each dimension value in the centroid represents the mean value for that

dimension in the cluster). Thus, centroids can be used to characterize the clusters. For example,

the centroid for cluster 1 shows that this is a segment of cases representing middle-aged to young (approx. 38) females living in the inner city with an average income of approx. $28,500, who are married with one child, etc. Furthermore, this group has, on average, said YES to the PEP product.

Another way of understanding the characteristics of each cluster is through visualization. We can

do this by right-clicking the result set on the left "Result list" panel and selecting "Visualize

cluster assignments". This pops up the visualization window as shown in Figure 5.


Figure 5

You can choose the cluster number and any of the other attributes for each of the three different

dimensions available (x-axis, y-axis, and color). Different combinations of choices will result in

a visual rendering of different relationships within each cluster. In the above example, we have

chosen the cluster number as the x-axis, the instance number (assigned by WEKA) as the y-axis,

and the "sex" attribute as the color dimension. This will result in a visualization of the

distribution of males and females in each cluster. For instance, you can note that clusters 2 and 3

are dominated by males, while clusters 4 and 5 are dominated by females. In this case, by

changing the color dimension to other attributes, we can see their distribution within each of the

clusters.

Finally, we may be interested in saving the resulting data set, which includes each instance along

with its assigned cluster. To do so, we click the "Save" button in the visualization window and

save the result as the file "bank-kmeans.arff". The top portion of this file is depicted in Figure 6.


Figure 6

Note that in addition to the "instance_number" attribute, WEKA has also added "Cluster"

attribute to the original data set. In the data portion, each instance now has its assigned cluster as

the last attribute value. By doing some simple manipulation to this data set, we can easily convert

it to a more usable form for additional analysis or processing. For example, here we have

converted this data set in a comma-separated format and sorted the result by clusters.

Furthermore, we have added the ID field from the original data set (before sorting). The results

of these steps can be seen in the file "bank- kmeans.csv".