

EXPERIMENT-1

AIM: TO STUDY WEKA TOOLS

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can be applied directly to a dataset. Weka contains tools for data preprocessing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes.

There are four options available on this initial screen:

1. Simple CLI - provides users without a graphical interface the ability to execute commands from a terminal window.

2. Explorer - the graphical interface used to conduct experimentation on raw data.

3. Experimenter - this option allows the user to conduct different experimental variations on data sets and perform statistical manipulation.

4. Knowledge Flow - basically the same functionality as the Explorer, with drag-and-drop functionality.


There are six tabs in the Explorer:

1. Preprocess - used to choose and modify the data being acted on.

2. Classify - train and test learning schemes that classify or perform regression.

3. Cluster - used to apply different tools that identify clusters within the data file.

4. Associate - used to apply different rules to the data file that identify associations within the data.

5. Select attributes - select the most relevant attributes in the data.

6. Visualize - used to see what the various manipulations produced on the data set, in a 2D format as scatter plot and bar graph output.

PREPROCESSING:

In order to experiment with the application, a data set needs to be presented to WEKA in a format that the program understands. There are rules for the type of data that WEKA will accept. There are three options for presenting data to the program:

1. Open file - allows the user to select files residing on the local machine or a recorded medium.

2. Open URL - provides a mechanism to locate a file or data source from a different location specified by the user.

3. Open Database - allows the user to retrieve files or data from a database source provided by the user.

ARFF FILE:

It is an external representation of an Instances class. It consists of:

A header: describes the attribute types.
A data section: a comma-separated list of data.

Each entry in a dataset is an instance of the Java class weka.core.Instance. Each instance consists of a number of attributes.

Types of attributes:

1. numeric: a real or integer number.
2. nominal: one of a predefined list of values, e.g. red, green, blue.
3. string: enclosed in "double quotes" or 'single quotes'.
4. date
5. relational


PROGRAM:

@relation gradessystem

@attribute grade {'grade A','grade B','grade C'}
@attribute totalscore real
@attribute remark {excellence,good,poor}
@attribute result {PASS,FAIL}

@data
'grade A',97,excellence,PASS
'grade B',58,good,PASS
'grade C',33,poor,FAIL
'grade A',88,excellence,PASS
'grade C',18,poor,FAIL
'grade C',29,poor,FAIL
'grade B',65,good,PASS
'grade C',32,poor,FAIL

OUTPUT:
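The same file can also be consumed programmatically. The following is a minimal sketch using Weka's Java API; the file name gradessystem.arff is an assumed placeholder for wherever the program above is saved.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadArff {
    public static void main(String[] args) throws Exception {
        // Read the ARFF file; DataSource picks a suitable loader by extension.
        Instances data = DataSource.read("gradessystem.arff");

        // Treat the last attribute (result) as the class to predict.
        data.setClassIndex(data.numAttributes() - 1);

        // Print the header information from the @relation and @attribute lines.
        System.out.println("Relation:   " + data.relationName());
        System.out.println("Attributes: " + data.numAttributes());
        System.out.println("Instances:  " + data.numInstances());
    }
}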


EXPERIMENT-3

AIM- Classification of data through WEKA

3.1 CLASSIFY- Classification (also known as classification trees or decision trees) is a data mining algorithm that creates a step-by-step guide for how to determine the output of a new data instance. The tree it creates is exactly that: a tree whereby each node in the tree represents a spot where a decision must be made based on the input, and you move to the next node and the next until you reach a leaf that tells you the predicted output. Sounds confusing, but it's really quite straightforward. Let's look at an example (Figure 3.1).

Simple classification tree:

              [ Will You Read This Section? ]
                     /               \
                   Yes                No
                   /                    \
    [Will You Understand It?]     [Won't Learn It]
           /           \
         Yes            No
         /                \
  [Will Learn It]    [Won't Learn It]

Figure 3.1

This simple classification tree seeks to answer the question "Will you understand classification trees?" At each node, you answer the question and move down that branch, until you reach a leaf that answers yes or no. This model can be used for any unknown data instance, and you are able to predict whether this unknown data instance will learn classification trees by asking only two simple questions. That's seemingly the big advantage of a classification tree: it doesn't require a lot of information about the data to create a tree that could be very accurate and very informative.

One important concept of the classification tree is similar to what we saw in the regression model from Part 1: the concept of using a "training set" to produce the model. This takes a data set with known output values and uses it to build our model. Then, whenever we have a new data point with an unknown output value, we put it through the model and produce our expected output. This is all the same as we saw in the regression model. However, this type of model takes it one step further, and it is common practice to take an entire training set and divide it into two parts: take about 60-80 percent of the data and put it into our training set, which we will use to create the model; then take the remaining data and put it into a test set, which we'll use immediately after creating the model to test the accuracy of our model.

Why is this extra step important in this model? The problem is called overfitting: If we supply too much data into our model creation, the model will actually be created perfectly, but just for that data. Remember: We want to use the model to predict future unknowns; we don't want the model to perfectly predict values we already know. This is why we create a test set. After we create the model, we check to ensure that the accuracy of the model we built doesn't decrease with the test set. This ensures that our model will accurately predict future unknown values. We'll see this in action using WEKA.
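A minimal sketch of this train/test split with Weka's Java API (assuming the data has already been loaded into an Instances object named data, as in the earlier sketch; the 66 percent proportion is an illustrative choice):

// Shuffle first so the split is not biased by the order of the file.
data.randomize(new java.util.Random(1));

// Put roughly two thirds of the instances into the training set
// and the remainder into the test set.
int trainSize = (int) Math.round(data.numInstances() * 0.66);
int testSize  = data.numInstances() - trainSize;
Instances train = new Instances(data, 0, trainSize);
Instances test  = new Instances(data, trainSize, testSize);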

This brings up another one of the important concepts of classification trees: the notion of pruning. Pruning, like the name implies, involves removing branches of the classification tree. Why would someone want to remove information from the tree? Again, this is due to the concept of overfitting. As the data set grows larger and the number of attributes grows larger, we can create trees that become increasingly complex. Theoretically, there could be a tree with leaves = (rows * attributes). But what good would that do? That won't help us at all in predicting future unknowns, since it's perfectly suited only for our existing training data. We want to create a balance. We want our tree to be as simple as possible, with as few nodes and leaves as possible. But we also want it to be as accurate as possible. This is a trade-off, which we will see.
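To see the pruning trade-off concretely, a sketch like the following builds an unpruned and a pruned tree on the same training set and compares their sizes. It uses Weka's J48 decision-tree learner; the confidence factor value is illustrative, and train is the training set from the split sketch above.

import weka.classifiers.trees.J48;

// Unpruned tree: grows until it fits the training data closely.
J48 unpruned = new J48();
unpruned.setUnpruned(true);
unpruned.buildClassifier(train);

// Pruned tree: a smaller confidence factor means more aggressive pruning.
J48 pruned = new J48();
pruned.setConfidenceFactor(0.25f);
pruned.buildClassifier(train);

System.out.println("Unpruned leaves: " + unpruned.measureNumLeaves());
System.out.println("Pruned leaves:   " + pruned.measureNumLeaves());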

Finally, the last point I want to raise about classification before using WEKA is that of false positives and false negatives. Basically, a false positive is a data instance where the model we've created predicts it should be positive, but instead, the actual value is negative. Conversely, a false negative is a data instance where the model predicts it should be negative, but the actual value is positive. These errors indicate we have problems in our model, as the model is incorrectly classifying some of the data. While some incorrect classifications can be expected, it's up to the model creator to determine what is an acceptable percentage of errors. For example, if the test were for heart monitors in a hospital, obviously, you would require an extremely low error percentage. On the other hand, if you are simply mining some made-up data in an article about data mining, your acceptable error percentage can be much higher. To take this even one step further, you need to decide what percentage of false negatives vs. false positives is acceptable. The example that immediately comes to mind is a spam model: a false positive (a real e-mail that gets labeled as spam) is probably much more damaging than a false negative (a spam message getting labeled as not spam). In an example like this, you may judge a minimum of 100:1 false negative:false positive ratio to be acceptable.

3.2 ZEROR: Here I chose the ZeroR rule algorithm to classify the data. The table below describes the capabilities of ZeroR.

Capability           Supported
Class                Numeric class, Date class, Nominal class, Missing class values, Binary class
Attributes           Relational attributes, Binary attributes, String attributes, Unary attributes, Date attributes, Missing values, Empty nominal attributes, Numeric attributes, Nominal attributes
Min # of instances   0

Figure 3.2

By using the ZeroR algorithm we obtain the following output, in which we can see the confusion matrix.
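A minimal sketch of running this evaluation from the Java API rather than the Explorer (assuming the gradessystem data loaded as in the earlier sketch; 10-fold cross-validation and the fixed seed are illustrative choices):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.ZeroR;

// ZeroR simply predicts the majority class, giving a baseline accuracy.
ZeroR zr = new ZeroR();
zr.buildClassifier(data);

// 10-fold cross-validation with a fixed seed for repeatable results.
Evaluation eval = new Evaluation(data);
eval.crossValidateModel(zr, data, 10, new Random(1));

System.out.println(eval.toSummaryString());
System.out.println(eval.toMatrixString());  // prints the confusion matrix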


3.3 CONFUSION MATRIX- In the field of machine learning, a confusion matrix, also known as a contingency table or an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. The name stems from the fact that it makes it easy to see whether the system is confusing two classes (i.e. commonly mislabeling one as another).

Figure 3.3

Writing the entries of the two-class confusion matrix as a (true negatives), b (false positives), c (false negatives) and d (true positives), the measures below follow.

The accuracy (AC) is the proportion of the total number of predictions that were correct:

AC = (a + d) / (a + b + c + d) ..........(i)

1. The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified, as calculated using equation (ii):

TP = d / (c + d) ..........(ii)

2. The false positive rate (FP) is the proportion of negative cases that were incorrectly classified as positive, as calculated using equation (iii):

FP = b / (a + b) ..........(iii)

3. The true negative rate (TN) is defined as the proportion of negative cases that were classified correctly, as calculated using equation (iv):

TN = a / (a + b) ..........(iv)

4. The false negative rate (FN) is defined as the proportion of positive cases that were incorrectly classified as negative, as calculated using equation (v):

FN = c / (c + d) ..........(v)

5. Finally, precision (P) is the proportion of predicted positive cases that were correct, as calculated using equation (vi):

P = d / (b + d) ..........(vi)
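Putting these formulas into code, the sketch below computes each measure from a hypothetical set of confusion-matrix counts (the numbers are made up for illustration):

// Hypothetical counts from a 2x2 confusion matrix.
double a = 50;  // true negatives
double b = 10;  // false positives
double c = 5;   // false negatives
double d = 35;  // true positives

double ac = (a + d) / (a + b + c + d);  // accuracy, eq. (i)
double tp = d / (c + d);                // true positive rate (recall), eq. (ii)
double fp = b / (a + b);                // false positive rate, eq. (iii)
double tn = a / (a + b);                // true negative rate, eq. (iv)
double fn = c / (c + d);                // false negative rate, eq. (v)
double p  = d / (b + d);                // precision, eq. (vi)

System.out.printf("AC=%.3f TP=%.3f FP=%.3f TN=%.3f FN=%.3f P=%.3f%n",
        ac, tp, fp, tn, fn, p);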

3.4 RESULT:


EXPERIMENT-4

AIM- Clustering of data through WEKA

4.1 CLUSTER- The Cluster mode box is used to choose what to cluster and how to evaluate the results. The first three options are the same as for classification: Use training set, Supplied test set and Percentage split, except that now the data is assigned to clusters instead of trying to predict a specific class. The fourth mode, Classes to clusters evaluation, compares how well the chosen clusters match up with a pre-assigned class in the data. The drop-down box below this option selects the class, just as in the Classify panel. An additional option in the Cluster mode box, the Store clusters for visualization tick box, determines whether or not it will be possible to visualize the clusters once training is complete. When dealing with datasets that are so large that memory becomes a problem, it may be helpful to disable this option.
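A minimal clustering sketch along these lines, using Weka's SimpleKMeans from the Java API (the cluster count of 2 is an illustrative choice, and data is a loaded Instances object):

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;

// Build a k-means clusterer with two clusters (illustrative choice).
// Note: the clusterer expects the data to have no class attribute set.
SimpleKMeans km = new SimpleKMeans();
km.setNumClusters(2);
km.buildClusterer(data);

// Evaluate the clustering on the same data and print the results.
ClusterEvaluation ce = new ClusterEvaluation();
ce.setClusterer(km);
ce.evaluateClusterer(data);
System.out.println(ce.clusterResultsToString());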

Figure 4.1

The figure shows the various instances and attributes of the relation, and also shows the training modes of the clusters.

If we visualize the above relation, we find a tree having nodes and the following information.


Figure 4.2

The figure contains nodes and leaves showing the instances and attributes of the relation.

4.2 Result:


EXPERIMENT-5

AIM- PERFORMING KNOWLEDGE FLOW ON DATA SET

5.1 Introduction to knowledge flow- The Knowledge Flow presents a "data-flow" inspired interface to Weka. The user can select Weka components from a tool bar, place them on a layout canvas and connect them together in order to form a "knowledge flow" for processing and analyzing data. At present, all of Weka's classifiers and filters are available in the Knowledge Flow along with some extra tools. The Knowledge Flow can handle data either incrementally or in batches (the Explorer handles batch data only). Of course, learning from data incrementally requires a classifier that can be updated on an instance-by-instance basis. Weka includes several classifiers that can handle data incrementally, among them NaiveBayesUpdateable, IB1, IBk and LWR (locally weighted regression).
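A minimal sketch of this incremental mode with the Java API, training NaiveBayesUpdateable one instance at a time (the file name weather.arff is an assumed placeholder for any ARFF file with a nominal class):

import java.io.File;
import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

// Read only the ARFF header, then stream instances one at a time.
ArffLoader loader = new ArffLoader();
loader.setFile(new File("weather.arff"));
Instances structure = loader.getStructure();
structure.setClassIndex(structure.numAttributes() - 1);

// Initialize the model on the empty structure, then update it per instance.
NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
nb.buildClassifier(structure);
Instance inst;
while ((inst = loader.getNextInstance(structure)) != null) {
    nb.updateClassifier(inst);
}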

5.2 Features of the Knowledge Flow:

1. Intuitive data-flow style layout.

2. Process data in batches or incrementally.

3. Process multiple batches or streams in parallel (each separate flow executes in its own thread).

4. Chain filters together.

5. View models produced by classifiers for each fold in a cross-validation.

6. Visualize performance of incremental classifiers during processing (scrolling plots of classification accuracy, RMS error, predictions, etc.).

5.3 Components available in the Knowledge Flow:

1. Evaluation:

- Training Set Maker - make a data set into a training set.

- Test Set Maker - make a data set into a test set.

- Cross Validation Fold Maker - split any data set, training set or test set into folds (see the sketch after this list).

- Train Test Split Maker - split any data set, training set or test set into a training set and a test set.

- Class Assigner - assign a column to be the class for any data set, training set or test set.

- Class Value Picker - choose a class value to be considered as the "positive" class. This is useful when generating data for ROC style curves (see below).

- Classifier Performance Evaluator - evaluate the performance of batch trained/tested classifiers.

- Incremental Classifier Evaluator - evaluate the performance of incrementally trained classifiers.

- Prediction Appender - append classifier predictions to a test set. For discrete class problems, can either append predicted class labels or probability distributions.


2. Visualization:

- Data Visualizer - component that can pop up a panel for visualizing data in a single large 2D scatter plot.

- Scatter Plot Matrix - component that can pop up a panel containing a matrix of small scatter plots (clicking on a small plot pops up a large scatter plot).

- Attribute Summarizer - component that can pop up a panel containing a matrix of histogram plots, one for each of the attributes in the input data.

- Model Performance Chart - component that can pop up a panel for visualizing threshold (i.e. ROC style) curves.

- Text Viewer - component for showing textual data. Can show data sets, classification performance statistics, etc.

- Graph Viewer - component that can pop up a panel for visualizing tree-based models.

- Strip Chart - component that can pop up a panel that displays a scrolling plot of data (used for viewing the online performance of incremental classifiers).
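As an example of how one of the evaluation components above maps to code, the Cross Validation Fold Maker corresponds to the trainCV and testCV methods on Instances, as in this sketch (data is a loaded Instances object with its class attribute set; 10 folds is the usual default):

// Shuffle before creating folds so each fold is representative.
int folds = 10;
data.randomize(new java.util.Random(1));
for (int i = 0; i < folds; i++) {
    // Each iteration yields a complementary train/test pair for fold i.
    Instances trainFold = data.trainCV(folds, i);
    Instances testFold  = data.testCV(folds, i);
    // ... build and evaluate a classifier on this fold ...
}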

5.4 Start-up window in knowledge flow

Figure 5.1

Page 15: Weka

5.5 Formation of tools in knowledge flow-

Figure 5.2

5.6 Result produced by the Text Viewer in knowledge flow-

Figure 5.3

Page 16: Weka

5.7 Graph list generated by the Graph Viewer in knowledge flow-

Figure 5.4

5.8 Final tree formed in knowledge flow-

Figure 5.5

Page 17: Weka

5.9 Graph formed using the Model Performance Chart in knowledge flow-

Figure 5.6

5.10 Result:
