
G.B. PANT ENGINEERING COLLEGE

PROJECT FILE OF

DATA WAREHOUSE & DATA MINING

G.G.S.I.P. University, Delhi

SUBMITTED TO: Ms. MAMTA MITTAL (Prof., DWDM)

SUBMITTED BY: RAVI KUMAR, C.S.E (6th Sem), Roll No.: 06420902712


DATA WAREHOUSE AND DATA MINING

List of Experiments

1. Compare the functionality of File Management System, Database Management System, and Data Warehousing System.

2. To implement joins with the help of Oracle.

3. To perform indexing on a table.

4. To write a program showing Associative Rule Mining (Market Basket Rule).

5. To study the basics of WEKA software.

6. To implement clustering in WEKA.

7. To implement classification in WEKA.

8. To implement Apriori Algorithm using WEKA.

9. To implement association in WEKA.


Experiment No. 1

AIM: - Compare the functionality of File Management System, Database Management System, and Data Warehousing System.

THEORY: - File Management System vs. Database Management System vs. Data Warehousing System.

File Management System

The data that we work with on computers is kept in a hierarchical file system in which directories have files and subdirectories beneath them. Although we use the computer operating system to keep our image data organized, how we name files and folders, how we arrange these nested folders, and how we handle the files in these folders are the fundamental aspects of file management. The operating system's organization of our data can be enhanced by the use of cataloguing programs, which make organizing and finding image files easier than simply relying on the computer's directory structure. Another feature of catalogue programs is that they can streamline backup procedures for better file protection.

Advantages: -
- Immediate availability of information compared to manual tracking.
- Easy retrieval of reports with minimum input data.
- Stock of pending files.
- Availability of reports on various dimensions (based on different parameters).
- Quick delivery of detailed reports.
- A file is indexed for faster and easier retrieval.
- Gives structure to all your information.
- A file management system allows users to create and store metadata.

Disadvantages: -
- Program-data dependence: file descriptions are stored within each application program that accesses a given file.
- There is data redundancy.
- It cannot generate queries.
- Duplication of data: applications are developed independently in a file processing system, leading to unplanned duplicate files; this wastes memory, and changes in one file must be made manually in all files, which results in loss of data integrity.
- Limited data sharing: each application has its own private files with little opportunity to share data outside its own application. A requested report may require data from several incompatible files in separate systems.
- Lengthy development times: there is little opportunity to leverage previous development.

Database Management System

A DBMS is a collection of programs that enables users to create and maintain a database. Defining a database involves specifying the data types, structures, and constraints for the data to be stored in the database. It also defines rules to validate and manipulate this data. A DBMS relieves users of framing programs for data maintenance. Fourth-generation query languages, such as SQL, are used along with the DBMS package to interact with a database. A database is the back-end of an application. A DBMS receives instructions from a DBA and accordingly instructs the system to make the necessary changes. These commands can be to load, retrieve or modify existing data in the system. A DBMS always provides data independence: any change in storage mechanisms and formats is performed without modifying the entire application.
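As a small illustration of how an application program interacts with a DBMS through SQL, the sketch below uses the standard JDBC API; the Oracle connection URL, user name and password are placeholder assumptions, not values from this file.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DbmsDemo {
    public static void main(String[] args) throws Exception {
        // placeholder connection details for a local Oracle XE instance
        Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@localhost:1521:XE", "scott", "tiger");

        // the DBMS parses and executes the SQL; the program never touches
        // the storage structures directly (data independence)
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("SELECT id, name FROM employee");
        while (rs.next()) {
            System.out.println(rs.getInt("id") + " " + rs.getString("name"));
        }
        con.close();
    }
}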


Advantages: -
- Controlling redundancy.
- Providing storage structures for efficient query processing.
- Restricting unauthorized users.
- Providing concurrency.
- Providing backup and recovery.
- Enforcing integrity constraints.

Disadvantages: -
- Centralization: use of the same program at a time by many users can sometimes lead to loss of some data.
- High cost of software.
- Technical expertise is required.
- Power dependency.
- Reporting features such as the charts of a spreadsheet like Excel may not be available in an RDBMS.

Data Warehousing System

A data warehousing system is a database geared towards the business intelligence requirements of an organisation. The data warehouse integrates data from the various operational systems and is typically loaded from these systems at regular intervals. The data warehouse contains historical information that enables analysis of business performance over time.

Key Features: -

1. Subject-Oriented Data: -
The data is stored by subjects, not by applications. Data is stored by business subjects, where business subjects are those critical for the enterprise. For a manufacturing company, sales, shipments, and inventory are critical business subjects. For a retail store, sales at the checkout counter is a critical subject.

2. Integrated Data: -
The data in the data warehouse comes from several operational systems which are disparate in nature; their operational platforms and operating systems may differ. Before moving the data into the data warehouse, you have to go through a process of transformation, consolidation, and integration of the source data.

3. Time-Variant Data: -
The data in the data warehouse is meant for analysis and decision making, so by the very nature of its purpose it has to contain historical data along with the current values. Data is stored as snapshots over past and current periods. The time-variant nature of the data warehouse:
- Allows for analysis of the past
- Relates information to the present
- Enables forecasts for the future

4. Non-Volatile Data: -
Once the data is captured in the data warehouse, you do not run individual transactions to change it. Data updates are commonplace in an operational database; not so in a data warehouse. The data in a data warehouse is not as volatile as the data in an operational database, and is primarily for query and analysis.

5. Data Granularity (general point): -


In a data warehouse it is efficient to keep data summarized at different levels. Depending on the query, you can then go to the appropriate level of detail. The lower the level of detail, the finer the data granularity. If you want to keep data at the lowest level of detail, you have to store a lot of data in the data warehouse. You will have to decide on the granularity levels based on the data types and the expected system performance for queries.

Advantages: -
- A data warehouse provides a common data model for all data of interest, regardless of the data's source.
- Prior to loading data into the data warehouse, inconsistencies are identified and resolved. This greatly simplifies reporting and analysis.
- Information in the data warehouse is under the control of data warehouse users so that, even if the source system data is purged over time, the information in the warehouse can be stored safely for extended periods of time.
- Because they are separate from operational systems, data warehouses provide retrieval of data without slowing down operational systems.
- Data warehouses can work in conjunction with, and hence enhance the value of, operational business applications, notably customer relationship management (CRM) systems.
- Integrating data from multiple sources.
- Performing new types of analyses.
- Reducing the cost to access historical data.
- Standardizing data across the organization: a "single version of the truth".
- Improving turnaround time for analysis and reporting.
- Sharing data and allowing others to easily access data.
- Supporting ad hoc reporting and inquiry.

Disadvantages: -
- Over their life, data warehouses can have high costs. The data warehouse is usually not static, and maintenance costs are high.
- Data warehouses can get outdated relatively quickly, and there is a cost of delivering suboptimal information to the organization.
- There is often a fine line between data warehouses and operational systems. Duplicate, expensive functionality may be developed; or, functionality may be developed in the data warehouse that, in retrospect, should have been developed in the operational systems, and vice versa.
- Major data schema transformations from each of the data sources to one schema in the data warehouse can represent more than 50% of the total data warehouse effort.
- Data owners lose control over their data, raising ownership (responsibility and accountability), security, and privacy issues.
- Long initial implementation time and associated high cost.
- Adding new data sources takes time and carries an associated high cost.
- Limited flexibility of use and types of users; requires multiple separate data marts for multiple uses and types of users.
- Typically, data is static and dated.
- Typically, no data drill-down capabilities.
- Difficult to accommodate changes in data types and ranges, data source schema, indexes, and queries.
- Typically, cannot actively monitor changes in data.

Example


1. Implementation of a DBMS to store dummy company details. Query: create a table employee containing employee id, name, and department.

Employee Table

After executing desc employee query on the table

Department Table

After executing desc department query on the table


2. Implementation of storing company details using a traditional file processing system.

Experiment No. 2


AIM: - To implement joins with the help of Oracle.

THEORY: - Introduction to Joins
An SQL join clause combines records from two or more tables in a database. It creates a set that can be saved as a table or used as it is. A JOIN is a means of combining fields from two tables by using values common to each. ANSI-standard SQL specifies five types of JOIN: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER and CROSS. As a special case, a table (base table, view, or joined table) can JOIN to itself in a self-join.

We are going to use the following tables: the Employee table and the Department table.

Cross join: - It returns the Cartesian product of rows from tables in the join. In other words, it will produce rows which combine each row from the first table with each row from the second table.

SELECT * FROM employee CROSS JOIN department;

Output after the execution of cross join

Inner Join: - An inner join is a commonly used join operation in applications. It can only be safely used in a database that enforces referential integrity or where the join fields are guaranteed not to be NULL. The choice to use an inner join depends on the database design and data characteristics. A left outer join can usually be substituted for an inner join when the join field in one table may contain NULL values.

Inner join creates a new result table by combining column values of two tables (A and B) based upon the join-predicate. The query compares each row of A with each row of B to find all pairs of rows which satisfy the join-predicate. When the join-predicate is satisfied, column values for each matched pair of rows of A and B are combined into a result row. The result of the join can be defined as the outcome of first taking the Cartesian product (or Cross join) of all records in the tables (combining every record in table A with every record in table B) and then returning all records which satisfy the join predicate.

SELECT * FROM employee INNER JOIN department ON employee.DepartmentID = department.DepartmentID;

After the execution of inner join

Equi-join: - An equi-join is a specific type of comparator-based join that uses only equality comparisons in the join-predicate. Using other comparison operators (such as <) disqualifies a join as an equi-join.

SELECT * FROM employee JOIN department ON employee.DepartmentID = department.DepartmentID;

After the execution of equi-join

Natural join: - A natural join is a type of equi-join where the join predicate arises implicitly by comparing all columns in both tables that have the same column-names in the joined tables. The resulting joined table contains only one column for each pair of equally named columns.


Most experts agree that NATURAL JOINs are dangerous and therefore strongly discourage their use. The danger comes from inadvertently adding a new column, named the same as another column in the other table. An existing natural join might then "naturally" use the new column for comparisons, making comparisons/matches using different criteria (from different columns) than before. Thus an existing query could produce different results, even though the data in the tables have not been changed, but only augmented.

SELECT * FROM employee NATURAL JOIN department;

After the execution of natural join

Outer Join: - An outer join does not require each record in the two joined tables to have a matching record. The joined table retains each record—even if no other matching record exists. Outer joins subdivide further into left outer joins, right outer joins, and full outer joins, depending on which table's rows are retained (left, right, or both). Types of Outer Join

1. Left Outer Join: - The result of a left outer join (or simply left join) for tables A and B always contains all records of the "left" table (A), even if the join-condition does not find any matching record in the "right" table (B). This means that if the ON clause matches 0 (zero) records in B (for a given record in A), the join will still return a row in the result (for that record), but with NULL in each column from B. A left outer join returns all the values from an inner join plus all values in the left table that do not match the right table.

SELECT * FROM employee LEFT OUTER JOIN department ON employee.DepartmentID = department.DepartmentID;

After the execution of left outer join query

2. Right Outer Join: - A right outer join (or right join) closely resembles a left outer join, except with the treatment of the tables reversed. Every row from the "right" table (B) will appear in the joined table at least once. If no matching row from the "left" table (A) exists, NULL will appear in the columns from A for those records that have no match in B. A right outer join returns all the values from the right table, and matched values from the left table (NULL in the case of no matching join predicate).

SELECT * FROM employee RIGHT OUTER JOIN department ON employee.DepartmentID = department.DepartmentID;

After the execution of right outer join

3. Full Outer Join: - Conceptually, a full outer join combines the effect of applying both left and right outer joins. Where records in the FULL OUTER JOINed tables do not match, the result set will have NULL values for every column of the table that lacks a matching row. For those records that do match, a single row will be produced in the result set (containing fields populated from both tables).

SELECT * FROM employee FULL OUTER JOIN department ON employee.DepartmentID = department.DepartmentID;

After the execution of full outer join

Experiment No. 3

AIM: - To perform indexing on a table.


THEORY: - An index is an optional structure, associated with a table or table cluster, that can sometimes speed data access. By creating an index on one or more columns of a table, you gain the ability in some cases to retrieve a small set of randomly distributed rows from the table. Indexes are one of many means of reducing disk input/output (I/O). If a heap-organised table has no indexes, then the database must perform a full table scan to find a value. In general, consider creating an index on a column in any of the following situations:

1. The indexed columns are queried frequently and return a small percentage of the total number of rows in the table.

2. A referential integrity constraint exists on the indexed column. The index is a means to avoid a full table lock that would otherwise be required if you update the parent table primary key, merge into the parent table, or delete from the parent table.

3. A unique key constraint will be placed on the table and you want to manually specify the index and all index options.

Index Characteristics: - Indexes are schema objects that are logically and physically independent of the data in the objects with which they are associated; thus, an index can be dropped without physically affecting the table it indexes.

Database Creation of a Company

Employee Table

Create query: -
create table employee (
  id number(10) primary key,
  name varchar(20),
  age number(3),
  designation varchar(20),
  dnumber number(1)
);

Insert query: -
insert into employee values (123456789, 'John', 25, 'Admin', 5);
insert into employee values (333445555, 'Franklin', 27, 'Manager', 5);
insert into employee values (999887777, 'Alicia', 27, 'CEO', 4);
insert into employee values (987654321, 'Jennifer', 28, 'Manager', 4);
insert into employee values (666884444, 'Ramesh', 28, 'Engineer', 5);
insert into employee values (453453453, 'Joyce', 29, 'CEO', 5);
insert into employee values (888665555, 'Raj', 30, 'Manager', 1);

Display query: -
select * from employee;


Department Table
Create query: -
create table department (
  dnumber number(1) primary key,
  dname varchar(20),
  manager_id number(10) references employee(id)
);

Insert query: -
insert into department values (1, 'Finance', 888665555);
insert into department values (4, 'HR', 987654321);
insert into department values (5, 'Engineering', 333445555);

Display query: -
select * from department;

Project Table
Create query: -
create table project (
  pname varchar(20),
  pnumber number(2),
  plocation varchar(20),
  dnum number(1) references department(dnumber)
);

Insert query: -
insert into project values ('ProductX', 1, 'Bellaire', 5);
insert into project values ('ProductY', 2, 'Sugarland', 5);
insert into project values ('ProductZ', 3, 'Houston', 5);
insert into project values ('Computerization', 10, 'Stafford', 4);
insert into project values ('Recognization', 20, 'Houston', 1);
insert into project values ('Newbenifits', 30, 'Stafford', 4);

Display query: -
select * from project;

Indexing implementation

Query: - create unique index emp_id on employee(id);
Retrieval query: - select /*+ index(employee (id)) */ * from employee where id = 123456789;


Experiment No. 4

AIM: - To write a program showing Associative Rule Mining (Market Basket Rule).

THEORY: - One of the major technologies in data mining involves the discovery of association rules. The database is regarded as a collection of transactions, each involving a set of items. A common example is that of market-basket data. Here the market basket corresponds to the set of items a consumer buys in a supermarket during one visit.

An association rule is of the form X => Y, where X = {x1, x2, ..., xn} and Y = {y1, y2, ..., ym} are sets of items, with xi and yj being distinct items for all i and j. This association states that if a customer buys X, he or she is also likely to buy Y. In general, any association rule has the form LHS (left-hand side) => RHS (right-hand side), where LHS and RHS are sets of items. The set LHS U RHS is called an itemset, the set of items purchased by customers. For an association rule to be of interest to a data miner, the rule should satisfy some interest measure. Two common interest measures are support and confidence.

The support for a rule LHS => RHS is with respect to the itemset; it refers to how frequently a specific itemset occurs in the database. That is, the support is the percentage of transactions that contain all of the items in the itemset, LHS U RHS. If the support is low, it implies that there is no overwhelming evidence that items in LHS U RHS occur together, because the itemset occurs in only a small fraction of transactions. Another term for support is prevalence of the rule.

The confidence is with regard to the implication shown in the rule. The confidence of the rule LHS => RHS is computed as support(LHS U RHS) / support(LHS). We can think of it as the probability that the items in RHS will be purchased given that the items in LHS are purchased by a customer. Another term for confidence is strength of the rule.
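For example (an illustrative calculation, not taken from the original file): suppose the four transactions are T1 = {1, 2, 3}, T2 = {1, 2}, T3 = {1, 3} and T4 = {2, 3}, and the rule is 1 => 2. The itemset {1, 2} occurs in T1 and T2, so support = 2/4 = 50%. Item 1 occurs in T1, T2 and T3, so confidence = support({1, 2}) / support({1}) = 2/3, or about 66.7%. This is exactly the computation the program below performs.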

Program code:-

#include <cstdio>

int main()
{
    int *t1, *t2, *t3, *t4, n, temp, num1, num2;
    float uni = 0.0, left = 0.0; // uni: transactions containing both items; left: transactions containing num1
    int na1[4], na2[4];          // presence flags for num1 / num2 in each of the 4 transactions

    for (int i = 0; i < 4; ++i) {
        na1[i] = 0;
        na2[i] = 0;
    }

    printf("Enter the maximum size of a transaction: \n");
    scanf("%d", &n);

    // value-initialise the transaction arrays so unused slots are 0, not garbage
    t1 = new int[n]();
    t2 = new int[n]();
    t3 = new int[n]();
    t4 = new int[n]();

    printf("Enter the elements of 1st transaction: (enter 0 to exit) \n");
    for (int i = 0; i < n; ++i) {
        scanf("%d", &temp);
        if (temp == 0) break;
        t1[i] = temp;
    }
    printf("Enter the elements of 2nd transaction: (enter 0 to exit) \n");
    for (int i = 0; i < n; ++i) {
        scanf("%d", &temp);
        if (temp == 0) break;
        t2[i] = temp;
    }
    printf("Enter the elements of 3rd transaction: (enter 0 to exit) \n");
    for (int i = 0; i < n; ++i) {
        scanf("%d", &temp);
        if (temp == 0) break;
        t3[i] = temp;
    }
    printf("Enter the elements of 4th transaction: (enter 0 to exit) \n");
    for (int i = 0; i < n; ++i) {
        scanf("%d", &temp);
        if (temp == 0) break;
        t4[i] = temp;
    }

    printf("Enter the rule: \n");
    scanf("%d %d", &num1, &num2);

    // mark which transactions contain num1 (na1) and num2 (na2)
    for (int i = 0; i < n; ++i) {
        if (t1[i] == num1) na1[0] = 1; else if (t1[i] == num2) na2[0] = 1;
        if (t2[i] == num1) na1[1] = 1; else if (t2[i] == num2) na2[1] = 1;
        if (t3[i] == num1) na1[2] = 1; else if (t3[i] == num2) na2[2] = 1;
        if (t4[i] == num1) na1[3] = 1; else if (t4[i] == num2) na2[3] = 1;
    }

    for (int i = 0; i < 4; ++i) {
        if (na1[i] == 1 && na2[i] == 1) uni++; // both LHS and RHS present
        if (na1[i] == 1) left++;               // LHS present
    }

    printf("support for the given rule: %f%% \n", (uni / 4) * 100);
    printf("confidence for the given rule: %f%% \n", (uni / left) * 100);

    delete[] t1;
    delete[] t2;
    delete[] t3;
    delete[] t4;
    return 0;
}

Output:-


Experiment No. 5

AIM: - To study the basics of WEKA software.

THEORY :- WEKA stands for “Waikato Environment for Knowledge Analysis” which was developed at the University of Waikato in New Zealand. WEKA is extensible and has become a collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost every platform. WEKA contains several data mining tools including classifiers, clustering algorithms, association rules and functions. WEKA also has nice graphical features and many pre-processing capabilities. Some of the more popular WEKA tools include: decision trees, production rules, the apriori algorithm, Bayes’ classifier, linear regression, logistic regression, and the K-Means algorithm.

MAIN FEATURES: -
- 49 data pre-processing tools
- 76 classification/regression algorithms
- 8 clustering algorithms
- 15 attribute/subset evaluators + 10 search algorithms for feature selection
- 3 algorithms for finding association rules
- 3 graphical user interfaces:
  "The Explorer" (exploratory data analysis)
  "The Experimenter" (experimental environment)
  "The Knowledge Flow" (new process-model-inspired interface)

In Weka, you can read files in a variety of formats: WEKA's ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension. The data files normally used by Weka are in the ARFF file format or in CSV format.

ARFF Format: -
The ARFF format consists of special tags to indicate different things in the data file (foremost: attribute names, attribute types, attribute values and the data). An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes. ARFF files were developed by the Machine Learning Project at the Department of Computer Science of The University of Waikato for use with the Weka machine learning software. ARFF files have two distinct sections: the first section is the Header information, which is followed by the Data information.
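A minimal ARFF file might look like this (a sketch based on the weather dataset that ships with WEKA):

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute play {yes, no}

@data
sunny, 85, 85, no
overcast, 83, 86, yes
rainy, 70, 96, yes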

CSV Format: -
CSV stands for "Comma Separated Values". It is a standard means of exchanging data, and virtually all spreadsheets and databases can import data presented in this fashion. The comma-separated values file format is a set of file formats used to store tabular data in which numbers and text are stored in plain textual form that can be read in a text editor. Lines in the text file represent rows of a table, and commas in a line separate the fields in the table's row. CSV is a simple file format that is widely supported, so it is often used to move tabular data between different computer programs that support compatible CSV formats. For example, a CSV file might be used to transfer information from a database program to a spreadsheet.
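The same instances in CSV form carry the attribute names as a header row:

outlook,temperature,humidity,play
sunny,85,85,no
overcast,83,86,yes
rainy,70,96,yes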


Experiment No. 6

AIM: - To study the GUI of WEKA.

THEORY: - LAUNCHING WEKA
The Weka GUI Chooser provides a starting point for launching Weka's main GUI applications and supporting tools. The GUI Chooser consists of four buttons, one for each of the four major Weka applications, and four menus.

The buttons can be used to start the following applications:
- Explorer: An environment for exploring data with WEKA.
- Experimenter: An environment for performing experiments and conducting statistical tests between learning schemes.
- Knowledge Flow: This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.
- Simple CLI: Provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command-line interface.

The first interface for WEKA will appear as shown below:

EXPLORER

Section Tabs
At the very top of the window, just below the title bar, is a row of tabs. When the Explorer is first started only the first tab is active; the others are greyed out. This is because it is necessary to open (and potentially pre-process) a dataset before starting to explore the data. The tabs are as follows:
1. Preprocess. Choose and modify the data being acted on.
2. Classify. Train and test learning schemes that classify or perform regression.
3. Cluster. Learn clusters for the data.
4. Associate. Learn association rules for the data.
5. Select attributes. Select the most relevant attributes in the data.
6. Visualize. View an interactive 2D plot of the data.
Once the tabs are active, clicking on them flicks between different screens, on which the respective actions can be performed. The bottom area of the window (including the status box, the log button, and the Weka bird) stays visible regardless of which section you are in.


Weka Knowledge Explorer

The Weka Knowledge Explorer is an easy to use graphical user interface that harnesses the power of the weka software. Each of the major weka packages Filters, Classifiers, Clusterers, Associations, and Attribute Selection is represented in the Explorer along with a Visualization tool which allows datasets and the predictions of Classifiers and Clusterers to be visualized in two dimensions.

Preprocess Panel: - The preprocess panel is the start point for knowledge exploration. From this panel you can load datasets, browse the characteristics of attributes and apply any combination of Weka's unsupervised filters to the data.

Classifier Panel: - The classifier panel allows you to configure and execute any of the Weka classifiers on the current dataset. You can choose to perform a cross-validation or test on a separate dataset. Classification errors can be visualized in a pop-up data visualization tool. If the classifier produces a decision tree, it can be displayed graphically in a pop-up tree visualizer.

Cluster Panel: - From the cluster panel you can configure and execute any of the Weka clusterers on the current dataset. Clusters can be visualized in a pop-up data visualization tool.


Associate Panel: - From the associate panel you can mine the current dataset for association rules using the Weka associators.

Visualize Panel: - This panel displays a scatter plot matrix for the current dataset. The size of the individual cells and the size of the points they display can be adjusted using the slider controls at the bottom of the panel. The number of cells in the matrix can be changed by pressing the "Select Attributes" button and then choosing the attributes to be displayed. When a dataset is large, plotting performance can be improved by displaying only a subsample of the current dataset. Clicking on a cell in the matrix pops up a larger plot panel window that displays the view from that cell. This panel allows you to visualize the current dataset in one and two dimensions. When the colouring attribute is discrete, each value is displayed as a different colour; when the colouring attribute is continuous, a spectrum is used to indicate the value. Attribute "bars" (down the right-hand side of the panel) provide a convenient summary of the discriminating power of the attributes individually.

This panel can also be popped up in a separate window from the classifier panel and the cluster panel to allow you to visualize predictions made by classifiers/clusterers. When the class is discrete, misclassified points are shown by a box in the colour corresponding to the class predicted by the classifier; when the class is continuous, the size of each plotted point varies in proportion to the magnitude of the error made by the classifier.

Interactive decision tree construction: - Weka has a novel interactive decision tree classifier (weka.classifiers.trees.UserClassifier). Through an intuitive, easy-to-use graphical interface, UserClassifier allows the user to manually construct a decision tree by defining bi-variate splits in the instance space. The structure of the tree can be viewed and revised at any point in the construction phase.


Experiment No. 7

AIM: - To implement clustering in WEKA.

THEORY: - When clustering is accomplished by modelling the distribution of instances probabilistically, it is possible to check how well the model fits the data by computing the likelihood of a set of test data given the model. Weka measures goodness-of-fit by the logarithm of the likelihood, or log-likelihood; the larger this quantity, the better the model fits the data. Instead of using a single test set, it is also possible to compute a cross-validation estimate of the log-likelihood using -x. Weka also outputs how many instances are assigned to each cluster. For clustering algorithms that do not model the instance distribution probabilistically, these are the only statistics that Weka outputs. It is easy to find out which clusterers generate a probability distribution: they are subclasses of weka.clusterers.DistributionClusterer.

There are two clustering algorithms in weka.clusterers: weka.clusterers.EM and weka.clusterers.Cobweb.

The former is an implementation of the EM algorithm and the latter implements the incremental clustering algorithm. They can handle both numeric and nominal attributes.

Clustering techniques are used for combining observed examples into clusters (groups) which satisfy two main criteria:
1. Each group or cluster is homogeneous; examples that belong to the same group are similar to each other.
2. Each group or cluster should be different from other clusters; that is, examples that belong to one cluster should be different from the examples of other clusters.
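The same steps carried out in the Explorer procedure below can also be run against the WEKA Java API. The following is a minimal sketch, not the procedure from the original file; the file name weather.arff is an assumption.

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClusterDemo {
    public static void main(String[] args) throws Exception {
        // load an ARFF dataset (the file name is an assumption)
        Instances data = DataSource.read("weather.arff");

        // build the EM clusterer with its default options
        EM em = new EM();
        em.buildClusterer(data);

        // evaluate on the same data and print the cluster
        // assignments and the log-likelihood
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(em);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}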

Procedure: -
1. Choose the Weka Explorer.


2. In the Preprocess panel, choose a file with the .arff extension from the data directory.

3. Select the attributes which you have to analyse

4. Go to the Cluster panel at the top and choose the clusterer.


5. Start clustering.

6. Then visualize the clusters via the Visualize panel at the top.


Experiment No. 8

AIM: - To implement classification in WEKA.

THEORY: - Classification is a data mining (machine learning) technique used to predict group membership for data instances. Classification is a different technique from clustering. Classification is similar to clustering in that it also segments customer records into distinct segments called classes. But unlike clustering, a classification analysis requires that the end-user/analyst know ahead of time how classes are defined. For example, classes can be defined to represent the likelihood that a customer defaults on a loan (Yes/No). It is necessary that each record in the dataset used to build the classifier already have a value for the attribute used to define classes. Because each record has a value for the attribute used to define the classes, and because the end-user decides on the attribute to use, classification is much less exploratory than clustering. The objective of a classifier is not to explore the data to discover interesting segments, but rather to decide how new records should be classified. Classification routines in data mining also use a variety of algorithms, and the particular algorithm used can affect the way records are classified. A common approach for classifiers is to use decision trees to partition and segment records. New records can be classified by traversing the tree from the root through branches and nodes to a leaf representing a class. The path a record takes through a decision tree can then be represented as a rule.

The text in the Classifier output area has scroll bars allowing you to browse the results. Clicking with the left mouse button into the text area while holding Alt and Shift brings up a dialog that enables you to save the displayed output in a variety of formats (currently, JPEG and EPS). Of course, you can also resize the Explorer window to get a larger display area. The output is split into several sections:
1. Run information. A list of information giving the learning scheme options, relation name, instances, attributes and test mode that were involved in the process.
2. Classifier model (full training set). A textual representation of the classification model that was produced on the full training data.

The results of the chosen test mode are broken down thus:
1. Summary. A list of statistics summarizing how accurately the classifier was able to predict the true class of the instances under the chosen test mode.
2. Detailed Accuracy By Class. A more detailed per-class breakdown of the classifier's prediction accuracy.
3. Confusion Matrix. Shows how many instances have been assigned to each class. Elements show the number of test examples whose actual class is the row and whose predicted class is the column.
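The J48 cross-validation run described in the procedure below can also be scripted against the WEKA Java API. A minimal sketch follows; the file name and the choice of the last attribute as the class are assumptions.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifyDemo {
    public static void main(String[] args) throws Exception {
        // load a dataset; the last attribute is assumed to be the class
        Instances data = DataSource.read("weather.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // J48 is WEKA's implementation of the C4.5 decision tree learner
        J48 tree = new J48();

        // 10-fold cross-validation, then print the summary statistics,
        // the per-class accuracy and the confusion matrix
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toMatrixString());
    }
}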

Procedure: -
1. Select the Explorer box.


2. In the Preprocess panel, choose a file with the .arff extension from the data directory.

3. Choose the Classify panel at the top and select the J48 classifier under trees.


4. Start the classification with the cross-validation test option.

5. Select "Visualize tree" to see the result.


Experiment No. 9

AIM: - To implement association in WEKA.

THEORY: - Association rule mining is a technique for discovering unsuspected data dependencies and is one of the best-known data mining techniques. The basic idea is to identify, from a given database consisting of itemsets (e.g. shopping baskets), whether the occurrence of specific items also implies the occurrence of other items with a relatively high probability. In principle, the answer to this question could easily be found by exhaustive exploration of all possible dependencies, which is, however, prohibitively expensive. Association rule mining therefore solves the problem of how to search efficiently for those dependencies.
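Apriori, the associator used in the procedure below, can also be driven from the WEKA Java API. A minimal sketch follows; the file name supermarket.arff (a dataset that ships with WEKA) is an assumption, and Apriori's default support and confidence settings are used.

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AssociateDemo {
    public static void main(String[] args) throws Exception {
        // Apriori expects nominal attributes
        Instances data = DataSource.read("supermarket.arff");

        // mine association rules with the default minimum
        // support and confidence thresholds
        Apriori apriori = new Apriori();
        apriori.buildAssociations(data);

        // the discovered rules, with their support and
        // confidence, are printed via toString()
        System.out.println(apriori);
    }
}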

Procedure: -
1. Select the Explorer box of Weka.

2. Choose a file with the .arff extension from the data directory.


3. After choosing the file, go to the Associate panel and choose the associator.

4. After choosing the associator, start the process.
