
Upload: anjan-krishnamurthy

Post on 27-Jan-2015


DESCRIPTION

A short description of the WEKA data mining tool

TRANSCRIPT

Page 1: Wek1

Machine Learning with WEKA

MaxQDPro Team
Anjan.K, Harish.R
II Sem M.Tech CSE

04/10/23 Machine learning with WEKA 1

Page 2: Wek1

Agenda

Page 3: Wek1

Introduction to WEKA

Waikato Environment for Knowledge Analysis

Weka is a collection of machine learning algorithms for data mining tasks.

Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.

Official Web Site: http://www.cs.waikato.ac.nz/ml/weka/


Page 4: Wek1

WEKA System Hierarchy

User Application
◦ Model: Serialized Objects

Weka system
◦ User Interface - weka.gui: Simple CLI, Explorer, Experimenter, Knowledge Flow, ArffViewer
◦ Middle layer - algorithms, evaluation support and UI support: weka.classifiers, weka.estimators, weka.filters, weka.associations, weka.clusterers, weka.attributeSelection, weka.experiment, weka.datagenerator
◦ Basic support - weka.core

Data sources
◦ Database/data warehouse (via JDBC)
◦ ARFF, CSV, C4.5 documents

Page 5: Wek1

Weka’s Role in the Big Picture

Input
• Raw data

Data mining by Weka
• Pre-processing
• Classification
• Regression
• Clustering
• Association Rules
• Visualization

Output
• Result

Page 6: Wek1

KDD Process

Data -> Selection -> Preprocessing -> Transformation -> Data Mining -> Interpretation/Evaluation -> Knowledge

Page 7: Wek1

WEKA: the software

Machine learning/data mining software written in Java (distributed under the GNU General Public License)

Used for research, education, and applications

Complements “Data Mining” by Witten & Frank

Main features:
◦ Comprehensive set of data pre-processing tools, learning algorithms and evaluation methods
◦ Graphical user interfaces (incl. data visualization)
◦ Environment for comparing learning algorithms

Page 8: Wek1

History

Project funded by the NZ government since 1993
◦ Develop state-of-the-art workbench of data mining tools
◦ Explore fielded applications
◦ Develop new fundamental methods

Page 9: Wek1

History

July 1997 - WEKA 2.2
◦ Schemes: 1R, T2, K*, M5, M5Class, IB1-4, FOIL, PEBLS, support for C5
◦ Included a facility (based on Unix makefiles) for configuring and running large scale experiments

Early 1997 - decision was made to rewrite WEKA in Java
◦ Originated from code written by Eibe Frank for his PhD
◦ Originally codenamed JAWS (JAva Weka System)

May 1998 - WEKA 2.3
◦ Last release of the Tcl/Tk-based system

Mid 1999 - WEKA 3 (100% Java) released
◦ Version to complement the Data Mining book
◦ Development version (including GUI)

Page 10: Wek1

WEKA: versions

There are several versions of WEKA:
◦ WEKA 3.4: “book version” compatible with description in data mining book
◦ WEKA 3.5.5: “development version” with lots of improvements

This talk is based on a nightly snapshot of WEKA 3.5.5 (12-Feb-2007)

The latest releases are in the WEKA 3.6 series

Page 11: Wek1


java weka.gui.GUIChooser

Page 12: Wek1

Explorer - Preprocessing

Import from files: ARFF, CSV, C4.5, binary

Import from URL or an SQL database (using JDBC)

Preprocessing filters
◦ Adding/removing attributes
◦ Attribute value substitution
◦ Discretization (MDL, Kononenko, etc.)
◦ Time series filters (delta, shift)
◦ Sampling, randomization
◦ Missing value management
◦ Normalization and other numeric transformations
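Two of the filters listed above can be illustrated in a few lines. This is only a sketch in plain Python, not WEKA's filter API: it shows min-max normalization and a simple equal-width discretization (a deliberately simpler technique than the MDL-based discretization named on the slide). The function names and the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant attribute: map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals (0-based index)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


petal_lengths = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical attribute values
normalized = min_max_normalize(petal_lengths)
bins = equal_width_bins(petal_lengths, 2)
```

With these inputs, `normalized` runs from 0.0 to 1.0 in steps of 0.25, and `bins` splits the values into a lower and an upper interval.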

Page 13: Wek1

ARFF File Format

Requires declarations of @RELATION, @ATTRIBUTE and @DATA

@RELATION declaration associates a name with the dataset
◦ @RELATION <relation-name>
    @RELATION iris

@ATTRIBUTE declaration specifies the name and type of an attribute
◦ @ATTRIBUTE <attribute-name> <datatype>
◦ Datatype can be numeric, nominal, string or date
    @ATTRIBUTE sepallength NUMERIC
    @ATTRIBUTE petalwidth NUMERIC
    @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA declaration is a single line denoting the start of the data segment
◦ Missing values are represented by ?
    @DATA
    5.1, 3.5, 1.4, 0.2, Iris-setosa
    4.9, ?, 1.4, ?, Iris-versicolor
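To make the three declarations concrete, here is a minimal sketch of a reader for small, dense ARFF files, written in plain Python rather than with WEKA's own loader. It ignores quoting, sparse rows and date formats. The sample text is an assumption: it adds `sepalwidth` and `petallength` declarations so that the five data columns on the slide each have an attribute.

```python
def parse_arff(text):
    """Parse a small, dense ARFF document into (relation, attributes, rows).

    Simplified on purpose: no quoted names, no sparse rows, no escaping.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):       # skip blanks and comments
            continue
        if in_data:
            # "?" marks a missing value; represent it as None.
            cells = [c.strip() for c in line.split(",")]
            rows.append([None if c == "?" else c for c in cells])
            continue
        keyword = line.split(None, 1)[0].lower()    # declarations are case-insensitive
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, datatype = line.split(None, 2)
            attributes.append((name, datatype))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


SAMPLE = """\
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, ?, 1.4, ?, Iris-versicolor
"""

relation, attributes, rows = parse_arff(SAMPLE)
```

Values are kept as strings here; converting NUMERIC columns to floats would be a natural next step.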

Page 14: Wek1

Explorer - Classification

Predicted attribute is categorical

Implemented methods
◦ Naïve Bayes
◦ decision trees and rules
◦ neural networks
◦ support vector machines
◦ instance-based classifiers …

Evaluation
◦ test set
◦ cross-validation ...

Page 15: Wek1

J48 = Decision Tree

petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)

Numbers in parentheses: instances under the node / number classified wrongly
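A tree like the one on this slide is just a nest of threshold tests. The sketch below hand-translates the printed J48 tree into a Python function. It is an illustration of how the tree is read, not output produced by WEKA, and the function name is hypothetical.

```python
def classify_iris(petallength, petalwidth):
    """Follow the printed J48 tree from the root down to a leaf."""
    if petalwidth <= 0.6:
        return "Iris-setosa"
    if petalwidth <= 1.7:
        if petallength <= 4.9:
            return "Iris-versicolor"
        # petallength > 4.9: one more split on petalwidth
        if petalwidth <= 1.5:
            return "Iris-virginica"
        return "Iris-versicolor"
    # petalwidth > 1.7
    return "Iris-virginica"


prediction = classify_iris(petallength=1.4, petalwidth=0.2)
```

Only two of the four iris measurements appear in the tree, so the function needs only those two arguments.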

Page 16: Wek1

Cross-validation

Correctly Classified Instances      143    95.3 %
Incorrectly Classified Instances      7    4.67 %

Default is 10-fold cross-validation, i.e.
◦ Split data into 10 equal-sized pieces
◦ Train on 9 pieces and test on the remainder
◦ Do for all possibilities and average
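The three steps above can be sketched as follows in plain Python. This is an illustration under assumptions, not WEKA's implementation: WEKA additionally stratifies the folds for nominal classes, which is omitted here, and `evaluate` is a hypothetical callback that trains on the first index list and returns a score on the second.

```python
import random


def k_fold_indices(n_instances, k=10, seed=1):
    """Shuffle instance indices and deal them into k (near-)equal folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_instances, evaluate, k=10, seed=1):
    """Train on k-1 folds, test on the held-out fold, and average the scores."""
    folds = k_fold_indices(n_instances, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k


folds = k_fold_indices(150)   # 150 instances, as in the iris data
```

For 150 instances and k=10, each fold holds 15 instances, so every model is trained on 135 and tested on 15.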

Page 17: Wek1

J48 Confusion Matrix

Old data set from statistics: 50 of each class

  a  b  c   <-- classified as
 49  1  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  3 47 |  c = Iris-virginica

Page 18: Wek1

Precision, Recall, and Accuracy

Precision: probability of being correct given the class you predicted.
◦ Precision of Iris-setosa is 49/49 = 100%
◦ Positive predictive value in medical literature

Recall: probability of correctly identifying class.
◦ Recall for Iris-setosa is 49/50 = 98%
◦ Sensitivity in medical literature

Accuracy: # right/total = 143/150 ≈ 95%
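These three figures can be recomputed directly from the J48 confusion matrix (rows are actual classes, columns are predicted classes). The following is a small plain-Python sketch; the function names are illustrative and not part of WEKA.

```python
LABELS = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Rows: actual class, columns: predicted class (J48 confusion matrix).
CONFUSION = [
    [49, 1, 0],
    [0, 47, 3],
    [0, 3, 47],
]


def precision(matrix, c):
    """Correct predictions of class c divided by all predictions of class c."""
    predicted = sum(row[c] for row in matrix)
    return matrix[c][c] / predicted if predicted else 0.0


def recall(matrix, c):
    """Correct predictions of class c divided by all actual instances of c."""
    actual = sum(matrix[c])
    return matrix[c][c] / actual if actual else 0.0


def accuracy(matrix):
    """Diagonal (correct) count divided by the total number of instances."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


setosa = LABELS.index("Iris-setosa")
setosa_precision = precision(CONFUSION, setosa)   # 49/49
setosa_recall = recall(CONFUSION, setosa)         # 49/50
overall_accuracy = accuracy(CONFUSION)            # 143/150
```

The same functions give 47/51 precision for Iris-versicolor, since 51 instances in total were predicted as that class.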

Page 19: Wek1

Explorer - Clustering

Implemented methods
◦ k-Means
◦ EM
◦ Cobweb
◦ X-means
◦ FarthestFirst …

Clusters can be visualized and compared to “true” clusters (if given)

Evaluation based on log-likelihood if clustering scheme produces a probability distribution

Page 20: Wek1

Explorer - Associations

WEKA contains the Apriori algorithm (among others) for learning association rules
◦ Works only with discrete data

Can identify statistical dependencies between groups of attributes:
◦ milk, butter ==> bread, eggs (with confidence 0.9 and support 2000)

Apriori can compute all rules that have a given minimum support and exceed a given confidence
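Support and confidence are simple counts over the data. The sketch below computes them for a single rule in plain Python. It does not implement Apriori's candidate generation, and the baskets are hypothetical, invented only for illustration. Support is taken as an absolute count, matching the "support 2000" wording on the slide.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions matching the antecedent that also match the consequent."""
    covered = support_count(transactions, antecedent)
    if covered == 0:
        return 0.0
    return support_count(transactions, antecedent | consequent) / covered


# Hypothetical market baskets, for illustration only.
BASKETS = [
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter", "bread"},
    {"milk", "eggs"},
    {"bread"},
]

rule_support = support_count(BASKETS, {"milk", "butter", "bread", "eggs"})
rule_confidence = confidence(BASKETS, {"milk", "butter"}, {"bread", "eggs"})
```

Here the rule milk, butter ==> bread, eggs is supported by 2 baskets and holds in 2 of the 3 baskets that contain milk and butter.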

Page 21: Wek1

Multiple-Level Association Rule Mining in Weka

Concept hierarchy (Level 1), from top to bottom:
◦ Food
◦ Milk, Bread, Fruit
◦ 2%, Skimmed, Fat Free, Wheat, White, Apple, Banana, Orange
◦ Inorganic, Organic

Page 22: Wek1

Multiple-Level Association Rule Mining in Weka

Concept hierarchy (Level 2), from top to bottom:
◦ Food
◦ Milk, Bread, Fruit
◦ 2%, Skimmed, Fat Free, Wheat, White, Apple, Banana, Orange
◦ Inorganic, Organic

Page 23: Wek1

Multiple-Level Association Rule Mining in Weka

Concept hierarchy (Level 3), from top to bottom:
◦ Food
◦ Milk, Bread, Fruit
◦ 2%, Skimmed, Fat Free, Wheat, White, Apple, Banana, Orange
◦ Inorganic, Organic

Page 24: Wek1

Sample Execution (1)

java weka.associations.Apriori -t data/weather.nominal.arff -I yes

Apriori

=======

Minimum support: 0.2

Minimum confidence: 0.9

Number of cycles performed: 17

Generated sets of large itemsets:

Size of set of large itemsets L(1): 12


Page 25: Wek1

Sample Execution (2)

Best rules found:

1. humidity=normal windy=FALSE 4 ==> play=yes 4 (1)

2. temperature=cool 4 ==> humidity=normal 4 (1)

3. outlook=overcast 4 ==> play=yes 4 (1)

4. temperature=cool play=yes 3 ==> humidity=normal 3 (1)

5. outlook=rainy windy=FALSE 3 ==> play=yes 3 (1)

6. outlook=rainy play=yes 3 ==> windy=FALSE 3 (1)

7. outlook=sunny humidity=high 3 ==> play=no 3 (1)

8. outlook=sunny play=no 3 ==> humidity=high 3 (1)
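In each rule, the number before ==> is how many instances match the left-hand side, the number after the right-hand side is how many of those also match it, and the value in parentheses is the resulting confidence. The sketch below recounts two of the rules in plain Python. The 14 rows are the nominal weather data as I recall it from the WEKA distribution; treat them as an assumption and check them against data/weather.nominal.arff.

```python
ATTRS = ["outlook", "temperature", "humidity", "windy", "play"]

# Nominal weather data, reproduced from memory (assumption: verify against the ARFF file).
WEATHER = """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
""".splitlines()

ROWS = [dict(zip(ATTRS, line.split(","))) for line in WEATHER]


def rule_counts(rows, antecedent, consequent):
    """Return (instances matching the antecedent, of those also matching the consequent)."""
    covered = [r for r in rows if all(r[a] == v for a, v in antecedent.items())]
    correct = [r for r in covered if all(r[a] == v for a, v in consequent.items())]
    return len(covered), len(correct)


# Rule 1: humidity=normal windy=FALSE ==> play=yes
rule1 = rule_counts(ROWS, {"humidity": "normal", "windy": "FALSE"}, {"play": "yes"})
# A rule that is not listed: outlook=sunny ==> play=no
weak = rule_counts(ROWS, {"outlook": "sunny"}, {"play": "no"})
```

Rule 1 is matched by 4 instances and holds for all 4, giving confidence 1. The rule outlook=sunny ==> play=no holds for only 3 of 5 matching instances (confidence 0.6), below the 0.9 minimum, which is why it does not appear in the list.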


Page 26: Wek1

Regression

Predicted attribute is continuous

Implemented methods
◦ (linear regression)
◦ neural networks
◦ regression trees …

Page 27: Wek1

Explorer - Attribute Selection

Very flexible: arbitrary combination of search and evaluation methods

Both filtering and wrapping methods

Search methods
◦ best-first
◦ genetic
◦ ranking ...

Evaluation measures
◦ ReliefF
◦ information gain
◦ gain ratio …

Page 28: Wek1

Explorer - Data Visualization

Visualization very useful in practice: e.g. helps to determine difficulty of the learning problem

WEKA can visualize single attributes (1-d) and pairs of attributes (2-d)
◦ To do: rotating 3-d visualizations (Xgobi-style)

Color-coded class values

“Jitter” option to deal with nominal attributes (and to detect “hidden” data points)

“Zoom-in” function

Page 29: Wek1

Performing experiments

Experimenter makes it easy to compare the performance of different learning schemes

For classification and regression problems

Results can be written into file or database

Evaluation options: cross-validation, learning curve, hold-out

Can also iterate over different parameter settings

Significance-testing built in!

Page 30: Wek1

The Knowledge Flow GUI

Java-Beans-based interface for setting up and running machine learning experiments

Data sources, classifiers, etc. are beans and can be connected graphically

Data “flows” through components: e.g., “data source” -> “filter” -> “classifier” -> “evaluator”

Layouts can be saved and loaded again later

cf. Clementine ™

Page 31: Wek1

Projects based on WEKA

45 projects currently (30/01/07) listed on the WekaWiki

Incorporate/wrap WEKA
◦ GRB Tool Shed - a tool to aid gamma ray burst research
◦ YALE - facility for large scale ML experiments
◦ GATE - NLP workbench with a WEKA interface
◦ Judge - document clustering and classification
◦ RWeka - an R interface to Weka

Extend/modify WEKA
◦ BioWeka - extension library for knowledge discovery in biology
◦ WekaMetal - meta learning extension to WEKA
◦ Weka-Parallel - parallel processing for WEKA
◦ Grid Weka - grid computing using WEKA
◦ Weka-CG - computational genetics tool library

Page 32: Wek1

WEKA and PENTAHO

Pentaho – The leader in Open Source Business Intelligence (BI)

September 2006 – Pentaho acquires the Weka project (exclusive license and SF.net page)

Weka will be used/integrated as data mining component in their BI suite

Weka will still be available as GPL open source software

Most likely to evolve 2 editions:
◦ Community edition
◦ BI oriented edition

Page 33: Wek1

Limitations of WEKA

Traditional algorithms need to have all data in main memory
==> big datasets are an issue

Solution:
◦ Incremental schemes
◦ Stream algorithms

MOA: “Massive Online Analysis” (not only a flightless bird, but also extinct!)
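The idea behind incremental and stream schemes is that the model is updated one instance at a time, so the full dataset never has to sit in main memory. As a minimal illustration (plain Python, not a WEKA or MOA API), the sketch below keeps a running mean and variance of a numeric stream using Welford's method. The class name and the sample stream are hypothetical.

```python
class RunningStats:
    """Running mean and sample variance, updated one value at a time (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0          # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def stream():
    """Stand-in for a data source too large to load at once (hypothetical values)."""
    for x in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
        yield x


stats = RunningStats()
for value in stream():
    stats.update(value)         # constant memory: only n, mean and _m2 are kept
```

An incremental classifier follows the same pattern: each arriving instance adjusts a few stored statistics and is then discarded.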

Page 34: Wek1

Summary

◦ Introduction to WEKA
◦ WEKA System Hierarchy
◦ WEKA features
◦ Brief History
◦ Explorer
◦ Experimenter
◦ CLI
◦ Knowledge Flow
◦ Projects Based on WEKA
◦ Limitations of WEKA

Page 35: Wek1

26/Sep/2006 S.P.Vimal, CS IS Group, BITS-Pilani 35

References

1. Ian H. Witten and Eibe Frank (2005). "Data Mining: Practical Machine Learning Tools and Techniques", 2nd edition, Morgan Kaufmann, San Francisco.

2. http://www.itl.nist.gov/div898/handbook/index.htm