UNIVERSITY OF CALIFORNIA, SAN DIEGO
SAN DIEGO SUPERCOMPUTER CENTER
Introduction to Predictive Analytics Tools: Weka, R
Predictive Analytics Center of Excellence
San Diego Supercomputer Center, University of California, San Diego

Available Data Mining Tools
COTS:
• IBM Intelligent Miner
• SAS Enterprise Miner
• Oracle ODM
• MicroStrategy
• Microsoft DBMiner
• Pentaho
• Matlab
• Teradata
Open Source:
• WEKA
• KNIME
• Orange
• RapidMiner
• NLTK
• R
• Rattle

Agenda
• WEKA
  • Intro and background
  • Data Preparation
  • Creating Models / Applying Algorithms
  • Evaluating Results
• R
  • R Background
  • R Basics
    • Outline
    • R-Studio Overview
  • Hands On (homework)

WEKA

Download and Install WEKA
• Website: http://www.cs.waikato.ac.nz/~ml/weka/index.html
5 7/1/14

What is WEKA?
• Waikato Environment for Knowledge Analysis
  • WEKA is a data mining/machine learning application developed by the Department of Computer Science, University of Waikato, New Zealand
  • WEKA is open source software written in Java
  • WEKA is a collection of machine learning algorithms and tools for data mining tasks
    • data pre-processing, classification, regression, clustering, association, and visualization
  • WEKA is well-suited for developing new machine learning schemes
• The weka is also a bird found only in New Zealand

Advantages of Weka
• Free availability
  • under the GNU General Public License
• Portability
  • fully implemented in the Java programming language and thus runs on almost any modern computing platform
    • Windows, Mac OS X and Linux
• Comprehensive collection of data preprocessing and modeling techniques
  • Supports standard data mining tasks: data preprocessing, clustering, classification, regression, visualization, and feature selection
• Easy-to-use GUI
• Provides access to SQL databases
  • using Java Database Connectivity (JDBC), and can process the result returned by a database query

Disadvantages
• Sequence modeling is not covered by the algorithms included in the Weka distribution
• Not capable of multi-relational data mining

WEKA Walk Through: Main GUI
• Three graphical user interfaces
  • “The Explorer” (exploratory data analysis)
    • pre-process data
    • build “classifiers”
    • cluster data
    • find associations
    • attribute selection
    • data visualization
  • “The Experimenter” (experimental environment)
    • used to compare performance of different learning schemes
  • “The KnowledgeFlow” (new process-model-inspired interface)
    • Java-Beans-based interface for setting up and running machine learning experiments
• Command line interface (“Simple CLI”)
More at: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html

WEKA: Explorer: Preprocess
• Importing data
  • Data format
    • Uses flat text files to describe the data
  • Data can be imported from a file in various formats:
    • ARFF, CSV, C4.5, binary
  • Data can also be read from a URL or from an SQL database (using JDBC)

WEKA: ARFF file format
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
A more thorough description is available here: http://www.cs.waikato.ac.nz/~ml/weka/arff.html
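To make the layout of an ARFF file concrete, here is a minimal base-R sketch that reads the heart-disease example by hand. It is an illustration only, not how WEKA itself parses ARFF: it assumes one @attribute per line, no quoted values, and "?" as the missing-value marker. The variable names (arff_text, attr_names, heart) are made up for this example.

```r
# The ARFF text from the slide (the trailing "..." is left out).
arff_text <- c(
  "@relation heart-disease-simplified",
  "@attribute age numeric",
  "@attribute sex { female, male}",
  "@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}",
  "@attribute cholesterol numeric",
  "@attribute exercise_induced_angina { no, yes}",
  "@attribute class { present, not_present}",
  "@data",
  "63,male,typ_angina,233,no,not_present",
  "67,male,asympt,286,yes,present",
  "67,male,asympt,229,yes,present",
  "38,female,non_anginal,?,no,not_present"
)

# The attribute name is the second whitespace-separated token
# on each @attribute line.
attr_lines <- grep("^@attribute", arff_text, value = TRUE)
attr_names <- sapply(strsplit(attr_lines, "[[:space:]]+"), `[`, 2)

# Everything after the @data marker is plain comma-separated rows;
# "?" marks a missing value.
data_start <- which(arff_text == "@data") + 1
data_rows <- arff_text[data_start:length(arff_text)]
heart <- read.csv(text = paste(data_rows, collapse = "\n"),
                  header = FALSE, col.names = attr_names,
                  na.strings = "?", stringsAsFactors = TRUE)
str(heart)
```

For real files, a dedicated reader such as read.arff in the foreign package is likely a better choice than hand parsing.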
University of Waikato 7/1/14

WEKA: Explorer: Preprocess
• Preprocessing data
  • Visualization
  • Filtering algorithms
    • Filters can be used to transform the data (e.g., turning numeric attributes into discrete ones) and make it possible to delete instances and attributes according to specific criteria
  • Removing noisy data
  • Adding additional attributes
  • Removing attributes
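The same kinds of transformation can be sketched in base R (this is R, not a WEKA filter). The small data frame reuses values from the ARFF example; the bin edges and label names are arbitrary choices made for illustration.

```r
# Toy data: a few rows in the spirit of the heart-disease example.
heart <- data.frame(
  age = c(63, 67, 67, 38),
  sex = c("male", "male", "male", "female"),
  cholesterol = c(233, 286, 229, NA)
)

# Discretization: turn numeric age into a nominal attribute.
# Bin edges (40 and 65) are illustrative, not from the slides.
heart$age_group <- cut(heart$age, breaks = c(0, 40, 65, Inf),
                       labels = c("up_to_40", "41_to_65", "over_65"))

# Delete instances by criterion: drop rows with missing cholesterol.
complete <- heart[!is.na(heart$cholesterol), ]

# Delete an attribute: remove the original numeric age column.
complete$age <- NULL

print(complete)
```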

WEKA: Explorer: Preprocess
• Used to define filters to transform data
• WEKA contains filters for:
  • Discretization, normalization, resampling, attribute selection, transforming, combining attributes, etc.
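Two of the filter types named above, normalization and resampling, can be sketched in a few lines of base R. This shows min-max normalization and resampling with replacement; WEKA's own filters may use different defaults, and the fourth cholesterol value is made up for the example.

```r
cholesterol <- c(233, 286, 229, 250)

# Min-max normalization: rescale values to the range [0, 1].
normalized <- (cholesterol - min(cholesterol)) /
  (max(cholesterol) - min(cholesterol))

# Resampling with replacement (a bootstrap sample of the same size).
set.seed(1)  # fixed seed so the sample is reproducible
resampled <- sample(cholesterol, size = length(cholesterol), replace = TRUE)

print(normalized)
print(resampled)
```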

Explorer: Visualize
• Visualization is very useful in practice
  • helps determine the difficulty of the learning problem
• WEKA can visualize single attributes (1-d) and pairs of attributes (2-d)
• Color-coded class values
• “Jitter” option to deal with nominal attributes (and to detect “hidden” data points)
• “Zoom-in” function
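Why jitter reveals “hidden” points can be shown with base R's jitter() function. This is an R sketch of the idea, not WEKA's implementation: several instances share the same coordinates and would be drawn on top of each other until a little random noise is added. The coordinates are invented.

```r
# Six instances, but only three distinct (x, y) positions.
x <- c(1, 1, 1, 2, 2, 3)
y <- c(1, 1, 1, 2, 2, 1)
distinct_before <- nrow(unique(data.frame(x, y)))

# Add a small amount of random noise to each coordinate.
set.seed(42)
xj <- jitter(x)
yj <- jitter(y)
distinct_after <- nrow(unique(data.frame(xj, yj)))

print(c(before = distinct_before, after = distinct_after))
# plot(xj, yj) would now show every instance as a separate point.
```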

Explorer: Attribute Selection
• Panel that can be used to investigate which (subsets of) attributes are the most predictive ones
• Attribute selection methods contain two parts:
  • A search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
  • An evaluation method: correlation-based, wrapper, information gain, chi-squared, …
• Very flexible: WEKA allows (almost) arbitrary combinations of these two
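One combination from the lists above, ranking as the search method and information gain as the evaluation method, can be sketched in base R. The functions entropy() and info_gain() are written for this example (they are not WEKA or R library functions), and the four rows come from the ARFF example.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]  # drop empty levels so log2(0) never occurs
  -sum(p * log2(p))
}

# Information gain of one nominal attribute with respect to the class:
# entropy before the split minus weighted entropy after it.
info_gain <- function(attribute, labels) {
  weights <- table(attribute) / length(attribute)
  conditional <- sapply(split(labels, attribute), entropy)
  as.numeric(entropy(labels) - sum(weights[names(conditional)] * conditional))
}

heart <- data.frame(
  sex = c("male", "male", "male", "female"),
  exercise_induced_angina = c("no", "yes", "yes", "no"),
  class = c("not_present", "present", "present", "not_present"),
  stringsAsFactors = FALSE
)

# Evaluate each attribute, then rank from most to least informative.
gains <- sapply(heart[c("sex", "exercise_induced_angina")],
                info_gain, labels = heart$class)
ranked <- sort(gains, decreasing = TRUE)
print(ranked)
```

With only four rows the numbers mean little; the point is the two-part structure of evaluating each attribute and then ranking.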

WEKA: Explorer: building “classifiers”
• Classifiers in WEKA are models for predicting nominal or numeric quantities
• Implemented learning schemes include:
  • Decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes’ nets, …
• “Meta”-classifiers include:
  • Bagging, boosting, stacking, error-correcting output codes, locally weighted learning, …
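As a small illustration of one scheme from the list, logistic regression, here is a sketch in R using glm() on the built-in mtcars data. This is R rather than WEKA; the choice of dataset, predictor (wt) and the 0.5 cut-off are assumptions made for the example.

```r
# Predict a nominal quantity (am: 0 = automatic, 1 = manual)
# from car weight with logistic regression.
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probability of a manual transmission for each car.
prob_manual <- predict(fit, type = "response")
predicted <- as.integer(prob_manual > 0.5)

# Compare predictions with the true classes.
confusion <- table(actual = mtcars$am, predicted = predicted)
accuracy <- mean(predicted == mtcars$am)

print(confusion)
print(accuracy)
```

The accuracy here is measured on the training data, so it is optimistic; a held-out test set or cross-validation gives a fairer estimate.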

WEKA: Explorer: building “clusterers”
• WEKA contains “clusterers” for finding groups of similar instances in a dataset
• Implemented schemes are:
  • k-Means, EM, Cobweb, X-means, FarthestFirst
• Clusters can be visualized and compared to “true” clusters (if given)
• Evaluation based on log-likelihood if the clustering scheme produces a probability distribution
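Comparing found clusters with “true” clusters can be sketched in R with kmeans() on the built-in iris data. This is an R illustration, not WEKA; the seed, the number of clusters and nstart are choices made for the example.

```r
# Cluster the four numeric iris measurements into three groups.
set.seed(2014)
features <- iris[, 1:4]
fit <- kmeans(features, centers = 3, nstart = 10)

# Cross-tabulate cluster assignments against the known species,
# which play the role of the "true" clusters.
comparison <- table(cluster = fit$cluster, species = iris$Species)
print(comparison)
```

Cluster numbers are arbitrary labels, so what matters is how cleanly each row of the table concentrates in a single species column.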

Explorer: Finding associations
• WEKA contains an implementation of the Apriori algorithm for learning association rules
  • Works only with discrete data
• Can identify statistical dependencies between groups of attributes:
  • milk, butter ⇒ bread, eggs (with confidence 0.9 and support 2000)
• Apriori can compute all rules that have a given minimum support and exceed a given confidence
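Support and confidence for a single rule can be worked out by hand. The base-R sketch below counts them for the rule milk, butter ⇒ bread, eggs over five invented transactions. It shows only the two measures, not the Apriori search itself, and contains_all() is a helper written for this example.

```r
# Five made-up shopping baskets.
transactions <- list(
  c("milk", "butter", "bread", "eggs"),
  c("milk", "butter", "bread", "eggs"),
  c("milk", "butter", "bread"),
  c("milk", "eggs"),
  c("bread")
)

# TRUE for each basket that contains every item in `items`.
contains_all <- function(items) {
  sapply(transactions, function(basket) all(items %in% basket))
}

lhs <- c("milk", "butter")
rhs <- c("bread", "eggs")

# Support: number of baskets containing both sides of the rule.
support <- sum(contains_all(c(lhs, rhs)))

# Confidence: of the baskets containing the left side,
# the fraction that also contain the right side.
confidence <- support / sum(contains_all(lhs))

print(c(support = support, confidence = confidence))
```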

References and Resources
• References:
  • WEKA website: http://www.cs.waikato.ac.nz/~ml/weka/index.html
  • WEKA Tutorial:
    • Machine Learning with WEKA: A presentation demonstrating all graphical user interfaces (GUI) in Weka.
    • A presentation which explains how to use Weka for exploratory data mining.
  • WEKA Data Mining Book:
    • Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)
  • WEKA Wiki: http://weka.sourceforge.net/wiki/index.php/Main_Page

R Environment: R Studio

Downloading R / R Studio
• http://www.r-project.org/
• http://www.rstudio.com/ide/download/

What is R?
• An environment
  • R is an integrated suite of software facilities for data manipulation, calculation, and graphical display
    • Effective data handling and storage
    • Suite of operators for calculations on arrays
    • Large, coherent, integrated collection of intermediate tools for data analysis
    • Programming language and run-time environment
• Modeled on the S language, which was developed at Bell Labs
• GNU open source software
  • Under the terms of the Free Software Foundation's GNU General Public License
• Open source implementation of the S language (S-PLUS is the commercial implementation)
  • Well-developed, simple and effective programming language
• Highly extensible

R Features
• Software package designed for data analysis and graphical representation
• Interactive, but may also be used programmatically
• Platform independence
  • Compiles and runs on a wide variety of platforms: Unix-based systems, Windows and MacOS
• Free, open source code
• Engaged community
  • over 4,200 user-contributed packages
• Extendable
  • User-defined functions
  • > 4,000 packages available in the CRAN package repository
  • Supports extensions / add-ons (e.g., rApache)
  • Compatible with other languages (e.g., SQL, Perl, C)
  • Data import
    • Pre-processing data from different sources
• Scalability
  • Parallel R packages

R packages for DM
• Clustering
• Classification
• Association Rules
• Sequential patterns
• Time Series
• Statistics
• Graphics
• Data manipulation

Data Mining
• linear models (lm)
• generalized linear models (glm)
• generalized additive models (gam)
• linear mixed effects models (lme)
• quantile regression (rq)
• vector generalized additive models (vgam)
• lasso, ridge, and elastic net models (glmnet)
• non-linear models (nlm)
• non-linear mixed effects models (nlmer)
• linear discriminant analysis (lda)
• quadratic discriminant analysis (qda)
• trees (tree)
• random forests (randomForest)
• support vector machines (svm)
• neural networks (nnet)
• k-nearest neighbors (knn)
• k-means (kmeans)
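Most of the modeling functions listed above follow the same fit-then-predict pattern. A minimal sketch with lm(), which ships with base R, on the built-in mtcars data (the dataset and the new weights are chosen only for illustration):

```r
# Fit a linear model: fuel economy as a function of car weight.
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))

# Predict for new, made-up weights (in 1000 lbs).
new_cars <- data.frame(wt = c(2.5, 3.5))
predicted_mpg <- predict(fit, newdata = new_cars)
print(predicted_mpg)
```

Several of the other entries (for example glmnet or randomForest) live in add-on packages that must be installed before use.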

Big Data Options
• lapply-based parallelism
  • multicore library
  • snow library
• foreach-based parallelism
  • doMC backend
  • doSNOW backend
  • doMPI backend
• Map/Reduce- (Hadoop-) based parallelism
  • Hadoop streaming with R mappers/reducers
  • RHadoop (rmr, rhdfs, rhbase)
  • RHIPE
• Poor-man's parallelism
  • lots of Rs running
  • lots of input files
• Hands-off parallelism
  • OpenMP support compiled into R build
  • Dangerous!
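A minimal sketch of lapply-based parallelism. It uses the parallel package that ships with current R, which is an assumption on my part: the slide names the older multicore and snow libraries, whose interfaces parallel largely took over. The worker count of 2 and the toy function are arbitrary.

```r
library(parallel)

square <- function(x) x^2

# Serial version for reference.
serial_result <- lapply(1:8, square)

# Parallel version: start two worker processes, spread the
# inputs across them, then shut the workers down.
cl <- makeCluster(2)
parallel_result <- parLapply(cl, 1:8, square)
stopCluster(cl)

print(identical(serial_result, parallel_result))
```

For a function this cheap, starting the workers costs far more than the computation saves; the pattern pays off only when each call does substantial work.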

R Considerations/Limitations
• Command line interface
• Performance
• Memory limits
  • memory limits depend on the build (32-bit vs. 64-bit)
  • for a 32-bit build of R on Windows, the limit also depends on the underlying OS version
• Syntax “curiosities”
• Learning curve

R-Studio Overview
• http://www.rstudio.com/ide/download/
• R-Studio is an integrated development environment for writing and running R code.
• R-Studio runs in two ways:
  • Desktop version for Linux, Mac, Windows: single user, well suited to a laptop or desktop machine
  • Server version for Linux: allows a number of remote users to run R-Studio within a web browser; facilitates sharing of code and data among team members

General View of R-Studio
• Editor window
• Console: run R commands
• Project window: currently loaded workspace and history
• “Pop-up” multi-tab display: shows graphics, current directory, and loaded packages

The Fundamentals
• Launch R
• Quit R
  • q()
• Getting help
  • help(topic) or ?topic or help.start()
  • example(topic)
  • ??keyword
  • library(help = "package_name")

The Basics
• R environment commands
  • list objects
    • ls()
    • objects()
  • list files in current directory
    • list.files()
  • show current working directory
    • getwd()
  • set working directory
    • setwd()
  • remove objects
    • rm()
• Workspace versus console
  • Clear workspace
    • rm(list=ls())
  • Clear console
    • Ctrl+L
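The commands above can be tried together in a short session. This sketch uses a temporary directory so that it does not depend on any particular folder; the object names x and y are arbitrary.

```r
# Create two objects and list the workspace.
x <- 1:3
y <- "hello"
print(ls())

# Remember the current directory, move to a temporary one,
# list its files, and move back.
old_dir <- getwd()
setwd(tempdir())
print(list.files())
setwd(old_dir)

# Remove one object; the other stays in the workspace.
rm(x)
print(exists("x", inherits = FALSE))
print(exists("y", inherits = FALSE))
```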

The Basics (Naming Variables)
• Requirements
  • Case sensitive; names must start with a letter or '.'
  • Only letters, numbers, underscores and '.'s
• Reserved keywords
  • break, else, FALSE, for, function, if, in, Inf, NA, NaN, next, NULL, repeat, TRUE, while
• Names not limited in length
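One way to check a candidate name against these rules is base R's make.names(), which rewrites an invalid name into a valid one. A name is valid exactly when make.names() leaves it unchanged. The helper is_valid_name() is written for this example and is not part of R.

```r
# A name is syntactically valid if make.names() does not alter it.
is_valid_name <- function(candidate) make.names(candidate) == candidate

candidates <- c("my.age", ".hidden", "Age", "2fast", "my age", "if")
validity <- is_valid_name(candidates)
names(validity) <- candidates
print(validity)

# What R would turn the invalid names into.
print(make.names(c("2fast", "my age", "if")))
```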

The Basics
• All entities in R are called “objects”
  • arrays, vectors, matrices, functions, lists, data frames, factors
• Expressions vs. assignments
  • 10+10
  • my.age <- 23
  • my.age < - 23 (note the added space: this compares my.age with -23 instead of assigning)
  • age <- c(my.age, 14, 59, 32)
  • my.age == 40
• Data types
  • Numeric, Integer, Complex, Logical, Character
• Function call
  > mean(age)
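The difference between the expressions and assignments above is easiest to see by running them. The sketch below also shows one value of each data type; the variable names other than my.age and age are made up.

```r
my.age <- 23                   # assignment: stores 23
comparison <- my.age < - 23    # comparison: is 23 less than -23?
age <- c(my.age, 14, 59, 32)   # assignment of a vector
is_forty <- my.age == 40       # equality test, not assignment
average_age <- mean(age)       # function call

print(comparison)
print(is_forty)
print(average_age)

# One value of each basic data type.
types <- c(class(23), class(23L), class(1i), class(TRUE), class("a"))
print(types)
```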

Summary of Data Structures

                Linear     Rectangular
Homogeneous     Vectors    Matrices
Heterogeneous   Lists      Data Frames

• Vectors and matrices must contain a single data type
• Character type will trump numeric: values will be coerced to character
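The coercion rule can be checked directly. In this sketch the same three values are stored first in a vector, where they are all coerced to character, and then in a list, where each keeps its own type. The data frame contents are invented.

```r
# Homogeneous: mixing types in a vector coerces everything to character.
mixed_vector <- c(1, "a", TRUE)
print(mixed_vector)
print(class(mixed_vector))

# Heterogeneous: a list keeps each element's own type.
mixed_list <- list(1, "a", TRUE)
list_types <- sapply(mixed_list, class)
print(list_types)

# A data frame is rectangular and may have a different type per column.
people <- data.frame(name = c("Ann", "Bob"), age = c(23, 40),
                     stringsAsFactors = FALSE)
column_types <- sapply(people, class)
print(column_types)
```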

The Basics (Functions)
• Basic functions
  • mean(age)
  • sd(age)
  • sqrt(var(age))
  • TIP: to list all functions in the search path
    • sapply(search(), ls, all.names = TRUE)
• User-defined functions
  • score <- function(age) age * 10
• Using the correct functions for the given data type
  • apply() family
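A short sketch that puts these pieces together: two basic functions, a user-defined function, and two members of the apply() family. The function score() and the sample data are invented for the example.

```r
age <- c(23, 14, 59, 32)

# sd() is the square root of var(), so these two agree.
print(all.equal(sd(age), sqrt(var(age))))

# A user-defined function.
score <- function(x) x * 10

# sapply() calls the function once per element of a vector.
scores <- sapply(age, score)
print(scores)

# apply() works over the rows (1) or columns (2) of a matrix.
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)
print(row_sums)
```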

Function Components
writeLines(text = "text", con = stdout(), sep = "\n", useBytes = FALSE)
Example call: writeLines("146.6", "popRate.txt", sep = "\n")
• function name: writeLines
• parentheses: enclose the argument list
• commas: separate the arguments
• first argument: "146.6" (matched to text)
• second argument: "popRate.txt" (matched to con)
• optional argument: sep = "\n", which can also be passed by position: writeLines("146.6", "popRate.txt", "\n")
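The writeLines() call from the Function Components slide can be run as is. This sketch writes to a temporary file instead of popRate.txt (an assumption, so that nothing is left behind in the working directory) and reads the line back to confirm that positional and named arguments give the same result.

```r
path <- tempfile(fileext = ".txt")

# Arguments matched by position, plus one optional argument by name.
writeLines("146.6", path, sep = "\n")
by_position <- readLines(path)

# The same call with every argument named; sep keeps its default.
writeLines(text = "146.6", con = path)
by_name <- readLines(path)

print(by_position)
print(identical(by_position, by_name))
```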
Importing Data/Exporting Data
• Flat files
  • Import:
    > AHW <- read.csv("AHW_1.csv", header=TRUE)
    > weatherdata <- read.table(file="C:/work/DM1/weather.csv", header=TRUE, sep=",")
    > USTemps <- read.table(file=file.choose(), header=TRUE)
  • Export:
    > write.csv(AHW, file="New_AHW.csv", row.names=FALSE)
• Databases
  • Import
    > connection <- dbConnect(driver, user, password, host, dbname)
    > AHW <- dbGetQuery(connection, "SELECT * FROM AHW")
  • Export
    > connection <- dbConnect(driver, user, password, host, dbname)
    > dbWriteTable(connection, "AHW", AHW)
• R objects
  • Import: > load("AHW.Rdata")
  • Export: > save(AHW, file="New_AHW.Rdata")
• Web
  > connection <- url("http://pace.sdsc.edu/sites/default/bootcamp/images/AHW_1.csv")
  > AHW <- read.csv(connection, header=TRUE)
• Plots
  > png(filename="C:/R/figure.png", height=295, width=300, bg="white")
  > pdf(file="C:/R/figure.pdf", height=3.5, width=5)
  > dev.off()  # turn off device driver (to flush output to png/pdf)
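A round trip through the flat-file, R-object and plot routes, sketched with temporary files so that it runs anywhere. The data frame and file names are invented; the database and web routes are left out because they need a live server.

```r
people <- data.frame(name = c("Ann", "Bob"), height = c(1.52, 1.75),
                     stringsAsFactors = FALSE)

# Flat file: export to CSV, then import it again.
csv_path <- tempfile(fileext = ".csv")
write.csv(people, file = csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

# R object: save, remove from the workspace, then load it back.
rdata_path <- tempfile(fileext = ".Rdata")
saved_copy <- people
save(saved_copy, file = rdata_path)
rm(saved_copy)
load(rdata_path)  # recreates saved_copy

# Plot: open a PDF device, draw, and close the device to flush the file.
pdf_path <- tempfile(fileext = ".pdf")
pdf(file = pdf_path, height = 3.5, width = 5)
plot(people$height, main = "Height")
dev.off()

print(all.equal(people, from_csv))
print(file.exists(pdf_path))
```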

Loading a dataset into R-Studio (simple text file)
• Name of the data frame to be created with the imported data
• Options for parsing the text data into fields and values
• Preview of how the data frame will look once the data are imported

Extending R
• http://cran.r-project.org/web/packages/
• Install a package
  • from command line
    > install.packages("name_of_package")
  • from GUI
    • Packages & Data > Package Installer
• Load library (to use installed package)
  • > library(name_of_package)
  • Example
    > library(markdown)
• Use library function
  • > function_name(parameters)
  • Example
    > markdownToHTML("example.md")
http://www.r-bloggers.com/dont-r-alone-a-guide-to-tools-for-collaboration-with-r/
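A common pattern is to install an add-on package only when it is missing and then load it. The helper ensure_package() below is hypothetical (written for this sketch, not part of R). It is demonstrated on the tools package, which ships with R, so that nothing is downloaded; in practice the name would be an add-on package such as markdown.

```r
# Hypothetical helper: install a package only if it is not available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "tools" is bundled with R, so no installation is triggered here.
is_available <- ensure_package("tools")

# Load the package, then call one of its functions.
library(tools)
extension <- file_ext("example.md")

# The same function can be called without library() by using ::
same_extension <- tools::file_ext("example.md")

print(extension)
```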
More Information
• The R manuals
  • http://www.stat.berkeley.edu/~spector/R.pdf
• An Introduction to R
  • http://cran.r-project.org/doc/manuals/R-intro.html
  • http://tryr.codeschool.com/
• Books

Other Resources
• IRC: /server irc.freenode.net, then /join #R

The End