Intro to Data Mining Webinar
Data Mining webinar for all
![Page 1: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/1.jpg)
Revolution Confidential
Introduction to R for Data Mining
2012 Spring Webinar Series
Joseph B. Rickert, Revolution Analytics
June 5, 2012
![Page 2: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/2.jpg)
Goals for Today’s Webinar
To convince you that:
R is a serious platform for data mining
Seriously, it is not difficult to learn enough R to do some serious data mining
Revolution R Enterprise is the platform for serious data mining
![Page 3: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/3.jpg)
Data Mining
Applications
Credit Scoring
Fraud Detection
Ad Optimization
Targeted Marketing
Gene Detection
Recommendation systems
Social Networks
Actions
Acquire Data
Prepare
Classify
Predict
Visualize
Optimize
Interpret
Algorithms
CART
Random Forests
SVM
KMeans
Hierarchical clustering
Ensemble Techniques
![Page 4: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/4.jpg)
Recent KDnuggets poll suggests so are a lot of other serious data miners
What Analytics, Data mining, Big Data software you used in the past 12 months for a real project (not just evaluation) [798 voters]
| Software (voters) | % users in 2012 | % users in 2011 |
|---|---|---|
| R (245) | 30.7% | 23.3% |
| Excel (238) | 29.8% | 21.8% |
| Rapid-I RapidMiner (213) | 26.7% | 27.7% |
| KNIME (174) | 21.8% | 12.1% |
| Weka / Pentaho (118) | 14.8% | 11.8% |
| StatSoft Statistica (112) | 14.0% | 8.5% |
| SAS (101) | 12.7% | 13.6% |
| Rapid-I RapidAnalytics (83) | 10.4% | Not asked in 2011 |
| MATLAB (80) | 10.0% | 7.2% |
| IBM SPSS Statistics (62) | 7.8% | 7.2% |
| IBM SPSS Modeler (54) | 6.8% | 8.3% |
| SAS Enterprise Miner (46) | 5.8% | 7.1% |
![Page 5: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/5.jpg)
WHAT DOES IT MEAN TO LEARN R?
Learning R
![Page 6: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/6.jpg)
What does it mean to learn French?
To read a Menu
To get around Paris on the Metro
To carry on a conversation
![Page 7: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/7.jpg)
Learning R
Figure: Levels of R skill plotted against hours of use (10 to 10,000), the Malcolm Gladwell “Outliers” scale
Levels of R skill, from top to bottom: R developer, R contributor, R programmer, R user, R aware
Activities along the scale: use a GUI, use R functions, write functions, write an R package, write production-level code
![Page 8: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/8.jpg)
THE STRUCTURE OF R FACILITATES LEARNING
Productive from the get-go!
![Page 9: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/9.jpg)
R is set up to compute functions on data
lm <- function(x,y){. . . }
lm.model
lm.model$assign
lm.model$coefficients
lm.model$df.residual
lm.model$effects
lm.model$fitted.values
...
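As a minimal sketch of the idea on this slide, the following fits a linear model and looks at a few of the components of the resulting object. The built-in `cars` data set and the formula `dist ~ speed` are assumptions chosen for illustration; the slide does not name a data set.

```r
# Assumption: the built-in 'cars' data set stands in for real data.
lm.model <- lm(dist ~ speed, data = cars)

# The fitted model is a list-like object; its components are reached with `$`.
names(lm.model)                 # lists the available components
lm.model$coefficients           # intercept and slope
lm.model$df.residual            # residual degrees of freedom
head(lm.model$fitted.values)    # first few fitted values
```

Calling `names()` on a model object is a quick way to discover what a function has returned.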
![Page 10: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/10.jpg)
A little knowledge goes a long way in R
R’s functional design facilitates performing small tasks
For the most part, the output of a function depends only on the values of its arguments
Calling a function multiple times with the same values of its arguments will produce the same result each time
Minimal side effects means it is much easier to understand and predict the behavior of a program
The trick is knowing which functions to call
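The point about minimal side effects can be sketched with a small example. The function `add.offset` and the vector `v` are hypothetical names introduced only for this illustration.

```r
# Hypothetical helper: changes the first element of its argument.
add.offset <- function(x, offset) {
  x[1] <- x[1] + offset   # modifies the function's local copy only
  x
}

v <- c(1, 2, 3)
r1 <- add.offset(v, 10)
r2 <- add.offset(v, 10)

identical(r1, r2)   # same arguments give the same result
v                   # the caller's vector is unchanged
```

Because the function works on its own copy of `x`, the only way it affects the rest of the program is through its return value.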
![Page 11: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/11.jpg)
Basic Machine Learning Functions

| | Function | Library | Description |
|---|---|---|---|
| Cluster | hclust | stats | Hierarchical cluster analysis |
| | kmeans | stats | Kmeans clustering |
| Classifiers | glm | stats | Logistic Regression |
| | rpart | rpart | Recursive partitioning and regression trees |
| | ksvm | kernlab | Support Vector Machine |
| Ensemble | ada | ada | Stochastic boosting |
| | randomForest | randomForest | Random Forests classification and regression |
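A minimal sketch of the three `stats` functions from the table, which need no extra packages. The built-in `iris` and `mtcars` data sets, the choice of three clusters, and the model formula are assumptions made for illustration.

```r
# Assumption: built-in data sets are used for illustration only.
features <- iris[, 1:4]

# K-means clustering (random starting centers, so fix the seed)
set.seed(42)
km <- kmeans(features, centers = 3)
table(km$cluster, iris$Species)

# Hierarchical clustering on Euclidean distances, cut into three groups
hc <- hclust(dist(features))
hc.groups <- cutree(hc, k = 3)
table(hc.groups, iris$Species)

# Logistic regression: glm with a binomial family
logit.model <- glm(am ~ wt, data = mtcars, family = binomial)
coef(logit.model)
```

The same pattern (call a function on a data frame, then inspect the returned object) carries over to `rpart`, `ksvm`, `ada` and `randomForest` once their packages are installed.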
![Page 12: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/12.jpg)
Noteworthy Data Mining Packages

| Package | Comment |
|---|---|
| rattle | A very intuitive GUI for data mining that produces useful R code |
| caret | Well organized and remarkably complete collection of functions to facilitate model building for regression and classification problems |
![Page 13: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/13.jpg)
TIME TO RUN SOME CODE
Doing a lot with a little R
![Page 14: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/14.jpg)
Scripts to run

| Script | Some key functions |
|---|---|
| 0 Setup | Load libraries |
| 1 Explore weather data | read.csv, plot |
| 2 Run clustering algorithms | kmeans, hclust |
| 3 Basic decision tree | rpart |
| 4 Boosted Tree | ada |
| 5 Random Forest | randomForest |
| 6 Support Vector Machine | randomForest, varImpPlot |
| 7 Big Data Mortgage Default model | rxLogit, rxKmeans |
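The webinar scripts themselves are not reproduced in this transcript. As a rough sketch of what a basic decision tree step can look like, the following uses `rpart` on the built-in `iris` data set; the data set and formula are assumptions, not the weather data used in the webinar.

```r
# rpart is a recommended package that ships with standard R distributions.
library(rpart)

# Assumption: 'iris' stands in for the webinar's weather data.
tree.model <- rpart(Species ~ ., data = iris, method = "class")
print(tree.model)   # text description of the splits

# Predict classes on the training data and compute the share predicted correctly
pred <- predict(tree.model, iris, type = "class")
mean(pred == iris$Species)
```

Accuracy measured on the training data is optimistic; a held-out test set gives a fairer estimate.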
![Page 15: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/15.jpg)
Big Data and R
There are some challenges:
All of your data and model code must fit into memory
Big data sets as well as big models (lots of variables) can run out of memory
Parallel computation might be necessary for models to run in a reasonable time
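A back-of-the-envelope calculation shows why memory becomes a constraint. The row and column counts below are hypothetical; the 8 bytes per value is the size of an R double.

```r
# Assumption: a hypothetical data set of 100 million rows and 20 numeric columns.
n.rows <- 1e8
n.cols <- 20
bytes.per.double <- 8

# Estimated size in GiB if every column is stored as double precision
est.gb <- n.rows * n.cols * bytes.per.double / 2^30
est.gb   # about 14.9

# The estimate can be checked on a small scale with object.size()
x <- numeric(1e6)
object.size(x)   # roughly 8 MB for one million doubles
```

Model fitting typically needs working copies on top of the raw data, so the practical limit is reached well before the data alone fills memory.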
![Page 16: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/16.jpg)
RevoScaleR in Revolution R Enterprise
Can help in a number of ways:
Manipulate large data sets, perhaps aggregating data so that it will fit in memory
For example, boiling down time-stamped data like a web log to form a time series that will fit in memory
Run RevoScaleR functions directly on big data sets
Run R functions in parallel
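The web log example can be sketched in base R on a small simulated log. This illustrates the aggregation idea only and does not use RevoScaleR; the simulated timestamps and the hourly granularity are assumptions.

```r
# Assumption: simulated request timestamps stand in for a real web log.
set.seed(1)
start <- as.POSIXct("2012-06-05 00:00:00", tz = "UTC")
weblog <- data.frame(timestamp = start + sort(runif(10000, 0, 24 * 3600)))

# Boil the individual events down to one count per hour
weblog$hour <- cut(weblog$timestamp, breaks = "hour")
hourly.counts <- as.vector(table(weblog$hour))

# The result is a short time series, however many log lines there were
hits <- ts(hourly.counts, frequency = 24)
length(hits)
sum(hits)
```

On data too large for memory the same aggregation would have to be done in pieces, which is the kind of task the slide assigns to RevoScaleR.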
![Page 17: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/17.jpg)
Top RevoScaleR Functions for Data Mining
parallel external memory algorithms

| Task | RevoScaleR function |
|---|---|
| Data processing | rxDataStep |
| Descriptive Statistics | rxSummary |
| Tables and cubes | rxCube, rxCrossTabs |
| Correlations / covariance | rxCovCor, rxCor, rxCov, rxSSCP |
| Linear Models | rxLinMod |
| Logistic regressions | rxLogit |
| Generalized linear models | rxGlm |
| K means clustering | rxKmeans |
| Predictions (scoring) | rxPredict |
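The general idea behind an external memory algorithm is to process the data one chunk at a time and keep only small running totals in memory. The following is a base-R sketch of that idea for computing a mean; it is not how RevoScaleR is implemented, and the temporary file, chunk size and variable names are assumptions.

```r
# Assumption: a temporary CSV file stands in for a data set too large for memory.
csv.path <- tempfile(fileext = ".csv")
set.seed(7)
full <- data.frame(x = rnorm(1000))
write.csv(full, csv.path, row.names = FALSE)

chunk.size <- 100
con <- file(csv.path, open = "r")
invisible(readLines(con, n = 1))   # skip the header line

# Running totals are the only state kept between chunks
total <- 0
count <- 0
repeat {
  lines <- readLines(con, n = chunk.size)
  if (length(lines) == 0) break
  values <- as.numeric(lines)
  total <- total + sum(values)
  count <- count + length(values)
}
close(con)

chunked.mean <- total / count
all.equal(chunked.mean, mean(full$x))
```

Statistics that can be built from such running totals (sums, counts, cross-products) are the ones that lend themselves to chunked and parallel computation.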
![Page 18: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/18.jpg)
WHERE TO GO FROM HERE?
More than code, R is a community
![Page 19: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/19.jpg)
Finding your way around the R world
Machine Learning, Data Mining, Visualization
Finding Packages: Task Views, crantastic.org
Blogs: Revolutions, R-Bloggers, Quick-R
Getting Help: StackOverflow, @RLangTip, Inside-R, www.rseek.org
Finding R People: User Groups worldwide, #rstats
Word Cloud for @inside_R
![Page 20: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/20.jpg)
Look at some more sophisticated examples
Thomson Nguyen on the Heritage Health Prize
Shannon Terry & Ben Ogorek (Nationwide Insurance): A Direct Marketing In-Flight Forecasting System
Jeffrey Breen: Mining Twitter for Airline Consumer Sentiment
Joe Rothermich: Alternative Data Sources for Measuring Market Sentiment and Events (Using R)
![Page 21: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/21.jpg)
Revolution Analytics Training
http://www.revolutionanalytics.com/products/training/
![Page 22: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/22.jpg)
References
![Page 23: Intro to Data Mining Webinar](https://reader030.vdocuments.site/reader030/viewer/2022033010/577cc16a1a28aba71192f8a8/html5/thumbnails/23.jpg)