Mahout and Distributed Machine Learning 101
DESCRIPTION
Brief introduction to Mahout and distributed machine learning, presented to Orlando Data Science
TRANSCRIPT
Introduction to machine learning with Mahout
John Ternent
@jaternent
Orlando Data Science – www.orlandods.com
May 13, 2014
Welcome!
Updates
Social Media
Facebook.com/orlandodata Twitter.com/orlandodata LinkedIn
OrlandoDS.com
Social Network Forum Articles and Content And More
Send articles to: [email protected]
Orlando Wiki
Completely Open Aggregate Learning Resources! Go NUTS
May 28th Event
Full Sail, UCF, and Florida Polytechnic
Submit Your Questions! @orlandodata
Member Survey
Need n=30!!! OrlandoDS.com/member-survey OR: find it in our past meetup
announcements
Learn Hadoop
First Class: June 3rd.
Location: Here
Future Plans
Establish Non-Profit
Increase Global Following
Become Strong Networking and Education Resource for YOU
A (very) little bit about me…
Consultant (Management & Technology)
Open Source Evangelist
Full-spectrum data nerd
A little about you!
Rate yourself (1 – 10) on Mahout
Rate yourself (1 – 10) on Machine Learning/Data Mining
Rate yourself (1 – 10) on Big Data/Hadoop
Please wait… optimizing presentation…
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
-- Tom M. Mitchell, 1997
Data mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to … an economic advantage.
-- Ian H. Witten & Eibe Frank, 2005
If you’re in academia, you call it “machine learning.” If you’re in business, you call it “data mining.”
-- Mark Hall
I create or improve general-purpose algorithms for machine learning
I use multiple machine learning algorithms for practical data discovery
Source : xkcd
Machine Learning Uses
Clustering
Classification
Recommendation
Machine Learning Algorithms
Regression
K-means Clustering
K-NN
CART
Neural Networks
Support Vector Machines
Association Rules
Principal Component Analysis
Singular Value Decomposition
Ensemble Methods
Naïve Bayes
…
Real-World Applications
Recommender Systems
Image recognition
Signal Processing
Propensity to buy/churn
Fraud analysis
Text analytics
Spam filtering
Forecasting methods
Revenue management
…
The Problem … and Opportunity
Big Data™
“If you have to choose, having more data does indeed trump a better algorithm. However, what is better than just having more data on its own is also having an algorithm that annotates the data with new linkages and statistics which alter the underlying data asset.”
-- Omar Tawakol
Weka Explorer can handle ~1M instances, 25 attributes (50 MB file)
-- Ian Witten
Potential Solutions
Expand RAM
Use incremental algorithms
Use distributable algorithms
Scale Up
Scale Out
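As one illustration of the "incremental algorithms" option above, here is a minimal sketch of a running mean and variance (Welford's method) that sees each value once and never holds the full dataset in memory. The class name RunningStats and the sample stream are hypothetical, chosen only for this example.

```python
class RunningStats:
    """Incrementally track mean and sample variance of a stream (Welford's method)."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        # Fold one new value into the running statistics.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; defined as 0.0 until at least two values are seen.
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # hypothetical data stream
    stats.update(value)
```

The memory footprint stays constant no matter how long the stream is, which is the point of the incremental approach.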
Hadoop in 30 seconds
[Diagram: Input splits → Map (K,V) → Shuffle / Sort → Reduce → Output]
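The map, shuffle/sort, and reduce stages can be sketched on a single machine in plain Python. This is only an illustration of the data flow using word count as the task, not Hadoop's API; the function names and the input splits are hypothetical.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (key, value) pair of (word, 1) for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle_sort(pairs):
    """Shuffle / Sort: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single output record."""
    return key, sum(values)


def word_count(splits):
    # Each split would be handled by a separate mapper on a real cluster.
    mapped = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(key, values) for key, values in shuffle_sort(mapped))


counts = word_count(["big data big", "data mining", "big mining"])
```

Because each map call sees only its own split and each reduce call sees only one key, both stages can be spread across many machines.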
Finally -- Mahout
A Java-based library of machine learning algorithms designed to support distributed processing
Initially on MapReduce, now leaning heavily towards Spark
Primarily focused on Recommenders, Clustering, and Classification spaces.
Running Mahout
Locally – download the Mahout distribution.
bin/mahout (under the distribution directory) is the wrapper script; run without arguments, it lists the available example programs.
Lots of tools are included to convert data into vector formats and pre-process text; worth a look.
Amazon EC2 – configure the stack from scratch on EC2 servers.
Amazon EMR – quicker start; a lot of the build is already optimized for MapReduce jobs. Just add Mahout as a custom JAR and pass the script as a parameter.
Running Recommenders
Multiple Recommender Algorithms
User-based
Item-based
A Recommender Needs:
DataModel (e.g. FileDataModel)
Similarity driver (e.g. PearsonCorrelationSimilarity)
Neighborhood (NearestNUserNeighborhood, ThresholdUserNeighborhood)
Recommender
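To make the four pieces concrete, here is a rough plain-Python sketch of a user-based recommender: a data model (a dict of ratings), a Pearson correlation similarity, a nearest-N neighborhood, and a recommender that scores unseen items by similarity-weighted average. It mirrors the roles of the Mahout classes named above but is not Mahout's implementation; the ratings and all function names are hypothetical.

```python
from math import sqrt

# Data model (hypothetical): user -> {item: preference value}
ratings = {
    "alice": {"a": 5.0, "b": 3.0, "c": 4.0},
    "bob": {"a": 4.0, "b": 2.0, "c": 3.0, "d": 5.0},
    "carol": {"a": 1.0, "b": 5.0, "c": 2.0, "d": 1.0},
}


def pearson(u, v):
    """Similarity: Pearson correlation over the items both users rated."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if len(common) < 2:
        return 0.0
    xs = [ratings[u][i] for i in common]
    ys = [ratings[v][i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def nearest_n(user, n):
    """Neighborhood: up to n most similar users with positive similarity."""
    scored = sorted(
        ((pearson(user, other), other) for other in ratings if other != user),
        reverse=True,
    )
    return [(other, sim) for sim, other in scored[:n] if sim > 0]


def recommend(user, n_neighbors=2, how_many=1):
    """Recommender: similarity-weighted average of neighbors' ratings for unseen items."""
    totals, weights = {}, {}
    for neighbor, sim in nearest_n(user, n_neighbors):
        for item, pref in ratings[neighbor].items():
            if item not in ratings[user]:
                totals[item] = totals.get(item, 0.0) + sim * pref
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((totals[i] / weights[i], i) for i in totals), reverse=True)
    return [(item, estimate) for estimate, item in ranked[:how_many]]


recs = recommend("alice")
```

In this toy data only bob correlates positively with alice, so the single recommendation comes from his ratings.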
Running Recommenders
Tip : If you have no preference values, there are Boolean equivalents of the recommender classes
Evaluate user vs. item similarities
Example
Clustering Algorithms
To cluster you need:
Location in n-dimensional space
Distance metric
Threshold
K-means
Canopy
Dirichlet
Fuzzy K-means
Spectral Clustering
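The three ingredients listed above map directly onto a basic k-means loop: points as coordinates, Euclidean distance as the metric, and a threshold that stops iteration once centroids barely move. The sketch below is a single-machine illustration, not Mahout's distributed version. Reading "threshold" as a convergence threshold is an assumption, and the points and starting centroids are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, centroids, threshold=1e-4, max_iter=100):
    """Cluster points around the given starting centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        # Stop once no centroid moved farther than the threshold.
        shift = max(dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < threshold:
            break
    return centroids, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
```

Starting centroids are passed in explicitly here to keep the run deterministic; in practice they are usually chosen at random or seeded by another method.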
Clustering
Clustering Text
Identify k topics in a document corpus
Requires conversion of text into vectors
Lucene utilities are available to vectorize text and apply stop-word or weighting criteria.
seqdirectory – from a directory of text files
lucene.vector – from a Lucene index
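What "conversion of text into vector" means can be shown with a small TF-IDF sketch: drop stop words, build a vocabulary, and weight each term by how often it appears in a document and how rare it is across the corpus. This is a plain-Python illustration, not what the Mahout or Lucene tools do internally. The stop-word list, the documents, and the specific weighting formula are assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return the vocabulary and one TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({token for doc in tokenized for token in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(token for doc in tokenized for token in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)  # term frequency within this document
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


vocab, vectors = tfidf_vectors(["the cat sat", "the dog sat", "a cat and a dog"])
```

Each document becomes a point in a space with one dimension per vocabulary term, which is the form a clustering algorithm needs.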
Classifiers
NaïveBayes
RandomForests
LogisticRegression (SGD)
HiddenMarkov
Example : 20 Newsgroups
Sidebar : Risks of Big Data Unsupervised Learning