Mahout and Distributed Machine Learning 101

Introduction to machine learning with Mahout. John Ternent, @jaternent. Orlando Data Science – www.orlandods.com. May 13, 2014

Upload: john-ternent

Post on 15-Jan-2015


DESCRIPTION

Brief introduction to Mahout and distributed machine learning presented to Orlando Data Science

TRANSCRIPT

Page 1: Mahout and Distributed Machine Learning 101

Introduction to machine learning with Mahout

John Ternent

@jaternent

Orlando Data Science – www.orlandods.com

May 13, 2014

Page 2: Mahout and Distributed Machine Learning 101
Page 3: Mahout and Distributed Machine Learning 101

Welcome!

Updates

Page 4: Mahout and Distributed Machine Learning 101

Social Media

Facebook.com/orlandodata
Twitter.com/orlandodata
LinkedIn

Page 5: Mahout and Distributed Machine Learning 101

OrlandoDS.com

Social Network
Forum
Articles and Content
And More

Send articles to: [email protected]

Page 6: Mahout and Distributed Machine Learning 101

Orlando Wiki

Completely Open
Aggregate Learning Resources!
Go NUTS

Page 7: Mahout and Distributed Machine Learning 101

May 28th Event

Full Sail, UCF, and Florida Polytechnic

Submit Your Questions! @orlandodata

Page 8: Mahout and Distributed Machine Learning 101

Member Survey

Need n=30!!!
OrlandoDS.com/member-survey
OR: find it in our past meetup announcements

Page 9: Mahout and Distributed Machine Learning 101

Learn Hadoop

First Class: June 3rd.

Location: Here

Page 10: Mahout and Distributed Machine Learning 101

Future Plans

Establish Non-Profit
Increase Global Following
Become Strong Networking and Education Resource for YOU

Page 11: Mahout and Distributed Machine Learning 101

A (very) little bit about me…

Consultant (Management & Technology)
Open Source Evangelist
Full-spectrum data nerd

Page 12: Mahout and Distributed Machine Learning 101

A little about you!

Rate yourself (1 – 10) on Mahout
Rate yourself (1 – 10) on Machine Learning/Data Mining
Rate yourself (1 – 10) on Big Data/Hadoop

Please wait… optimizing presentation…

Page 13: Mahout and Distributed Machine Learning 101

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

-- Tom M. Mitchell, 1997

Data mining is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic. The patterns discovered must be meaningful in that they lead to … an economic advantage.

-- Ian H. Witten & Eibe Frank, 2005

Page 14: Mahout and Distributed Machine Learning 101

If you’re in academia, you call it “machine learning.” If you’re in business, you call it “data mining.”

-- Mark Hall

I create or improve general purpose algorithms for machine learning

I use multiple machine learning algorithms for practical data discovery

Source: xkcd

Page 15: Mahout and Distributed Machine Learning 101

Machine Learning Uses

Clustering

Classification

Recommendation

Page 16: Mahout and Distributed Machine Learning 101

Machine Learning Algorithms

Regression
K-means Clustering
K-NN
CART
Neural Networks
Support Vector Machines
Association Rules
Principal Component Analysis
Singular Value Decomposition
Ensemble Methods
Naïve Bayes
…

Page 17: Mahout and Distributed Machine Learning 101

Real-World Applications

Recommender Systems
Image recognition
Signal Processing
Propensity to buy/churn
Fraud analysis
Text analytics
Spam filtering
Forecasting methods
Revenue management
…

Page 18: Mahout and Distributed Machine Learning 101

The Problem … and Opportunity

Big Data™

“If you have to choose, having more data does indeed trump a better algorithm. However, what is better than just having more data on its own is also having an algorithm that annotates the data with new linkages and statistics which alter the underlying data asset.”

-- Omar Tawakol

Weka Explorer can handle ~1M instances, 25 attributes (50 MB file)

-- Ian Witten

Page 19: Mahout and Distributed Machine Learning 101

Potential Solutions

Expand RAM
Use incremental algorithms
Use distributable algorithms

Scale Up

Scale Out
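
As an illustration of what "incremental algorithm" means here, the following is a minimal sketch of Welford's online method for mean and variance. It sees each value once and keeps only three numbers in memory, so the data set never has to fit in RAM. The class name `RunningStats` and the sample values are made up for this example; this is not a Mahout or Weka API.

```python
class RunningStats:
    """Welford's online algorithm: mean and variance in a single pass."""

    def __init__(self):
        self.n = 0          # number of values seen so far
        self.mean = 0.0     # running mean
        self._m2 = 0.0      # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; returns 0.0 by convention for fewer than two values."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # stand-in for a data stream
    stats.update(value)
print(stats.mean, stats.variance)
```

The same pattern (update a small state per record) is what lets an algorithm process data that is larger than memory without distributing it.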

Page 20: Mahout and Distributed Machine Learning 101

Hadoop in 30 seconds

[Diagram: several Input splits feed Map (K,V) tasks; their output passes through Shuffle / Sort to Reduce tasks, each of which writes an Output]
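
The Input, Map (K,V), Shuffle / Sort, Reduce, Output flow on this slide can be imitated in a few lines of single-process Python. This is only a sketch of the programming model using the classic word-count example; it does not use Hadoop, and the function names (`map_phase`, `shuffle_sort`, `reduce_phase`, `run_job`) and input strings are invented for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (key, value) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle_sort(pairs):
    """Shuffle / Sort: group all values by key and order the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single output record."""
    return key, sum(values)


def run_job(records):
    """Run map over every record, then shuffle/sort, then reduce each group."""
    mapped = [pair for record in records for pair in map_phase(record)]
    return [reduce_phase(key, values) for key, values in shuffle_sort(mapped)]


inputs = ["big data big models", "more data beats better models"]
counts = run_job(inputs)
print(counts)
```

In a real cluster each call to `map_phase` and `reduce_phase` could run on a different machine, which is why the model only allows them to communicate through key/value pairs.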

Page 21: Mahout and Distributed Machine Learning 101

Finally -- Mahout

A Java-based library of machine learning algorithms designed to support distributed processing

Initially on MapReduce, now leaning heavily towards Spark

Primarily focused on Recommenders, Clustering, and Classification spaces.

Page 22: Mahout and Distributed Machine Learning 101

Running Mahout

Locally: download the Mahout distribution.

/bin/mahout is the wrapper script; by default it lists all the example programs available.

Lots of tools are included to convert data into vector formats and pre-process text. They are worth a look.

Amazon EC2: configure the stack from scratch on EC2 servers.

Amazon EMR: quicker start, since much of the build is already optimized for MapReduce jobs. Just add Mahout as a custom JAR and pass the script as a parameter.

Page 23: Mahout and Distributed Machine Learning 101

Running Recommenders

Multiple Recommender Algorithms
User-based
Item-based

A Recommender Needs:
DataModel (e.g. FileDataModel)
Similarity driver (e.g. PearsonCorrelationSimilarity)
Neighborhood (e.g. NearestNUserNeighborhood, ThresholdUserNeighborhood)
Recommender
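
To make the four pieces concrete, here is a small user-based recommender written in plain Python. It is a conceptual sketch only, not Mahout's Java API: the dictionary `DATA` stands in for the data model, `pearson` for the similarity, `nearest_n` for the neighborhood, and `recommend` for the recommender. All names and ratings are hypothetical.

```python
from math import sqrt

# "DataModel": user -> {item: preference}. Values are made up for illustration.
DATA = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 4.0, "B": 2.0, "C": 5.0, "D": 5.0},
    "carol": {"A": 1.0, "B": 5.0, "D": 2.0},
    "dave": {"A": 5.0, "B": 3.0, "C": 5.0, "E": 4.0},
}


def pearson(data, u, v):
    """Similarity: Pearson correlation over the items both users rated."""
    common = sorted(set(data[u]) & set(data[v]))
    if len(common) < 2:
        return 0.0
    xs = [data[u][i] for i in common]
    ys = [data[v][i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def nearest_n(data, user, n):
    """Neighborhood: the n most similar users with positive correlation."""
    scored = [(pearson(data, user, other), other) for other in data if other != user]
    return sorted((s, o) for s, o in scored if s > 0)[::-1][:n]


def recommend(data, user, n_neighbors=2, how_many=3):
    """Recommender: similarity-weighted average of neighbors' ratings for unseen items."""
    totals, weights = {}, {}
    for sim, other in nearest_n(data, user, n_neighbors):
        for item, pref in data[other].items():
            if item in data[user]:
                continue  # skip items the user has already rated
            totals[item] = totals.get(item, 0.0) + sim * pref
            weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((totals[i] / weights[i], i) for i in totals), reverse=True)
    return [(item, score) for score, item in ranked[:how_many]]


recs = recommend(DATA, "alice")
print(recs)
```

Swapping `pearson` for another similarity function, or `nearest_n` for a threshold-based neighborhood, changes the behavior without touching the other parts, which is the point of keeping the four components separate.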

Page 24: Mahout and Distributed Machine Learning 101

Running Recommenders

Tip: If you have no preference values (only whether a user interacted with an item), there are Boolean equivalents of the recommender classes

Evaluate user vs. item similarities

Example
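
With Boolean data there are no ratings to correlate, so a set-overlap measure is a common choice. The sketch below uses the Tanimoto (Jaccard) coefficient and computes it both ways: between two users (comparing the items they touched) and between two items (comparing the users who touched them). The `SEEN` data and function names are hypothetical, and this is not Mahout code.

```python
def tanimoto(a, b):
    """Tanimoto coefficient: |A & B| / |A | B|, or 0.0 when both sets are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# user -> set of items the user interacted with (made-up data, no ratings)
SEEN = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B", "D"},
    "u3": {"B", "C", "D"},
    "u4": {"A", "D"},
}


def users_by_item(seen):
    """Invert the data: item -> set of users who interacted with it."""
    inverted = {}
    for user, items in seen.items():
        for item in items:
            inverted.setdefault(item, set()).add(user)
    return inverted


# User-based view: how alike are u1 and u2?
user_sim = tanimoto(SEEN["u1"], SEEN["u2"])

# Item-based view: how alike are items B and C?
item_users = users_by_item(SEEN)
item_sim = tanimoto(item_users["B"], item_users["C"])

print(user_sim, item_sim)
```

Which view works better depends on the data: item similarities tend to be more stable when there are far more users than items, which is one reason to evaluate both.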

Page 25: Mahout and Distributed Machine Learning 101

Clustering Algorithms

To cluster you need:
Location in n-dimensional space
Distance metric
Threshold

K-means
Canopy
Dirichlet
Fuzzy K-means
Spectral Clustering
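
The three ingredients map directly onto a basic k-means loop. In this sketch the points are tuples (location), `euclidean` is the distance metric, and `threshold` is read as a convergence threshold on centroid movement, which is an assumption about what the slide means. It is a single-machine illustration with invented data, not Mahout's distributed implementation.

```python
from math import sqrt


def euclidean(p, q):
    """Distance metric: straight-line distance between two points."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, centroids, threshold=1e-4, max_iter=100):
    """Alternate assignment and centroid update until centroids move less than threshold."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        shift = max(euclidean(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < threshold:
            break
    return centroids, clusters


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
initial = [(1.0, 1.0), (9.0, 9.5)]  # fixed starting centroids keep the run deterministic
centroids, clusters = kmeans(points, initial)
print(centroids)
```

Choosing the starting centroids is the fragile part of k-means; a fixed choice is used here only so the example gives the same answer every time.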

Page 26: Mahout and Distributed Machine Learning 101

Clustering

Page 27: Mahout and Distributed Machine Learning 101

Clustering Text

Identify k topics in a document corpus
Requires conversion of text into vectors
Lucene utilities are available to vectorize text and apply stop-word or weighting criteria.

seqdirectory – from a directory of text files

lucene.vector – from a Lucene index
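
To show what "conversion of text into vector" with stop-word removal and weighting looks like, here is a tiny TF-IDF sketch in plain Python. It is not the Lucene or Mahout implementation; the stop-word list, the whitespace tokenizer, the plain `tf * log(N / df)` weighting, and the three documents are all simplifying assumptions.

```python
from collections import Counter
from math import log

STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}  # tiny illustrative list


def tokenize(text):
    """Lower-case, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return the vocabulary and one TF-IDF weighted vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    # Document frequency: in how many documents does each term appear?
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)  # term frequency within this document
        vectors.append([tf[t] * log(n / df[t]) for t in vocab])
    return vocab, vectors


docs = ["the cat sat in the hat", "the dog sat", "a cat and a dog"]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
for vec in vectors:
    print([round(w, 3) for w in vec])
```

Once every document is a vector over the same vocabulary, the clustering algorithms above can treat documents as points and group them into topics.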

Page 28: Mahout and Distributed Machine Learning 101

Classifiers

NaïveBayes
RandomForests
LogisticRegression (SGD)
HiddenMarkov

Example: 20 Newsgroups
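
As a rough picture of what a Naïve Bayes text classifier does, the following sketch trains a multinomial model with add-one smoothing on four made-up sentences. The two labels loosely echo newsgroup topics, but the data is invented and is not the 20 Newsgroups corpus; the `train` and `classify` functions are illustrative and not Mahout's API.

```python
from collections import Counter, defaultdict
from math import log


def train(labeled_docs):
    """Count documents per class and words per class from (label, text) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labeled_docs:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab


def classify(model, text):
    """Return the class with the highest log-probability under the model."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_docs):
        score = log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("hockey", "goal puck ice team"),
    ("hockey", "team wins ice game"),
    ("space", "orbit launch rocket moon"),
    ("space", "rocket team launch orbit"),
]
model = train(training)
print(classify(model, "ice puck game"))
print(classify(model, "moon rocket launch"))
```

The "naïve" part is the assumption that words are independent given the class, which is what lets the score be a simple sum of per-word log probabilities.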

Page 29: Mahout and Distributed Machine Learning 101

Sidebar : Risks of Big Data Unsupervised Learning