Intro to Mahout -- DC Hadoop

DESCRIPTION

Introduction to Apache Mahout -- talk given at DC Hadoop Meetup on April 28

TRANSCRIPT

Page 1: Intro to Mahout -- DC Hadoop

Intro to Apache Mahout

Grant Ingersoll, Lucid Imagination

http://www.lucidimagination.com

Page 2: Intro to Mahout -- DC Hadoop

Anyone Here Use Machine Learning?

• Any users of:
– Google?
  • Search?
  • Priority Inbox?

– Facebook?

– Twitter?

– LinkedIn?

Page 3: Intro to Mahout -- DC Hadoop

Topics

• Background and Use Cases
• What can you do in Mahout?
• Where’s the community at?
• Resources
• K-Means in Hadoop (time permitting)

Page 4: Intro to Mahout -- DC Hadoop

Definition

• “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
– Introduction to Machine Learning by E. Alpaydin

• Subset of Artificial Intelligence
• Lots of related fields:
– Information Retrieval
– Stats
– Biology
– Linear algebra
– Many more

Page 5: Intro to Mahout -- DC Hadoop

Common Use Cases

• Recommend friends/dates/products
• Classify content into predefined groups
• Find similar content
• Find associations/patterns in actions/behaviors
• Identify key topics/summarize text
– Documents and Corpora
• Detect anomalies/fraud
• Rank search results
• Others?

Page 6: Intro to Mahout -- DC Hadoop

Apache Mahout

• An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
– http://mahout.apache.org

• Why Mahout?
– Many Open Source ML libraries either:
  • Lack Community
  • Lack Documentation and Examples
  • Lack Scalability
  • Lack the Apache License
  • Or are research-oriented

Definition: http://dictionary.reference.com/browse/mahout

Page 7: Intro to Mahout -- DC Hadoop

What does scalable mean to us?

• Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm
– Some algorithms won’t scale to massive machine clusters
– Others fit logically on a MapReduce framework like Apache Hadoop
– Still others will need different distributed programming models
– Others are already fast (SGD)

• Be pragmatic

Page 8: Intro to Mahout -- DC Hadoop

Sampling of Who uses Mahout?

https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

Page 9: Intro to Mahout -- DC Hadoop

What Can I do with Mahout Right Now?

3C + FPM + O = Mahout
(Collaborative Filtering, Clustering, Classification + Frequent Pattern Mining + Other)

Page 10: Intro to Mahout -- DC Hadoop

Collaborative Filtering

• Extensive framework for collaborative filtering (recommenders)

• Recommenders
– User based
– Item based

• Online and Offline support
– Offline can utilize Hadoop

• Many different Similarity measures
– Cosine, LLR, Tanimoto, Pearson, others (see the sketch below)
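As a rough illustration of the recommender framework (the Taste API that ships with Mahout), here is a minimal user-based recommender sketch. It assumes a hypothetical ratings.csv file of userID,itemID,preference lines; the class names are from the 0.4/0.5-era org.apache.mahout.cf.taste packages, so verify them against your release.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserRecommenderSketch {
  public static void main(String[] args) throws Exception {
    // ratings.csv: one "userID,itemID,preference" line per rating (hypothetical file)
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Pearson correlation is one of several pluggable similarity measures
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

    // Use the 10 most similar users as the neighborhood
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top-3 recommendations for user 42
    List<RecommendedItem> items = recommender.recommend(42, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}
```

Swapping in TanimotoCoefficientSimilarity or LogLikelihoodSimilarity, or switching to an item-based recommender, is a one-line change; the offline, Hadoop-based recommenders follow the same data model.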

Page 11: Intro to Mahout -- DC Hadoop

Clustering

• Document level
– Group documents based on a notion of similarity
– K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift, EigenCuts (Spectral)
– All Map/Reduce
– Distance Measures (example below)
  • Manhattan, Euclidean, others

• Topic Modeling
– Cluster words across documents to identify topics
– Latent Dirichlet Allocation (M/R)
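To make the pluggable distance measures concrete, here is a small sketch using Mahout's math Vector type with two of the measures named above. The imports reflect the mahout-core and mahout-math modules of this era and should be checked against the release you use.

```java
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.common.distance.ManhattanDistanceMeasure;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class DistanceSketch {
  public static void main(String[] args) {
    Vector a = new DenseVector(new double[] {1.0, 2.0, 3.0});
    Vector b = new DenseVector(new double[] {4.0, 2.0, 0.0});

    // The same pair of points, measured two different ways
    double euclidean = new EuclideanDistanceMeasure().distance(a, b);
    double manhattan = new ManhattanDistanceMeasure().distance(a, b);

    System.out.println("Euclidean: " + euclidean);  // sqrt(9 + 0 + 9), about 4.24
    System.out.println("Manhattan: " + manhattan);  // 3 + 0 + 3 = 6
  }
}
```

Which measure you pick changes which points count as "close", and therefore what the clusters look like.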

Page 12: Intro to Mahout -- DC Hadoop

Categorization

• Place new items into predefined categories:
– Sports, politics, entertainment
– Recommenders

• Implementations (sketch below)
– Naïve Bayes (M/R)
– Complementary Naïve Bayes (M/R)
– Decision Forests (M/R)
– Logistic Regression via SGD (sequential, but fast!)

• See Chapter 17 of Mahout in Action for the Shop It To Me use case: http://awe.sm/5FyNe
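The sequential-but-fast entry above is the SGD-based OnlineLogisticRegression trainer. The sketch below is illustrative only: the toy features, labels, and training loop are invented for the example, and the exact constructor and prior classes are assumptions to verify against your release.

```java
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class SgdSketch {
  public static void main(String[] args) {
    // 2 categories, 3 features (a constant bias plus two inputs), L1 regularization
    OnlineLogisticRegression learner = new OnlineLogisticRegression(2, 3, new L1());

    // Toy training data: label 1 when the first input exceeds the second (purely illustrative)
    double[][] examples = {{1, 0.9, 0.1}, {1, 0.8, 0.3}, {1, 0.2, 0.7}, {1, 0.1, 0.9}};
    int[] labels = {1, 1, 0, 0};
    for (int pass = 0; pass < 100; pass++) {
      for (int i = 0; i < examples.length; i++) {
        learner.train(labels[i], new DenseVector(examples[i]));
      }
    }

    // classifyFull returns a vector of per-category probabilities
    Vector scores = learner.classifyFull(new DenseVector(new double[] {1, 0.95, 0.05}));
    System.out.println("P(category 1) = " + scores.get(1));
  }
}
```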

Page 13: Intro to Mahout -- DC Hadoop

Freq. Pattern Mining

• Identify frequently co-occurring items

• Useful for:
– Query Recommendations
  • Apple -> iPhone, orange, OS X
– Related product placement
  • Basket Analysis

• Map/Reduce

http://www.amazon.com

Page 14: Intro to Mahout -- DC Hadoop

Other

• Primitive Collections!

• Collocations (M/R)

• Math library (sketch below)
– Vectors, Matrices, etc.

• Noise Reduction via Singular Value Decomp (M/R)
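A small sketch of the math library and primitive collections mentioned above. The specific classes (RandomAccessSparseVector, OpenIntDoubleHashMap) are assumptions based on the mahout-math module of this era.

```java
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.map.OpenIntDoubleHashMap;

public class MathSketch {
  public static void main(String[] args) {
    // A sparse vector with large cardinality but only a few non-zero entries
    Vector sparse = new RandomAccessSparseVector(1000000);
    sparse.set(42, 3.0);
    sparse.set(777000, 1.5);

    Vector dense = new DenseVector(new double[] {1.0, 2.0, 3.0});

    System.out.println("non-zeros: " + sparse.getNumNondefaultElements());
    System.out.println("L2 norm of dense: " + dense.norm(2));

    // Primitive collections avoid boxing int keys and double values
    OpenIntDoubleHashMap counts = new OpenIntDoubleHashMap();
    counts.put(7, counts.get(7) + 1.0);  // get() is assumed to return 0.0 for a missing key
    counts.put(7, counts.get(7) + 1.0);
    System.out.println("counts[7] = " + counts.get(7));
  }
}
```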

Page 15: Intro to Mahout -- DC Hadoop

Prepare Data from Raw Content

• Data Sources:
– Lucene integration
  • bin/mahout lucene.vector …
– Document Vectorizer
  • bin/mahout seqdirectory …
  • bin/mahout seq2sparse …
– Programmatically (see the sketch below)
  • See the Utils module in Mahout and the Iterator<Vector> classes
– Database
– File system
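For the programmatic route, seq2sparse writes Hadoop SequenceFiles of Text keys and VectorWritable values, which you can read back directly. The path below is hypothetical; the reader pattern itself uses only standard Hadoop and Mahout classes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class ReadVectorsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path: one part file from seq2sparse's tfidf-vectors output
    Path path = new Path("reuters-vectors/tfidf-vectors/part-r-00000");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      Text key = new Text();                 // document name
      VectorWritable value = new VectorWritable();
      while (reader.next(key, value)) {
        Vector v = value.get();
        System.out.println(key + " has " + v.getNumNondefaultElements() + " terms");
      }
    } finally {
      reader.close();
    }
  }
}
```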

Page 16: Intro to Mahout -- DC Hadoop

How to: Command Line

• Most algorithms have a Driver program
– $MAHOUT_HOME/bin/mahout.sh helps with most tasks

• Prepare the Data
– Different algorithms require different setup

• Run the algorithm
– Single Node
– Hadoop

• Print out the results or incorporate into application
– Several helper classes (see the sketch below):
  • LDAPrintTopics, ClusterDumper, etc.
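The helper classes can also be driven from Java rather than the command line. Here is a ClusterDumper sketch following the pattern shown in Mahout in Action; the constructor arguments and output paths are assumptions that depend on which clustering job produced them.

```java
import org.apache.hadoop.fs.Path;
import org.apache.mahout.utils.clustering.ClusterDumper;

public class DumpClustersSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical output paths from a k-means run:
    // the final cluster centroids and the (optional) clustered points
    ClusterDumper dumper = new ClusterDumper(
        new Path("output/clusters-10"),
        new Path("output/clusteredPoints"));

    // Passing null skips mapping term indexes back through a dictionary
    dumper.printClusters(null);
  }
}
```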

Page 17: Intro to Mahout -- DC Hadoop

What’s Happening Now?

• Unified Framework for Clustering and Classification

• 0.5 release on the horizon (May?)
• Working towards 1.0 release by focusing on:
– Tests, examples, documentation
– API cleanup and consistency

• Gearing up for Google Summer of Code
– New M/R work for Hidden Markov Models

Page 18: Intro to Mahout -- DC Hadoop

Summary

• Machine learning is all over the web today

• Mahout is about scalable machine learning

• Mahout has functionality for many of today’s common machine learning tasks

• Many Mahout implementations use Hadoop

Page 19: Intro to Mahout -- DC Hadoop

Resources

• http://mahout.apache.org
• http://cwiki.apache.org/MAHOUT
• {user|dev}@mahout.apache.org
• http://svn.apache.org/repos/asf/mahout/trunk
• http://hadoop.apache.org

Page 20: Intro to Mahout -- DC Hadoop

Resources

• “Mahout in Action” by Owen, Anil, Dunning, and Friedman
– http://awe.sm/5FyNe

• “Introducing Apache Mahout”
– http://www.ibm.com/developerworks/java/library/j-mahout/

• “Taming Text” by Ingersoll, Morton, and Farris
• “Programming Collective Intelligence” by Toby Segaran
• “Data Mining: Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank
• “Data-Intensive Text Processing with MapReduce” by Jimmy Lin and Chris Dyer

Page 21: Intro to Mahout -- DC Hadoop

K-Means

• Clustering Algorithm
– Nicely parallelizable! (see the sketch below)

http://en.wikipedia.org/wiki/K-means_clustering
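Before looking at the Map-Reduce version, here is a compact single-machine sketch of the algorithm itself (plain Java, no Mahout or Hadoop), with synthetic data invented for the example.

```java
import java.util.Arrays;
import java.util.Random;

public class KMeansSketch {
  // One k-means iteration: assign each point to its nearest centroid,
  // then recompute each centroid as the mean of its assigned points.
  static double[][] iterate(double[][] points, double[][] centroids) {
    int k = centroids.length, dim = centroids[0].length;
    double[][] sums = new double[k][dim];
    int[] counts = new int[k];

    for (double[] p : points) {
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < k; c++) {
        double d = 0;
        for (int j = 0; j < dim; j++) {
          double diff = p[j] - centroids[c][j];
          d += diff * diff;                 // squared Euclidean distance
        }
        if (d < bestDist) { bestDist = d; best = c; }
      }
      counts[best]++;
      for (int j = 0; j < dim; j++) sums[best][j] += p[j];
    }

    double[][] next = new double[k][dim];
    for (int c = 0; c < k; c++) {
      for (int j = 0; j < dim; j++) {
        next[c][j] = counts[c] == 0 ? centroids[c][j] : sums[c][j] / counts[c];
      }
    }
    return next;
  }

  public static void main(String[] args) {
    Random rnd = new Random(42);
    double[][] points = new double[200][2];
    for (int i = 0; i < points.length; i++) {
      // Two synthetic blobs around (0,0) and (5,5)
      double base = i < 100 ? 0.0 : 5.0;
      points[i][0] = base + rnd.nextGaussian();
      points[i][1] = base + rnd.nextGaussian();
    }
    double[][] centroids = { points[0], points[150] };  // naive seeding
    for (int iter = 0; iter < 10; iter++) {
      centroids = iterate(points, centroids);
    }
    System.out.println(Arrays.deepToString(centroids));
  }
}
```

The assignment loop treats every point independently, which is exactly the piece the KMeansMapper parallelizes in the slides that follow.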

Page 22: Intro to Mahout -- DC Hadoop

K-Means in Map-Reduce

• Input:
– Mahout Vectors representing the original content
– Either:
  • A predefined set of initial centroids (can be from Canopy)
  • --k – the number of clusters to produce

• Iterate
– Do the centroid calculation (more in a moment)

• Clustering Step (optional)

• Output
– Centroids (as Mahout Vectors)
– Points for each Centroid (if the Clustering Step was taken)

Page 23: Intro to Mahout -- DC Hadoop

Map-Reduce Iteration

• Each iteration calculates the Centroids using:
– KMeansMapper
– KMeansCombiner
– KMeansReducer

• Clustering Step
– Calculate the points for each Centroid using:
  • KMeansClusterMapper

Page 24: Intro to Mahout -- DC Hadoop

KMeansMapper

• During Setup:
– Load the initial Centroids (or the Centroids from the last iteration)

• Map Phase
– For each input vector
  • Calculate its distance from each Centroid and output the closest one (sketch below)

• Distance Measures are pluggable
– Manhattan, Euclidean, Squared Euclidean, Cosine, others
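The sketch below is not Mahout's actual KMeansMapper (which loads Cluster objects and emits richer records), but a stripped-down illustration of the same role, written against the standard Hadoop Mapper API and Mahout's Vector and DistanceMeasure types.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Illustrative only: the shape of the work matches the slide, not the real class.
public class NearestCentroidMapper
    extends Mapper<WritableComparable<?>, VectorWritable, IntWritable, VectorWritable> {

  private final List<Vector> centroids = new ArrayList<Vector>();
  private final DistanceMeasure measure = new EuclideanDistanceMeasure();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Load the current centroids, e.g. from a SequenceFile side path
    // configured by the driver (loading code omitted in this sketch).
  }

  @Override
  protected void map(WritableComparable<?> key, VectorWritable value, Context context)
      throws IOException, InterruptedException {
    Vector point = value.get();
    int nearest = 0;
    double best = Double.MAX_VALUE;
    for (int i = 0; i < centroids.size(); i++) {
      double d = measure.distance(centroids.get(i), point);
      if (d < best) { best = d; nearest = i; }
    }
    // Key the point by its nearest centroid so the combiner/reducer
    // can sum all the points that belong to each cluster.
    context.write(new IntWritable(nearest), value);
  }
}
```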

Page 25: Intro to Mahout -- DC Hadoop

KMeansReducer

• Setup:
– Load up clusters
– Convergence information
– Partial sums from KMeansCombiner (more in a moment)

• Reduce Phase
– Sum all the vectors in the cluster to produce a new Centroid (sketch below)
– Check for Convergence

• Output cluster
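Again as an illustration rather than the real KMeansReducer (which works with the partial sums and counts produced by KMeansCombiner so that combining and reducing compose correctly), the reducer's job boils down to averaging the vectors keyed to each cluster and checking how far the centroid moved.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Illustrative only: simplified single-pass reducer for the mapper sketch above.
public class CentroidReducer
    extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  private Vector[] previousCentroids;          // would be loaded in setup() in a real job
  private final double convergenceDelta = 0.001;

  @Override
  protected void reduce(IntWritable clusterId, Iterable<VectorWritable> points, Context context)
      throws IOException, InterruptedException {
    Vector sum = null;
    int count = 0;
    for (VectorWritable vw : points) {
      Vector v = vw.get();
      if (sum == null) {
        sum = v.like();                      // empty vector of the same cardinality
      }
      sum = sum.plus(v);
      count++;
    }
    Vector newCentroid = sum.divide(count);

    // Converged if the centroid barely moved since the last iteration;
    // signaling via a counter lets the driver decide when to stop iterating.
    if (previousCentroids != null) {
      double moved = new EuclideanDistanceMeasure()
          .distance(previousCentroids[clusterId.get()], newCentroid);
      if (moved < convergenceDelta) {
        context.getCounter("kmeans", "converged").increment(1);
      }
    }

    context.write(clusterId, new VectorWritable(newCentroid));
  }
}
```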

Page 26: Intro to Mahout -- DC Hadoop

KMeansCombiner

• Just like KMeansReducer, but it only produces a partial sum for each cluster, based on the data local to the Mapper

Page 27: Intro to Mahout -- DC Hadoop

KMeansClusterMapper

• Some applications only care about what the Centroids are, so this step is optional

• Setup:
– Load up the clusters and the DistanceMeasure used

• Map Phase
– Calculate which Cluster the point belongs to
– Output <ClusterId, Vector>